How to compare a specific part of a field between two files

I am struggling with the following task of comparing parts of a field ($ 3) between two files with fields separated by tabs. The files correspond to other $ 1-2 fields for the string, but $ 3 is something else entirely. I'm only interested in one part of $ 3, the numerical value for AF . All subfields (?) In $ 3 are separated by semi-colonies, but as you can see, the AF values ​​are not always in position (sometimes # 2, sometimes # 3). I would like to highlight the lines where the values ​​for AF in the third field differ between files.

For example, here's an example file1:

dmel_mitochondrion_genome       18984   AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19066   AB=0.684;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19074   AB=0.321;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19212   AC=8;AF=1.00;AN=8;DP=382;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19285   AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19384   AC=8;AF=1.00;AN=8;DP=400;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19395   AC=8;AF=1.00;AN=8;DP=398;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19461   AB=0.524;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19472   AB=0.527;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19475   AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$

      

and example file2:

dmel_mitochondrion_genome       18984   AB=0.730;AC=4;**AF=1.00**;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19066   AB=0.742;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19074   AB=0.345;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19212   AC=8;AF=1.00;AN=8;BaseQRankSum=1.722;DP=$
dmel_mitochondrion_genome       19285   AC=8;AF=0.50;AN=8;BaseQRankSum=1.721;DP=$
dmel_mitochondrion_genome       19384   AC=8;AF=1.00;AN=8;BaseQRankSum=1.458;DP=$
dmel_mitochondrion_genome       19395   AC=8;AF=1.00;AN=8;DP=391;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19461   AB=0.510;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19472   AB=0.526;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19475   AC=8;AF=0.50;AN=8;BaseQRankSum=-1.732;DP$

      

The output I would like to get is the following lines from file1:

dmel_mitochondrion_genome       18984   AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19285   AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19475   AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$

      

or even something like this:

dmel_mitochondrion_genome       18984   AF=0.50  
dmel_mitochondrion_genome       19285   AF=1.00  
dmel_mitochondrion_genome       19475   AF=1.00

      

I tried to use awk, but couldn't figure out how to compare parts of fields, not whole fields. I finally figured out how to use regular expressions to find the value for AF in each line from one file, but I don't know how to grab the value in order to compare it to another value from another file. Any help is greatly appreciated!

+3


source to share


6 answers


You can use an AWK array as a hash table to store the first occurrence of the AF value , and then compare it with the following prefix:

BEGIN { store[0] = 0 }
{
    key = $1 "-" $2
    match($3, /AF=[^;*]+/)
    val = substr($3, RSTART+3, RLENGTH-3)
    if ((key in store) && (store[key] != val))
        print $1,$2,"AF=" store[key]
    else
        store[key] = val
}

      



However, the filter-then-diff solution seems to be more elegant because it has the potential to consume a lot of memory.

+2


source


The following command should provide you with each file in the format you want. Then you could just do a diff on them ...



awk '{s=$0; split(s, a, "AF="); split(a[1], a1); split(a[2], a2, ";"); print a1[1] " " a1[2] " AF=" a2[1]}'

      

+4


source


A slightly different approach

awk '{subList=$3; 
 sub(/.*AF=/, "AF=", subList); sub(/;.*$/, "", subList)
 print $1 "\t" $2 "\t" subList}'  awkTest_20120409_1.txt > awkTest_20120409_1_cln.txt

awk '{subList=$3; 
 sub(/.*AF=/, "AF=", subList); sub(/;.*$/, "", subList)
 print $1 "\t" $2 "\t" subList}'  awkTest_20120409_2.txt > awkTest_20120409_2_cln.txt

diff awkTest_20120409_1_cln.txt awkTest_20120409_2_cln.txt | grep '^<' | sed 's/< //'

      

** output **

dmel_mitochondrion_genome       18984   AF=0.50
dmel_mitochondrion_genome       19285   AF=1.00
dmel_mitochondrion_genome       19475   AF=1.00

      

Of course, you need to substitute your filenames for both input and output as well as targets diff

.

Hope this helps.

+2


source


TXR: Made for molecular biologists working with sad text files. And other people too.

Since files can be huge, we avoid storing anything in memory from one pair of lines to the next (provided :vars ()

in the collect clause).

A bit of a trick is used here to make the two files look like one stream with alternating lines. We can then match the pattern along this stream as if it were a single file.

Variables (column 3 stuff) are parsed in Lisp association lists, so we can use it assoc

to look up the value of the field of interest. It is compared line by line; if necessary, conversion to a numeric value can be done so that the values ​​0.5 and 0.50 are considered the same.

@(next :args)
@(cases)
@file1
@file2
@(or)
@  (throw error "two file names needed")
@(end)
@;
@; functional programming trick: make a bottomless lazy list which returns
@; strings, which are the lines from files f1 and f2, alternating.
@;
@(do (defun make-interleaved-lazy-stream (f1 f2)
       (let ((streams '#(,(open-file f1 "r") ,(open-file f2 "r"))))
         (let ((toggle 0) line)
           (gen (prog1
                  (set line (get-line [streams toggle]))
                  (set toggle (- 1 toggle)))
                line)))))
@(define parse-line (gen id alist))
@gen @id @(coll)@{var /[A-Z]+/}=@val;@(end)
@  (bind alist @[mapcar cons var val])
@(end)
@(next :list @(make-interleaved-lazy-stream file1 file2))
@(collect :vars ())
@  (cases)
@    (parse-line gen id alist1)
@    (parse-line gen id alist2)
@  (or)
@    (throw error `assumption violated: mismatching lines between @file1 and @file2`)
@  (end)
@  (do (let ((AF1 (cdr (assoc "AF" alist1)))
             (AF2 (cdr (assoc "AF" alist2))))
         (if (not (equal AF1 AF2))
           (put-string `@{gen 30} @{id -6}   AF1=@AF1\n`))))
@(end)

      

Run:

$ txr gendiff.txr file1 file2
dmel_mitochondrion_genome       18984   AF1=0.50
dmel_mitochondrion_genome       19285   AF1=1.00
dmel_mitochondrion_genome       19475   AF1=1.00

      

+1


source


This can help -

awk -F'[; =]' '
NR==FNR{ for (i=1;i<=NF;i++) if ($i=="AF") b[++x]=$(i+1); c[x]=$0; next } 
{for (j=1;j<=NF;j++) if ($j=="AF") d[++y]=$(j+1)} 
END {for (z=1;z<=length(b);z++) if (b[z]!=d[z]) print c[z]}' f1 f2

      

File 1:

[jaypal:~/Temp] cat f1
dmel_mitochondrion_genome       18984   AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19066   AB=0.684;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19074   AB=0.321;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19212   AC=8;AF=1.00;AN=8;DP=382;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19285   AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19384   AC=8;AF=1.00;AN=8;DP=400;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19395   AC=8;AF=1.00;AN=8;DP=398;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19461   AB=0.524;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19472   AB=0.527;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19475   AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$

      

File 2:

[jaypal:~/Temp] cat f2
dmel_mitochondrion_genome       18984   AB=0.730;AC=4;AF=1.00;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19066   AB=0.742;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19074   AB=0.345;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19212   AC=8;AF=1.00;AN=8;BaseQRankSum=1.722;DP=$
dmel_mitochondrion_genome       19285   AC=8;AF=0.50;AN=8;BaseQRankSum=1.721;DP=$
dmel_mitochondrion_genome       19384   AC=8;AF=1.00;AN=8;BaseQRankSum=1.458;DP=$
dmel_mitochondrion_genome       19395   AC=8;AF=1.00;AN=8;DP=391;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19461   AB=0.510;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19472   AB=0.526;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19475   AC=8;AF=0.50;AN=8;BaseQRankSum=-1.732;DP$

      

Test:

[jaypal:~/Temp] awk -F'[; =]' '
NR==FNR{ for (i=1;i<=NF;i++) if ($i=="AF") b[++x]=$(i+1); c[x]=$0; next } 
{for (j=1;j<=NF;j++) if ($j=="AF") d[++y]=$(j+1)} 
END {for (z=1;z<=length(b);z++) if (b[z]!=d[z]) print c[z]}' f1 f2
dmel_mitochondrion_genome       18984   AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome       19285   AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome       19475   AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$

      

0


source


This might work for you:

awk -vfile2=file2 -vOFS='\t' '{sub(/.*AF=[^0-9.-]*/,"",$3);sub(/[^0-9.-]+.*/,"",$3);getline line <file2;sub(/.*AF=[^0-9.-]*/,"",line);sub(/[^0-9.-]+.*/,"",line)};$3!=line{$3="AF="$3;print}' file1

      

Since file1

and file2

only correspond to the values AF

:

  • Read the line file1

  • Decrease $3

    to AF

    value
  • Read a line file2

    into a variableline

  • Decrease line

    to AF

    value
  • Compare $3

    with line

    and output $0

    from file1

    (after adding $3

    c AF=

    ) if they don't match.
0


source







All Articles