How to compare a specific part of a field between two files
I am struggling with the following task of comparing parts of a field ($ 3) between two files with fields separated by tabs. The files correspond to other $ 1-2 fields for the string, but $ 3 is something else entirely. I'm only interested in one part of $ 3, the numerical value for AF . All subfields (?) In $ 3 are separated by semi-colonies, but as you can see, the AF values are not always in position (sometimes # 2, sometimes # 3). I would like to highlight the lines where the values for AF in the third field differ between files.
For example, here's an example file1:
dmel_mitochondrion_genome 18984 AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19066 AB=0.684;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19074 AB=0.321;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19212 AC=8;AF=1.00;AN=8;DP=382;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19285 AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19384 AC=8;AF=1.00;AN=8;DP=400;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19395 AC=8;AF=1.00;AN=8;DP=398;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19461 AB=0.524;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19472 AB=0.527;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19475 AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$
and example file2:
dmel_mitochondrion_genome 18984 AB=0.730;AC=4;**AF=1.00**;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19066 AB=0.742;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19074 AB=0.345;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19212 AC=8;AF=1.00;AN=8;BaseQRankSum=1.722;DP=$
dmel_mitochondrion_genome 19285 AC=8;AF=0.50;AN=8;BaseQRankSum=1.721;DP=$
dmel_mitochondrion_genome 19384 AC=8;AF=1.00;AN=8;BaseQRankSum=1.458;DP=$
dmel_mitochondrion_genome 19395 AC=8;AF=1.00;AN=8;DP=391;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19461 AB=0.510;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19472 AB=0.526;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19475 AC=8;AF=0.50;AN=8;BaseQRankSum=-1.732;DP$
The output I would like to get is the following lines from file1:
dmel_mitochondrion_genome 18984 AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19285 AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19475 AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$
or even something like this:
dmel_mitochondrion_genome 18984 AF=0.50
dmel_mitochondrion_genome 19285 AF=1.00
dmel_mitochondrion_genome 19475 AF=1.00
I tried to use awk, but couldn't figure out how to compare parts of fields, not whole fields. I finally figured out how to use regular expressions to find the value for AF in each line from one file, but I don't know how to grab the value in order to compare it to another value from another file. Any help is greatly appreciated!
You can use an AWK array as a hash table to store the first occurrence of the AF value , and then compare it with the following prefix:
BEGIN { store[0] = 0 }
{
key = $1 "-" $2
match($3, /AF=[^;*]+/)
val = substr($3, RSTART+3, RLENGTH-3)
if ((key in store) && (store[key] != val))
print $1,$2,"AF=" store[key]
else
store[key] = val
}
However, the filter-then-diff solution seems to be more elegant because it has the potential to consume a lot of memory.
The following command should provide you with each file in the format you want. Then you could just do a diff on them ...
awk '{s=$0; split(s, a, "AF="); split(a[1], a1); split(a[2], a2, ";"); print a1[1] " " a1[2] " AF=" a2[1]}'
A slightly different approach
awk '{subList=$3;
sub(/.*AF=/, "AF=", subList); sub(/;.*$/, "", subList)
print $1 "\t" $2 "\t" subList}' awkTest_20120409_1.txt > awkTest_20120409_1_cln.txt
awk '{subList=$3;
sub(/.*AF=/, "AF=", subList); sub(/;.*$/, "", subList)
print $1 "\t" $2 "\t" subList}' awkTest_20120409_2.txt > awkTest_20120409_2_cln.txt
diff awkTest_20120409_1_cln.txt awkTest_20120409_2_cln.txt | grep '^<' | sed 's/< //'
** output **
dmel_mitochondrion_genome 18984 AF=0.50
dmel_mitochondrion_genome 19285 AF=1.00
dmel_mitochondrion_genome 19475 AF=1.00
Of course, you need to substitute your filenames for both input and output as well as targets diff
.
Hope this helps.
TXR: Made for molecular biologists working with sad text files. And other people too.
Since files can be huge, we avoid storing anything in memory from one pair of lines to the next (provided :vars ()
in the collect clause).
A bit of a trick is used here to make the two files look like one stream with alternating lines. We can then match the pattern along this stream as if it were a single file.
Variables (column 3 stuff) are parsed in Lisp association lists, so we can use it assoc
to look up the value of the field of interest. It is compared line by line; if necessary, conversion to a numeric value can be done so that the values 0.5 and 0.50 are considered the same.
@(next :args)
@(cases)
@file1
@file2
@(or)
@ (throw error "two file names needed")
@(end)
@;
@; functional programming trick: make a bottomless lazy list which returns
@; strings, which are the lines from files f1 and f2, alternating.
@;
@(do (defun make-interleaved-lazy-stream (f1 f2)
(let ((streams '#(,(open-file f1 "r") ,(open-file f2 "r"))))
(let ((toggle 0) line)
(gen (prog1
(set line (get-line [streams toggle]))
(set toggle (- 1 toggle)))
line)))))
@(define parse-line (gen id alist))
@gen @id @(coll)@{var /[A-Z]+/}=@val;@(end)
@ (bind alist @[mapcar cons var val])
@(end)
@(next :list @(make-interleaved-lazy-stream file1 file2))
@(collect :vars ())
@ (cases)
@ (parse-line gen id alist1)
@ (parse-line gen id alist2)
@ (or)
@ (throw error `assumption violated: mismatching lines between @file1 and @file2`)
@ (end)
@ (do (let ((AF1 (cdr (assoc "AF" alist1)))
(AF2 (cdr (assoc "AF" alist2))))
(if (not (equal AF1 AF2))
(put-string `@{gen 30} @{id -6} AF1=@AF1\n`))))
@(end)
Run:
$ txr gendiff.txr file1 file2
dmel_mitochondrion_genome 18984 AF1=0.50
dmel_mitochondrion_genome 19285 AF1=1.00
dmel_mitochondrion_genome 19475 AF1=1.00
This can help -
awk -F'[; =]' '
NR==FNR{ for (i=1;i<=NF;i++) if ($i=="AF") b[++x]=$(i+1); c[x]=$0; next }
{for (j=1;j<=NF;j++) if ($j=="AF") d[++y]=$(j+1)}
END {for (z=1;z<=length(b);z++) if (b[z]!=d[z]) print c[z]}' f1 f2
File 1:
[jaypal:~/Temp] cat f1
dmel_mitochondrion_genome 18984 AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19066 AB=0.684;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19074 AB=0.321;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19212 AC=8;AF=1.00;AN=8;DP=382;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19285 AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19384 AC=8;AF=1.00;AN=8;DP=400;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19395 AC=8;AF=1.00;AN=8;DP=398;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19461 AB=0.524;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19472 AB=0.527;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19475 AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$
File 2:
[jaypal:~/Temp] cat f2
dmel_mitochondrion_genome 18984 AB=0.730;AC=4;AF=1.00;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19066 AB=0.742;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19074 AB=0.345;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19212 AC=8;AF=1.00;AN=8;BaseQRankSum=1.722;DP=$
dmel_mitochondrion_genome 19285 AC=8;AF=0.50;AN=8;BaseQRankSum=1.721;DP=$
dmel_mitochondrion_genome 19384 AC=8;AF=1.00;AN=8;BaseQRankSum=1.458;DP=$
dmel_mitochondrion_genome 19395 AC=8;AF=1.00;AN=8;DP=391;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19461 AB=0.510;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19472 AB=0.526;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19475 AC=8;AF=0.50;AN=8;BaseQRankSum=-1.732;DP$
Test:
[jaypal:~/Temp] awk -F'[; =]' '
NR==FNR{ for (i=1;i<=NF;i++) if ($i=="AF") b[++x]=$(i+1); c[x]=$0; next }
{for (j=1;j<=NF;j++) if ($j=="AF") d[++y]=$(j+1)}
END {for (z=1;z<=length(b);z++) if (b[z]!=d[z]) print c[z]}' f1 f2
dmel_mitochondrion_genome 18984 AB=0.743;AC=4;AF=0.50;AN=8;BaseQRankSum=$
dmel_mitochondrion_genome 19285 AC=8;AF=1.00;AN=8;DP=342;DS;Dels=0.00;FS$
dmel_mitochondrion_genome 19475 AC=8;AF=1.00;AN=8;BaseQRankSum=0.936;DP=$
This might work for you:
awk -vfile2=file2 -vOFS='\t' '{sub(/.*AF=[^0-9.-]*/,"",$3);sub(/[^0-9.-]+.*/,"",$3);getline line <file2;sub(/.*AF=[^0-9.-]*/,"",line);sub(/[^0-9.-]+.*/,"",line)};$3!=line{$3="AF="$3;print}' file1
Since file1
and file2
only correspond to the values AF
:
- Read the line
file1
- Decrease
$3
toAF
value - Read a line
file2
into a variableline
- Decrease
line
toAF
value - Compare
$3
withline
and output$0
fromfile1
(after adding$3
cAF=
) if they don't match.