Can anyone walk through this awk code for merging multiple files?
I am using awk to combine multiple (> 3) files and I want to keep the headers. I found a previous post that does exactly what I need, but I am not quite sure what is going on. I was hoping that someone could walk through me so I could learn from it! (I tried commenting on the original post but didn't have enough reputation)
This code
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*
converts input files as desired. See Sample Tables below.
Input files:
file1.txt:
id value1
a 10
b 30
c 50
file2.txt:
id value2
a 90
b 30
c 20
file3.txt:
id value3
a 0
b 1
c 25
desired output
merge.txt:
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
Here's the code again
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f* > merge.txt
I find it difficult to understand the first part of the code {a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}
, but understand the loop in the second part of the code.
I think the first part of the code creates an array. The code goes through and checks id
if the records in the first column match, and if there is a match, add the second column ( $2
) value
and print the entire record ( $0
).
But ... I don't understand the initial syntax. When is it established that the first column is the id
same in all three files and only to add the second column?
source to share
This code is wrong and unnecessarily complicated, use this instead:
$ awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
print the output to the -t column for alignment if you like:
$ awk 'NR==FNR{a[NR]=$0;next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 | column -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
If you need to undo id
(for example because they differ across files) this would be:
$ awk '
BEGIN { OFS="\t" }
!($1 in a) { ids[++numIds]=$1 }
{ a[$1][ARGIND]=$2 }
END {
for (i=1;i<=numIds;i++) {
id = ids[i]
printf "%s%s", id, OFS
for (j=1;j<=ARGIND;j++) {
printf "%s%s", a[id][j], (j<ARGIND ? OFS : ORS)
}
}
}
' file1 file2 file3 | column -s$'\t' -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 25
x 20
The last script used GNU awk for multidimensional arrays and just c
changed to x
in input file2 to test it.
Feel free to ask if you have any questions, but I THINK the code is pretty clear.
source to share
First the data:
file1 file2 file3
NR FNR $1 $2 NR FNR $1 $2 NR FNR $1 $2
================ ================ ================
1 1 id value1 5 1 id value2 9 1 id value3
2 2 a 10 6 2 a 90 10 2 a 0
3 3 b 30 7 3 b 30 11 3 b 1
4 4 c 50 8 4 c 20 12 4 c 25
The first part: a[FNR]=( (a[FNR]) ? a[FNR]FS$2 : $0 )
can be written as:
if(a[FNR]=="") # actually if(a[FNR]=="" || a[FNR]==0)
a[FNR]=$0 # a[FNR] is "id value1" when NR==1
else
a[FNR]=a[FNR] FS $2 # a[FNR]="id value1" FS "value2" when NR==5
Each file has 4 entries, i.e. FNR==4
in the last entry of every file, especially the last file, since the value FNR
remains after the last file has been processed:
END { # after hashing all record in all files
for(i=1;i<=FNR;i++) # i=1, 2, 3, 4
print a[i] # print "id value1 value value3" etc.
}
source to share
James explained awk logic pretty well in his answer .
If you are looking for an alternative, this is a solution based on paste
:
paste file1 file2 file3 | awk '{print $1, $2, $4, $6}' OFS='\t'
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
source to share
FNR is the number of records relative to the current input file. So the line number in file1, file2, etc. http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/?ref=binfind.com/web
What? is a ternary operator and says if there is already something in [FNR], then add $ 2 of the current record to what is there, otherwise it is zero so keep the entire record (ie $ 0).
Pseudocode that might help explain things:
if a[FNR] != ""
a[FNR] = a[FNR] : FS : $2
else
a[FNR] = $0
You can see that a, b, c from each entry after deleting the first file - there could be x, y, z and this program doesn't care. It takes the second field and adds to [2], [3], etc.
source to share
You can use awk
with pr
for this:
$ pr -mts$'\t' f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
(These are tabs between columns)
Or use the paste
same:
$ paste f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
source to share