Can anyone walk through this awk code for merging multiple files?

I am using awk to combine multiple (> 3) files and I want to keep the headers. I found a previous post that does exactly what I need, but I am not quite sure what is going on. I was hoping that someone could walk through me so I could learn from it! (I tried commenting on the original post but didn't have enough reputation)

This code

awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*

      

converts input files as desired. See Sample Tables below.

Input files:

file1.txt:

id    value1
a     10
b     30
c     50

      

file2.txt:

id    value2
a     90
b     30
c     20

      

file3.txt:

id    value3
a     0
b     1
c     25

      

desired output

merge.txt:

id    value1  value2  value3
a     10      90      0
b     30      30      1
c     50      20      25

      

Here's the code again

awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f* > merge.txt

      

I find it difficult to understand the first part of the code {a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}

, but understand the loop in the second part of the code.

I think the first part of the code creates an array. The code goes through and checks id

if the records in the first column match, and if there is a match, add the second column ( $2

) value

and print the entire record ( $0

).

But ... I don't understand the initial syntax. When is it established that the first column is the id

same in all three files and only to add the second column?

+3


source to share


5 answers


This code is wrong and unnecessarily complicated, use this instead:

$ awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
id    value1 value2 value3
a     10 90 0
b     30 30 1
c     50 20 25

      

print the output to the -t column for alignment if you like:

$ awk 'NR==FNR{a[NR]=$0;next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 | column -t
id  value1  value2  value3
a   10      90      0
b   30      30      1
c   50      20      25

      



If you need to undo id

(for example because they differ across files) this would be:

$ awk '
    BEGIN { OFS="\t" }
    !($1 in a) { ids[++numIds]=$1 }
    { a[$1][ARGIND]=$2 }
    END {
        for (i=1;i<=numIds;i++) {
            id = ids[i]
            printf "%s%s", id, OFS
            for (j=1;j<=ARGIND;j++) {
                printf "%s%s", a[id][j], (j<ARGIND ? OFS : ORS)
            }
        }
    }
' file1 file2 file3 | column -s$'\t' -t
id  value1  value2  value3
a   10      90      0
b   30      30      1
c   50              25
x           20

      

The last script used GNU awk for multidimensional arrays and just c

changed to x

in input file2 to test it.

Feel free to ask if you have any questions, but I THINK the code is pretty clear.

+4


source


First the data:

file1              file2              file3

NR FNR $1 $2       NR FNR $1 $2       NR FNR $1 $2
================   ================   ================
1  1   id value1   5  1   id value2   9  1   id value3
2  2   a  10       6  2   a  90       10 2   a  0
3  3   b  30       7  3   b  30       11 3   b  1
4  4   c  50       8  4   c  20       12 4   c  25

      

The first part: a[FNR]=( (a[FNR]) ? a[FNR]FS$2 : $0 )

can be written as:



if(a[FNR]=="")           # actually if(a[FNR]=="" || a[FNR]==0)
    a[FNR]=$0            # a[FNR] is "id value1" when NR==1
else
    a[FNR]=a[FNR] FS $2  # a[FNR]="id value1" FS "value2" when NR==5

      

Each file has 4 entries, i.e. FNR==4

in the last entry of every file, especially the last file, since the value FNR

remains after the last file has been processed:

END {                    # after hashing all record in all files
    for(i=1;i<=FNR;i++)  # i=1, 2, 3, 4
        print a[i]       # print "id value1 value value3" etc.
}

      

+4


source


James explained awk logic pretty well in his answer .

If you are looking for an alternative, this is a solution based on paste

:

paste file1 file2 file3 | awk '{print $1, $2, $4, $6}' OFS='\t'

id  value1  value2  value3
a   10  90  0
b   30  30  1
c   50  20  25

      

+2


source


FNR is the number of records relative to the current input file. So the line number in file1, file2, etc. http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/?ref=binfind.com/web

What? is a ternary operator and says if there is already something in [FNR], then add $ 2 of the current record to what is there, otherwise it is zero so keep the entire record (ie $ 0).

Pseudocode that might help explain things:

if a[FNR] != ""
  a[FNR] = a[FNR] : FS : $2
else
  a[FNR] = $0

      

You can see that a, b, c from each entry after deleting the first file - there could be x, y, z and this program doesn't care. It takes the second field and adds to [2], [3], etc.

+1


source


You can use awk

with pr

for this:

$ pr -mts$'\t' f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3) 
id    value1    value2  value3
a     10    90  0
b     30    30  1
c     50    20  25

      

(These are tabs between columns)

Or use the paste

same:

$ paste f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id    value1    value2  value3
a     10    90  0
b     30    30  1
c     50    20  25

      

+1


source







All Articles