Awk: how to extract from file A the columns with indexes specified in file B?

Question

Awk: how to extract from file A the columns with indexes specified in file B?

My file A has the following format, hundreds of thousands of columns and thousands of rows:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000 ....
1001->100101    DOSE    1.933   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....

Although my file B (thousands of entries) has the following format:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,803428,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,803442,29649737,12
20:29649765,T,G,0.99963,0.00037,0.99963,0.13918,,803443,29649765,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,803456,29650462,12

What I want to do is extract from file A the first two columns plus the columns specified in column "key" from file B. I spent some time trying to figure out how to do this using awk, but it was taking too long for which in theory shouldn't be too difficult.

My specific question is, once I fetch the "key" entries from file B, how do I pass those values to the awk command for file A?

Manual command for first 3 columns in file B:

awk '{print $1, $2, $803428, $803442, $803442}' fileA > output.txt

edits:

There is no common column between file A and file B . The values in the column 803 428 in the A file will present the results for the SNP "20: 29649365" from the File Bed and .

If the command to run is:

awk '{print $1, $2, $3, $4, $5}' fileA

Result:

1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1002->100201 DOSE 1.933 1.999 2.000 2.000 2.000
1003->100301 DOSE 2.000 2.000 2.000 2.000 2.000
1004->100401 DOSE 2.000 2.000 2.000 2.000 2.000

Specific examples I use to test awk commands given as solutions.

FiLea:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000
1001->100101    DOSE    1.933   2.000   1.500   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000   2.000   1.622
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000

FILEB:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,3,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,4,29649737,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,6,29650462,12

Desired output (columns 1, 2, 3, 4, 6 are the last 3 from the key column of fileB):

1000->100001    DOSE    2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000
1001->100101    DOSE    1.933   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000

+3

unix bash awk

Alexandra Dec 18 14 at 20:21

source to share

3 answers

if you want to read the column index from fileB (column key

, not pos

)) and print those columns from fileA, also preserve the col index order in fileB, you can try:

awk 'NR==FNR{c[NR]=$(NF-2);n=NR;next}
{printf "%s %s",$1,$2;
for(i=2;i<=n;i++)printf " %s",$c[i];print ""}' FS=',' fileB FS=' ' fileA

+3

Kent Dec 18 14 at 20:40

source to share

Alternative approach: Apply an awk script to file B, which generates another awk script to be applied to the file.

#!/bin/bash
awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB > cols.awk
awk -f cols.awk fileA

Or as a one-liner, with no intermediate file:

#!/bin/bash
awk "$(awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB)" fileA

+1

Ruud helderman Dec 18 14 at 22:53

source to share

amaurea · Accepted Answer · 2014-12-18T20:45:01+0000

Ok, here's an updated version that should reproduce your result.

awk 'ARGIND==2&&!/SNP/{cols[++i]=$9}ARGIND==4{printf("%s %s",$1,$2);
      for(j=1;j<=i;j++)printf(" %s%s",$cols[j],j<i?"":"\n");}' FS=',' B.txt FS='[ \t]+' A.txt

It complicated a bit the different field separators in the two files and the header in the B file that needs to be ignored. But the key here is that it $

accepts a variable, not just a constant integer. This approach stores a list of columns in an array cols

and iterates through them for each row in A.

Switching field separators between files is done with dummy files FS=','

and 'FS =' [\ t] + '', which means our actual files have arguments 2 and 4.

Awk: how to extract from file A the columns with indexes specified in file B?

More articles: