Awk: how to extract from file A the columns with indexes specified in file B?
My file A has the following format, hundreds of thousands of columns and thousands of rows:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000 ....
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000 ....
1001->100101 DOSE 1.933 2.000 2.000 2.000 2.000 ....
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000 ....
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000 ....
Although my file B (thousands of entries) has the following format:
SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,803428,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,803442,29649737,12
20:29649765,T,G,0.99963,0.00037,0.99963,0.13918,,803443,29649765,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,803456,29650462,12
What I want to do is extract from file A the first two columns plus the columns specified in column "key" from file B. I spent some time trying to figure out how to do this using awk, but it was taking too long for which in theory shouldn't be too difficult.
My specific question is, once I fetch the "key" entries from file B, how do I pass those values ββto the awk command for file A?
Manual command for first 3 columns in file B:
awk '{print $1, $2, $803428, $803442, $803442}' fileA > output.txt
edits:
There is no common column between file A and file B . The values in the column 803 428 in the A file will present the results for the SNP "20: 29649365" from the File Bed and .
If the command to run is:
awk '{print $1, $2, $3, $4, $5}' fileA
Result:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1002->100201 DOSE 1.933 1.999 2.000 2.000 2.000
1003->100301 DOSE 2.000 2.000 2.000 2.000 2.000
1004->100401 DOSE 2.000 2.000 2.000 2.000 2.000
Specific examples I use to test awk commands given as solutions.
FiLea:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1001->100101 DOSE 1.933 2.000 1.500 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000 2.000 1.622
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000
FILEB:
SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,3,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,4,29649737,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,6,29650462,12
Desired output (columns 1, 2, 3, 4, 6 are the last 3 from the key column of fileB):
1000->100001 DOSE 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000
1001->100101 DOSE 1.933 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000
source to share
Ok, here's an updated version that should reproduce your result.
awk 'ARGIND==2&&!/SNP/{cols[++i]=$9}ARGIND==4{printf("%s %s",$1,$2);
for(j=1;j<=i;j++)printf(" %s%s",$cols[j],j<i?"":"\n");}' FS=',' B.txt FS='[ \t]+' A.txt
It complicated a bit the different field separators in the two files and the header in the B file that needs to be ignored. But the key here is that it $
accepts a variable, not just a constant integer. This approach stores a list of columns in an array cols
and iterates through them for each row in A.
Switching field separators between files is done with dummy files FS=','
and 'FS =' [\ t] + '', which means our actual files have arguments 2 and 4.
source to share
if you want to read the column index from fileB (column key
, not pos
)) and print those columns from fileA, also preserve the col index order in fileB, you can try:
awk 'NR==FNR{c[NR]=$(NF-2);n=NR;next}
{printf "%s %s",$1,$2;
for(i=2;i<=n;i++)printf " %s",$c[i];print ""}' FS=',' fileB FS=' ' fileA
source to share
Alternative approach: Apply an awk script to file B, which generates another awk script to be applied to the file.
#!/bin/bash
awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB > cols.awk
awk -f cols.awk fileA
Or as a one-liner, with no intermediate file:
#!/bin/bash
awk "$(awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB)" fileA
source to share