Parse a file and use some of its fields as variables, named after the header, in bash
I have a file whose first line contains a series of field names separated by tabs (\t). I am trying to step through the lines and use some of the fields as variables for a program. The code I have so far is the following:
{
A=$(head -1 id_table.txt)
read;
while IFS='\t' read $A;
do
echo 'downloading '$SRA_Sample_s
echo $tissue_s
#out_dir=`echo $tissue_s | sed 's/ /./g'` #Replacing spaces by dots
#/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir $out_dir --ncbi_error_report $SRA_Sample_s
done
} <./id_table.txt
Output (wrong):
downloading _s Inser
downloading provided> <no
downloading provided> <no
downloading provided> <no
It fails because it does not split the fields correctly. Perhaps the < and > symbols are confusing it? Column names are ordered differently in different files, and some columns are missing from some files. I am stuck here.
The file looks like this:
BioSample_s MBases_l MBytes_l Run_s SRA_Sample_s Sample_Name_s age_s breed_s sex_s Assay_Type_s AssemblyName_s BioProject_s BioSampleModel_s Center_Name_s Consent_s InsertSize_l Library_Name_s Platform_s SRA_Study_s biomaterial_provider_s g1k_analysis_group_s g1k_pop_code_s source_s tissue_s
SAMN02777951 4698 3249 SRR1287653 SRS607026 SL01 19 SL01 female RNA-Seq <not provided> PRJNA247712 Model organism or animal SICHUAN UNIVERSITY public 200 <not provided> ILLUMINA SRP041998 Chengdu Research Base of Giant Panda Breeding <not provided> <not provided> <not provided> blood
SAMN02777952 4451 3063 SRR1287654 SRS607028 XB01 12 XB01 male RNA-Seq <not provided> PRJNA247712 Model organism or animal SICHUAN UNIVERSITY public 200 <not provided> ILLUMINA SRP041998 Chengdu Research Base of Giant Panda Breeding <not provided> <not provided> <not provided> blood
SAMN02777953 4553 3139 SRR1287655 SRS607025 XB02 6 XB02 female RNA-Seq <not provided> PRJNA247712 Model organism or animal SICHUAN UNIVERSITY public 200 <not provided> ILLUMINA SRP041998 Chengdu Research Base of Giant Panda Breeding <not provided> <not provided> <not provided> blood
You may find an awk script more robust and less cumbersome than a shell loop:
$ cat tst.awk
BEGIN { FS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
print "downloading", $(f["SRA_Sample_s"])
out_dir = $(f["tissue_s"])
gsub(/ /,".",out_dir)
cmd = sprintf( "/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir %s --ncbi_error_report %s", out_dir, $(f["SRA_Sample_s"]) )
print cmd
#system(cmd); close(cmd)
}
...
$ awk -f tst.awk file
downloading SRR1287653
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287653
downloading SRR1287654
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287654
downloading SRR1287655
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287655
I would say you should DEFINITELY avoid the shell loop unless you are calling an external command and so are doing more than just text processing.
Alternatively, consider using awk to process the text and then piping its output to a shell loop that executes the external command:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
gsub(/ /,".",$(f["tissue_s"]))
print $(f["tissue_s"]), $(f["SRA_Sample_s"])
}
...
$ awk -f tst.awk file |
while IFS=$'\t' read -r out_dir SRA_Sample_s
do
printf 'downloading %s\n' "$SRA_Sample_s"
#/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir "$out_dir" --ncbi_error_report "$SRA_Sample_s"
done
downloading SRR1287653
downloading SRR1287654
downloading SRR1287655
IFS='\t' doesn't do what you want. Inside single quotes, \t is the two literal characters \ and t, so read splits on backslashes and on the letter t. Use IFS=$'\t' to split on tabs.
That is why you get _s Inser etc. (note that each fragment starts and ends where a letter t was).
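To see the difference for yourself, here is a small bash sketch. The sample line is a fragment of the header from the question plus one value; the variable names a1, a2, b1 and so on are only for illustration.

```shell
#!/usr/bin/env bash
# One tab-separated line: two header names from the question and one value.
line=$'Consent_s\tInsertSize_l\t<not provided>'

# IFS='\t' is the two characters backslash and t, so read splits at every "t".
IFS='\t' read -r a1 a2 rest <<< "$line"
printf 'wrong: [%s] [%s]\n' "$a1" "$a2"

# IFS=$'\t' is a single tab character, so read splits at tabs only.
IFS=$'\t' read -r b1 b2 b3 <<< "$line"
printf 'right: [%s] [%s] [%s]\n' "$b1" "$b2" "$b3"
```

The second field of the "wrong" read is _s, a tab, then Inser, which is the fragment that shows up in the question's output.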
That said, I completely agree with EdMorton that using awk for this is probably the better idea, although I believe that with careful quoting, and the assertion that tabs cannot appear inside a field of the input file, you can do this safely with just the shell (but Ed has shown me the error of my initial thoughts on more than one occasion, so he may well be thinking of something I am not).
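For what it's worth, a shell-only version could look roughly like the sketch below. It is not a tested replacement for the awk script. It assumes bash 4 or later (for declare -A), that no field is empty (tab counts as IFS whitespace, so consecutive tabs collapse into one separator), and it uses a made-up three-column table in a temporary file in place of id_table.txt. The names col, row and results are only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for id_table.txt: header plus one invented data row.
demo=$(mktemp)
printf 'Run_s\tSRA_Sample_s\ttissue_s\n'       >  "$demo"
printf 'SRR1287653\tSRS607026\twhole blood\n'  >> "$demo"

declare -A col      # header name -> zero-based column index
results=()          # "sample:out_dir" pairs, collected only for this demo

{
    # Read the header once and remember the position of every column name.
    IFS=$'\t' read -r -a header
    for i in "${!header[@]}"; do
        col[${header[i]}]=$i
    done

    # Read each data row into an array and look fields up by column name.
    while IFS=$'\t' read -r -a row; do
        sample=${row[${col[SRA_Sample_s]}]}
        tissue=${row[${col[tissue_s]}]}
        out_dir=${tissue// /.}      # replace spaces with dots
        printf 'downloading %s into %s\n' "$sample" "$out_dir"
        results+=("$sample:$out_dir")
    done
} < "$demo"

rm -f "$demo"
```

Because fields are looked up by name through col, this copes with columns appearing in a different order from file to file. A column that is missing altogether would need an extra check before it is used.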