Parse a file and use some of its fields as variables, named after the header, in bash

I have a file, the first line of which contains a series of field names separated by tabs (\t). I am trying to step through the lines and use some of the fields as variables for a program. The code I have so far is the following:

    {
    A=$(head -1 id_table.txt)
    read;
    while IFS='\t' read $A; 
    do
        echo 'downloading '$SRA_Sample_s
        echo $tissue_s
    #out_dir=`echo $tissue_s | sed 's/ /./g'` #Replacing spaces by dots
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir $out_dir --ncbi_error_report $SRA_Sample_s 
    done 
    } <./id_table.txt

      

Output (wrong):

    downloading _s Inser
    downloading  provided> <no
    downloading  provided> <no
    downloading  provided> <no

      

It fails because it does not handle the fields correctly. Perhaps the <> symbols are confusing it? Column names are ordered differently in different files, and some columns are missing from some files. I am stuck here.

The file looks like this:

BioSample_s MBases_l    MBytes_l    Run_s   SRA_Sample_s    Sample_Name_s   age_s   breed_s sex_s   Assay_Type_s    AssemblyName_s  BioProject_s    BioSampleModel_s    Center_Name_s   Consent_s   InsertSize_l    Library_Name_s  Platform_s  SRA_Study_s biomaterial_provider_s  g1k_analysis_group_s    g1k_pop_code_s  source_s    tissue_s
SAMN02777951    4698    3249    SRR1287653  SRS607026   SL01    19  SL01    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777952    4451    3063    SRR1287654  SRS607028   XB01    12  XB01    male    RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777953    4553    3139    SRR1287655  SRS607025   XB02    6   XB02    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood

      



3 answers


You may find an awk script more reliable and less cumbersome to use than a shell loop:

$ cat tst.awk
BEGIN { FS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    print "downloading", $(f["SRA_Sample_s"])
    out_dir = $(f["tissue_s"])
    gsub(/ /,".",out_dir)
    cmd = sprintf( "/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir %s --ncbi_error_report %s", out_dir, $(f["SRA_Sample_s"]) )
    print cmd
    #system(cmd); close(cmd)
}

      

...

$ awk -f tst.awk file
downloading SRR1287653
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287653
downloading SRR1287654
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287654
downloading SRR1287655
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287655

      

I would say that you should DEFINITELY avoid the shell loop unless you are calling an external command and are therefore doing more than just text processing.



Alternatively, consider using awk to process the text and then piping its output to a shell loop that executes the external command:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    gsub(/ /,".",$(f["tissue_s"]))
    print $(f["tissue_s"]), $(f["SRA_Sample_s"])
}

      

...

$ awk -f tst.awk file |
while IFS=$'\t' read -r out_dir SRA_Sample_s
do
    printf 'downloading %s\n' "$SRA_Sample_s"
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir $out_dir --ncbi_error_report $SRA_Sample_s 
done
downloading SRR1287653
downloading SRR1287654
downloading SRR1287655
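
Since the question mentions that some columns are missing from some files, it may be worth checking the header before doing any real work. The following is only a sketch of that idea: `check_columns` is a hypothetical helper name, it assumes column names contain no spaces, and it relies on `/dev/stderr` being usable from awk (true for gawk and on most Linux systems).

```shell
#!/usr/bin/env bash
# check_columns (hypothetical helper): reads a tab-separated table on stdin and
# exits non-zero if any of the column names given as arguments is absent
# from the header line. Assumes column names contain no spaces.
check_columns() {
    awk -v required="$*" '
        BEGIN { FS = "\t"; n = split(required, want, " ") }
        NR == 1 {
            # Map each header name to its position, as in the scripts above.
            for (i = 1; i <= NF; i++) f[$i] = i
            for (j = 1; j <= n; j++)
                if (!(want[j] in f)) {
                    printf "missing column: %s\n", want[j] > "/dev/stderr"
                    bad = 1
                }
            # Only the header matters here, so stop after the first line.
            exit bad
        }
    '
}

# Usage with a tiny inline sample (a made-up two-column subset of the table):
printf 'SRA_Sample_s\ttissue_s\nSRS607026\tblood\n' \
    | check_columns SRA_Sample_s tissue_s && echo "header ok"
```

A guard like this could run once per file, for example as `check_columns SRA_Sample_s tissue_s < file || exit 1`, before the awk script or the loop is started.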

      



IFS='\t' doesn't do what you want: inside single quotes, \t is a literal backslash followed by the letter t, so read splits on t. Use IFS=$'\t' to split on tabs.

This is why you get _s Inser etc. (note that it starts and is cut off at the letter t).
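
The difference between the two settings can be seen in a small experiment. This is a minimal sketch that assumes bash (for `$'...'` quoting and here-strings); the sample line is made up and merely resembles one row of the table.

```shell
#!/usr/bin/env bash
# Made-up sample: three tab-separated fields, one of which contains a space.
line=$'SRS607026\tnot provided\tblood'

# IFS='\t' holds two literal characters, a backslash and the letter t,
# so read splits at every "t" in the line and ignores the tabs.
IFS='\t' read -r a1 a2 a3 <<< "$line"

# IFS=$'\t' holds a real tab character, so read splits on tabs only.
IFS=$'\t' read -r b1 b2 b3 <<< "$line"

printf 'two-character IFS: [%s] [%s] [%s]\n' "$a1" "$a2" "$a3"
printf 'real-tab IFS:      [%s] [%s] [%s]\n' "$b1" "$b2" "$b3"
```

With the real tab, the three variables receive `SRS607026`, `not provided` and `blood`. With the two-character version the line is cut at the `t` of `not`, which matches the kind of fragments shown in the question.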



I do completely agree with Ed Morton that using awk for this is probably the better idea, though I believe that with careful quoting, and the assertion that tabs will not appear inside the fields of the input file, this can be done safely with just the shell (but Ed has shown me the error of my initial thoughts on more than one occasion, so he may very well be thinking of something I am not).
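
A shell-only version along those lines might look like the sketch below. It is one possible approach, not a definitive one: it assumes bash 4 or later (for associative arrays), assumes there are no empty fields (tab counts as IFS whitespace, so consecutive tabs collapse into one separator), and `process_table` is a hypothetical function name. Looking fields up by header name means the column order can differ between files, and a missing column falls back to a placeholder.

```shell
#!/usr/bin/env bash
# process_table (hypothetical helper): reads a tab-separated table on stdin,
# using the first line as column names. Requires bash 4+.
process_table() {
    local -a names values
    local -A row
    local i

    # First line: the column names, in whatever order this file uses.
    IFS=$'\t' read -r -a names || return 1

    while IFS=$'\t' read -r -a values; do
        # Rebuild the name -> value map for the current line.
        row=()
        for i in "${!names[@]}"; do
            row[${names[i]}]=${values[i]-}
        done
        # Look fields up by name; a column absent from this file yields "NA".
        printf 'downloading %s (tissue: %s)\n' \
            "${row[SRA_Sample_s]-NA}" "${row[tissue_s]-NA}"
    done
}

# Usage with a tiny inline sample (a made-up two-column subset of the table):
printf 'tissue_s\tSRA_Sample_s\nwhole blood\tSRS607026\n' | process_table
```

Because the header names are used only as array keys and never as variable names, an odd header cannot overwrite a shell variable, which is one reason to prefer this over passing the header straight to `read` or `eval`.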



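Try this (based on your coding style):

cat id_table.txt \
 | {
   # The header line supplies the variable names.
   IFS= read -r Header

   # Split each data line on tabs only, so fields containing spaces stay whole.
   # ${Header} is left unquoted on purpose: word splitting yields one name per column.
   while IFS=$'\t' read -r ${Header}
    do
      echo "Downloading ${SRA_Sample_s}"
      echo "${tissue_s}"
    done
   }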

      
