Pipe inside the loop

Let me preface this with an apology, since I'm certainly not a coder, but I need to use a .sh script (run in Git Bash on Windows - a job requirement) to develop a bioinformatics solution for my data.

I suspect my problem is related to parent and subshell variables, but there are several anomalies. First, the script runs on startup minus the loop and does not parse the CSV file. If I place done < test.csv immediately after the echo commands, the script works fine for the last line in my CSV file, but does not generate output files for the other lines. However, if done < test.csv is at the end of my script, it generates the files I want, renames and moves them (and even includes the user variable and a variable fetched from the loop), but almost all of them contain no data.

Any help would be most appreciated. I have read many related questions carefully, but I have not been able to successfully implement their solutions.

Example .csv:

Sample,F_index,R_index
One,dog,cat
Two,dog,cat
Three,cat,dog

      

Code:

#!/bin/bash

echo "Hello - what is your input file, including file type?"
read -r var1
echo "Please enter user details (eg. name or initials)"
read -r var5

# Create a dated output directory for this run
mkdir "$(date +"%Y-%b-%d")"
# Read the CSV one row at a time: sample name, forward index, reverse index
while IFS="," read -r Sample F_index R_index
do
    [ "$Sample" == "Sample" ] && continue   # skip the header row
    echo "Sample : $Sample"
    echo "F_index : $F_index"
    echo "R_index : $R_index"
    grep -B 1 "$F_index" "$var1" \
        | sed "s/""$F_index""/&\\n/;s/.*\\n//" \
        | grep -B 1 --group-separator="$( )" "$R_index" \
        | sed "s/""$R_index"".*//" \
        | tee "$Sample"_trimmed.fa \
        && sed "/^\\s*$/d" "$Sample"_trimmed.fa \
        | sort \
        | uniq -c \
        | sort -nr \
        | sed "/^.*>/ d" \
        | tr -d " " \
        | sed "s/.*[0-9]/>&\\n/g" \
        | tee "$Sample"_deduplicated.fa \
        && sed "s/>//" "$Sample"_deduplicated.fa \
        | sed "/^[0-9]/{N;s/\\n//;}" \
        | sed "s/^\\(.*\\)\\(^[0-9]\\{1,4\\}\\)/\\2,\\1/" \
        | tee >(wc -l) \
        | sed 1i"Sample:,""$Sample""" \
        | sed 2i"User:,""$var5""" \
        | sed 3i"DATE:,$(date)" \
        | sed 4i"Frequency,Unique reads" \
        | tee "$Sample"_results.csv \
        | mv ./*deduplicated.fa ./"$(date +"%Y-%b-%d")" \
        | mv ./*trimmed.fa ./"$(date +"%Y-%b-%d")" \
        | mv ./*results.csv ./"$(date +"%Y-%b-%d")"
done < test.csv

      

1 answer


As pointed out in the comments, there are some specific and some more general problems in your code. A general problem is that instead of using the proper, specialized tools, you are rewriting those tools from scratch in Bash, inefficiently and naively.¹

So, the solution to all your problems is to learn how to use existing tools. Unfortunately, the first step is to find those tools, and the best way to do that is to read methods papers and take sequence analysis courses.

There are many options; here's a small selection. But for your specific purposes, I suggest using cutadapt for adapter trimming and biobambam for deduplication - however, I generally recommend against read deduplication, as it will underestimate your expression signal.
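To make the trimming suggestion concrete, here is a minimal, untested sketch of how cutadapt could replace the grep/sed part of your pipeline, reusing the same CSV loop. The file name reads.fa, the linked-adapter model, and the option values are my assumptions, not a definitive recipe; check the cutadapt documentation for what fits your actual data.

# Sketch only: loop over the same CSV and let cutadapt do the trimming.
# "${F_index}...${R_index}" is a linked adapter (forward index, then reverse
# index); -e 0.1 tolerates sequencing errors and --discard-untrimmed drops
# reads in which the indexes were not found. File names are placeholders.
while IFS="," read -r Sample F_index R_index
do
    [ "$Sample" == "Sample" ] && continue
    cutadapt -g "${F_index}...${R_index}" -e 0.1 --discard-untrimmed \
        -o "${Sample}_trimmed.fa" reads.fa
done < test.csv

Whether a linked adapter is the right model for your reads is an assumption on my part; cutadapt also supports plain 5' and 3' adapters.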




¹ I say naively, but please don't take it personally: this is a truly impressive feat in Bash. But existing tools do much better, for example when removing adapters with sequencing errors, partial adapters, etc., whereas your code will only find adapters if the whole adapter is present with no sequencing errors. Your approach will therefore unfortunately fail in many real-world cases.
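As a rough illustration of that difference (the "dog" index is just the placeholder from the example CSV, and -e 0.1 is an arbitrary error rate, not a recommendation):

# Exact matching: one sequencing error in the index and grep finds nothing,
# so the read is silently dropped.
grep -B 1 "dog" reads.fa

# cutadapt aligns the adapter against each read and tolerates mismatches up
# to the given error rate, as well as partial adapters at the read ends.
cutadapt -g "dog" -e 0.1 -o One_trimmed.fa reads.fa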
