Read lines from one file, grep for them in a second file, and write an output file for each $line

I have the following two files:

sequences.txt

158333741       Acaryochloris_marina_MBIC11017_uid58167 158333741       432     1       432     COG0001 0
158339504       Acaryochloris_marina_MBIC11017_uid58167 158339504       491     1       491     COG0002 0
379012832       Acetobacterium_woodii_DSM_1030_uid88073 379012832       430     1       430     COG0001 0
302391336       Acetohalobium_arabaticum_DSM_5501_uid51423      302391336       441     1       441     COG0003 0
311103820       Achromobacter_xylosoxidans_A8_uid59899  311103820       425     1       425     COG0004 0
332795879       Acidianus_hospitalis_W1_uid66875        332795879       369     1       369     COG0005 0
332796307       Acidianus_hospitalis_W1_uid66875        332796307       416     1       416     COG0005 0

      

allids.txt

COG0001
COG0002
COG0003
COG0004
COG0005

      

Now I want to read each line in allids.txt, search for all matching lines in sequences.txt (specifically in column 7), and, for each line of allids.txt, write the matches to a file named after $line.

my approach is to use a simple grep:

while read line; do
  grep "$line" sequences.txt
done <allids.txt

      

but where can I include the command for the output? If there is a faster way, feel free to suggest it!

My expected output:

COG0001.txt

158333741       Acaryochloris_marina_MBIC11017_uid58167 158333741       432     1       432     COG0001 0
379012832       Acetobacterium_woodii_DSM_1030_uid88073 379012832       430     1       430     COG0001 0

      

COG0002.txt

158339504       Acaryochloris_marina_MBIC11017_uid58167 158339504       491     1       491     COG0002 0

      

[and so on]

+3




3 answers


I suspect you really need:

awk '{print > ($7".txt")}' sequences.txt

      



This suspicion is based on your naming the ids file allids.txt (note the "all"), which suggests there will be no ids in sequences.txt that don't exist in allids.txt.
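One caveat with the one-liner: awk keeps every output file open until it exits or until close() is called, and some awk implementations limit how many files can be open at once. If sequences.txt contains a very large number of distinct ids, a variant along these lines may help. It is only a sketch: the scratch directory and the three sample rows are assumptions for illustration, and because it appends with >>, any stale output files from an earlier run should be removed first.

```shell
# Hypothetical sample data in a scratch directory (same layout as the question).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  158339504 Acaryochloris_marina_MBIC11017_uid58167 158339504 491 1 491 COG0002 0 \
  379012832 Acetobacterium_woodii_DSM_1030_uid88073 379012832 430 1 430 COG0001 0 \
  > sequences.txt

# Append each line to <column 7>.txt and close the file immediately,
# so at most one output file is open at any time.
awk '{ out = $7 ".txt"; print >> out; close(out) }' sequences.txt
```

Opening and closing a file for every line is slower than keeping the files open, so this trade-off is probably only worth making when the plain version fails with a "too many open files" error.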

+2




It's quite easy to do this using awk:

awk 'NR==FNR{ids[$1]; next} $7 in ids{print > ($7 ".txt")}' allids.txt sequences.txt

      



Ref: Effective AWK Programming
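For readers unfamiliar with the NR==FNR idiom, here is the same command spread over several lines with comments. The scratch directory and the sample rows (including the made-up id COG9999, which is deliberately absent from allids.txt) are assumptions added for illustration.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' COG0001 COG0002 > allids.txt
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  158339504 Acaryochloris_marina_MBIC11017_uid58167 158339504 491 1 491 COG0002 0 \
  379012832 Acetobacterium_woodii_DSM_1030_uid88073 379012832 430 1 430 COG0001 0 \
  100000001 Made_up_organism_for_illustration 100000001 100 1 100 COG9999 0 \
  > sequences.txt

awk '
  # NR counts lines across all inputs, FNR restarts for each file, so
  # NR == FNR holds only while the first file (allids.txt) is being read.
  NR == FNR { ids[$1]; next }      # remember the id, skip the other rule
  # For sequences.txt: keep lines whose column 7 is one of the known ids.
  $7 in ids { print > ($7 ".txt") }
' allids.txt sequences.txt
```

With this sample input, the COG9999 row is skipped and no COG9999.txt is created, which is the difference from the unfiltered one-liner that writes a file for every value found in column 7.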

+5




Extending your approach seems to have worked:

while read line; do
  # touching is not necessary as pointed out by @123
  # touch "$line.txt" 
  grep "$line" sequences.txt > "$line.txt"
done <allids.txt

      

It creates text files with the required output, but I cannot comment on the efficiency of this approach.

EDIT :

As noted in the comments, this method is slow and breaks for any file that violates the unreasonable assumptions used in the answer. I'm leaving it here to show how a quick and hacky solution can backfire.
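For illustration, two of those assumptions are that an id never appears outside column 7 (grep matches anywhere on the line) and that the lines of allids.txt contain no backslashes or leading/trailing whitespace (plain read alters both). A sketch of a loop that avoids these two problems is below. It is not the original answer's code, the sample rows are made up, and it still rereads sequences.txt once per id, so it remains slower than the single-pass awk answers.

```shell
# Hypothetical sample data in a scratch directory; the second row has
# COG0001 inside column 2 but belongs to COG0002 according to column 7.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' COG0001 COG0002 > allids.txt
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  100000002 Made_up_COG0001_lookalike 100000002 100 1 100 COG0002 0 \
  > sequences.txt

# IFS= and -r make read return each line unchanged.
while IFS= read -r id; do
  [ -n "$id" ] || continue                 # skip empty lines
  # Compare the id against column 7 only, not the whole line.
  awk -v id="$id" '$7 == id' sequences.txt > "$id.txt"
done < allids.txt
```

With this sample, plain grep "COG0001" would also have copied the second row into COG0001.txt, whereas the column comparison puts it only in COG0002.txt.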

-1

