Adding "#" before the first 8 lines corresponding to STRING

Question

The question is a bit confusing, so I'll just show you an example.

Let's say I have the following case:

$ grep -P "locus_tag\tM715_1000193188" Genome.tbl -B1 -A8
193188  193066  gene
            locus_tag   M715_1000193188
193188  193066  mRNA
            product hypothetical protein
            protein_id  gnl|CorradiLab|M715_1000193188
            transcript_id   gnl|CorradiLab|M715_mrna1000193188
193188  193066  CDS
        product hypothetical protein
        protein_id  gnl|CorradiLab|M715_1000193188
        transcript_id   gnl|CorradiLab|M715_mrna1000193188

I want to add "#" at the start of each of the 8 lines following "locus_tag M715_1000193188", so that the modified file looks like this:

193188  193066  gene
            locus_tag   M715_1000193188
#193188 193066  mRNA
#           product hypothetical protein
#           protein_id  gnl|CorradiLab|M715_1000193188
#           transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188 193066  CDS
#       product hypothetical protein
#       protein_id  gnl|CorradiLab|M715_1000193188
#       transcript_id   gnl|CorradiLab|M715_mrna1000193188

Essentially, I have a file with ~3000 different locus tags, and for 300 of them I need to comment out the mRNA and CDS features, i.e. the 8 lines following the locus_tag line.

Is there a way to do this with sed? There are other types of information in the file that must be left untouched.

Thanks, Adrian

awk sed text-parsing

AdrianP. Apr 28 15 at 17:55

4 answers

Jotne · Answer 1 · 2015-04-28T18:01:42+0000

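If you can use awk, this should do. Note that, as written, it fires on every locus_tag line; put a specific tag in the pattern to restrict it: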

awk 'f&&f-- {$0="#"$0} /locus_tag/ {f=8} 1' file
193188  193066  gene
            locus_tag   M715_1000193188
#193188  193066  mRNA
#            product hypothetical protein
#            protein_id  gnl|CorradiLab|M715_1000193188
#            transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188  193066  CDS
#        product hypothetical protein
#        protein_id  gnl|CorradiLab|M715_1000193188
#        transcript_id   gnl|CorradiLab|M715_mrna1000193188

Etan Reisner · Answer 2 · 2015-04-28T18:08:44+0000

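sed supports address ranges, which can do what you want here. Note that addr1,+N covers the matching line itself as well as the N lines after it, so the locus_tag line has to be excluded inside the range:

sed -e '/locus_tag\tM715_1000193188/,+8{/locus_tag\tM715_1000193188/!s/^/#/}' file

As noted in the comments, the addr1,+N range form is a GNU sed extension.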
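
To check which lines an addr1,+N range actually covers, here is a small sketch on made-up input (GNU sed assumed; the five-line input is only a stand-in for the real file). The range includes the matching line itself, so the inner negated address is what keeps that line unmodified:

```shell
# GNU sed only: "addr1,+N" selects the matching line plus the N lines after it.
# The inner "/^tag$/!" skips the matching line, so only the followers get "#".
result=$(printf '%s\n' before tag a b after | sed '/^tag$/,+2{/^tag$/!s/^/#/}')
printf '%s\n' "$result"
```

Only the lines "a" and "b" come out prefixed with "#"; "before", "tag" and "after" are printed unchanged.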

Ed Morton · Answer 3 · 2015-04-28T18:21:58+0000

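$ cat tst.awk
# Build a lookup set from the space-separated list passed in with -v tags=...
BEGIN { split(tags,tmp); for (i in tmp) tagsA[tmp[i]] }
# While the countdown c is non-zero, decrement it and comment out the line.
c&&c-- { $0 = "#" $0 }
# On the locus_tag line of a listed tag, arm the countdown for the next 8 lines.
# The NF > 1 guard keeps $(NF-1) from referencing a negative field on blank lines.
(NF > 1) && ($(NF-1) == "locus_tag") && ($NF in tagsA) { c=8 }
{ print }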

$ awk -v tags="M715_1000193188 M715_1000193189 M715_1000193190" -f tst.awk file
193188  193066  gene
            locus_tag   M715_1000193188
#193188  193066  mRNA
#            product hypothetical protein
#            protein_id  gnl|CorradiLab|M715_1000193188
#            transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188  193066  CDS
#        product hypothetical protein
#        protein_id  gnl|CorradiLab|M715_1000193188
#        transcript_id   gnl|CorradiLab|M715_mrna1000193188

Just list all 300 locus tag values you care about in the tags variable, as shown above for 3 examples.
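
With 300 tags, a list on the command line gets unwieldy. Below is a minimal sketch of the same countdown idea that reads the tags from a file instead, assuming a hypothetical tags.txt with one locus tag per line. The file names, the shortened tags and the demo records are all made up for illustration and do not come from the question:

```shell
# Demo data (made up): two genes, only the first tag is listed in tags.txt.
workdir=$(mktemp -d)
printf '%s\n' 'M715_1' > "$workdir/tags.txt"
printf '1\t9\tgene\n\t\t\tlocus_tag\tM715_1\n1\t9\tmRNA\n\t\t\tproduct\tx\n5\t7\tgene\n\t\t\tlocus_tag\tM715_2\n5\t7\tmRNA\n' > "$workdir/demo.tbl"

# First file (NR == FNR): remember every tag. Second file: same countdown as tst.awk.
# n=2 only because the demo records are short; the case in the question needs n=8.
result=$(awk -v n=2 '
    NR == FNR { tagsA[$1]; next }
    c && c--  { $0 = "#" $0 }
    NF > 1 && $(NF-1) == "locus_tag" && ($NF in tagsA) { c = n }
    { print }
' "$workdir/tags.txt" "$workdir/demo.tbl")
printf '%s\n' "$result"
rm -r "$workdir"
```

Only the two lines after the M715_1 locus_tag line are commented out; the M715_2 record is printed untouched because its tag is not in the list.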

potong · Answer 4 · 2015-04-28T18:30:07+0000

This might work for you (GNU sed):

sed 's/.*/\\#locus_tag\\s*&#,+8{\\#locus_tag\\s*&#n;s|^|#|}/' tag_file |
sed -i -f - file

This creates a sed script from the tag file (one locus tag per line) and prepends # to the eight lines following each matching locus_tag line.
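
Because the generated script edits the file in place, it can help to look at it before running it. Here is a rough sketch of the two steps on made-up data, assuming GNU sed. The tags T1 and T2, the file names and the shortened range count of 2 are illustrative only, and the script is written to a temporary file rather than piped into sed -f -:

```shell
# Made-up tag list and input; the range count is 2 only to keep the demo short.
workdir=$(mktemp -d)
printf '%s\n' T1 > "$workdir/tag_file"
printf '%s\n' 'locus_tag T1' a b c 'locus_tag T2' d > "$workdir/file"

# Step 1: turn each tag into one sed command and inspect it before running it.
sed 's/.*/\\#locus_tag\\s*&#,+2{\\#locus_tag\\s*&#n;s|^|#|}/' "$workdir/tag_file" > "$workdir/script.sed"
generated=$(cat "$workdir/script.sed")
printf '%s\n' "$generated"

# Step 2: apply the generated script (without -i, so the input file is left alone).
result=$(sed -f "$workdir/script.sed" "$workdir/file")
printf '%s\n' "$result"
rm -r "$workdir"
```

In this demo the two lines after "locus_tag T1" are commented out and everything else is unchanged. The generated pattern is unanchored, so a tag that is a prefix of another tag would match both; appending $ after the tag in the pattern is one way to guard against that.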
