Learn to parse fasta file using python

I am learning python and I want to parse a fasta file without using BioPython. My txt file looks like this:

>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA

      

I would like to parse this to store the headers of each sequence that are> 22567 and> 34454 into a list of headers (this works). And after each heading, read the next sequence in the sequence listing.

Result, I would like to look like this:

headers =  ['>22567','>34454']
sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', AAAACTTTGTGAAAA....']  

      

The problem is that when I try to read a part of the sequence, I cannot figure out how to concatenate each line into one line of the sequence before I add it to the list. Instead, I have each line added to the sequence list.

The code I have so far:

#!/usr/bin/python 

import re 

dna = []
sequences = []


def read_fasta(filename):
    global seq, header, dna, sequences 

#open the file  
    with open(filename) as file:    
        seq = ''        
        #forloop through the lines
        for line in file: 
            header = re.search(r'^>\w+', line)
            #if line contains the header '>' then append it to the dna list 
            if header:
                line = line.rstrip("\n")
                dna.append(line)            
            # in the else statement is where I have problems, what I would like is
            #else: 
                #the proceeding lines before the next '>' is the sequence for each header,
                #concatenate these lines into one string and append to the sequences list 
            else:               
                seq = line.replace('\n', '')  
                sequences.append(seq)      

filename = 'gc.txt'

read_fasta(filename)

      

+3


source to share


3 answers


Note. I had this solution in one of my projects, so I pasted it directly here. However, the solution is not mine and refers to this poster here . Please support his answer. Thanks to @donkeykong for finding the original post

Use a list to accumulate rows until you reach a new ID. Then concatenate the strings and store them with the id in the dictionary. The following function takes an open file and gives each pair (id, sequence).

def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        if name: yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)

      

Output:



('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA')
('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')

      


This was the answer on SO. I'll try to find it and give credit to the original poster.

+5


source


Keep a temporary line and add to this line every time you have a sequence of parts. When you get a new title, add the string to the sequence list and set it to an empty string.



+2


source


You can use groupby:

from itertools import groupby
with open("in.fasta") as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val

print(d)
{'>34454': 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA', '>22567': 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'}

      

Or using a generator:

def grouped(fle):
    from itertools import groupby
    with open(fle) as f:
        groups = groupby(f, key=lambda x: not x.startswith(">"))
        for k, v in groups:
            if not k:
                yield list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1],""))


print(list(grouped("in.fasta")))
[('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'), ('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')]

      

0


source







All Articles