Learn to parse fasta file using python

I am learning python and I want to parse a fasta file without using BioPython. My txt file looks like this:



I would like to parse this to store the headers of each sequence that are> 22567 and> 34454 into a list of headers (this works). And after each heading, read the next sequence in the sequence listing.

Result, I would like to look like this:

headers =  ['>22567','>34454']


The problem is that when I try to read a part of the sequence, I cannot figure out how to concatenate each line into one line of the sequence before I add it to the list. Instead, I have each line added to the sequence list.

The code I have so far:


import re 

dna = []
sequences = []

def read_fasta(filename):
    global seq, header, dna, sequences 

#open the file  
    with open(filename) as file:    
        seq = ''        
        #forloop through the lines
        for line in file: 
            header = re.search(r'^>\w+', line)
            #if line contains the header '>' then append it to the dna list 
            if header:
                line = line.rstrip("\n")
            # in the else statement is where I have problems, what I would like is
                #the proceeding lines before the next '>' is the sequence for each header,
                #concatenate these lines into one string and append to the sequences list 
                seq = line.replace('\n', '')  

filename = 'gc.txt'




source to share

3 answers

Note. I had this solution in one of my projects, so I pasted it directly here. However, the solution is not mine and refers to this poster here . Please support his answer. Thanks to @donkeykong for finding the original post

Use a list to accumulate rows until you reach a new ID. Then concatenate the strings and store them with the id in the dictionary. The following function takes an open file and gives each pair (id, sequence).

def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                if name: yield (name, ''.join(seq))
                name, seq = line, []
        if name: yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)





This was the answer on SO. I'll try to find it and give credit to the original poster.



Keep a temporary line and add to this line every time you have a sequence of parts. When you get a new title, add the string to the sequence list and set it to an empty string.



You can use groupby:

from itertools import groupby
with open("in.fasta") as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val



Or using a generator:

def grouped(fle):
    from itertools import groupby
    with open(fle) as f:
        groups = groupby(f, key=lambda x: not x.startswith(">"))
        for k, v in groups:
            if not k:
                yield list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1],""))





All Articles