Learn to parse fasta file using python
I am learning python and I want to parse a fasta file without using BioPython. My txt file looks like this:
>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA
I would like to parse this to store the headers of each sequence that are> 22567 and> 34454 into a list of headers (this works). And after each heading, read the next sequence in the sequence listing.
Result, I would like to look like this:
headers = ['>22567','>34454']
sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', AAAACTTTGTGAAAA....']
The problem is that when I try to read a part of the sequence, I cannot figure out how to concatenate each line into one line of the sequence before I add it to the list. Instead, I have each line added to the sequence list.
The code I have so far:
#!/usr/bin/python
import re
dna = []
sequences = []
def read_fasta(filename):
global seq, header, dna, sequences
#open the file
with open(filename) as file:
seq = ''
#forloop through the lines
for line in file:
header = re.search(r'^>\w+', line)
#if line contains the header '>' then append it to the dna list
if header:
line = line.rstrip("\n")
dna.append(line)
# in the else statement is where I have problems, what I would like is
#else:
#the proceeding lines before the next '>' is the sequence for each header,
#concatenate these lines into one string and append to the sequences list
else:
seq = line.replace('\n', '')
sequences.append(seq)
filename = 'gc.txt'
read_fasta(filename)
source to share
Note. I had this solution in one of my projects, so I pasted it directly here. However, the solution is not mine and refers to this poster here . Please support his answer. Thanks to @donkeykong for finding the original post
Use a list to accumulate rows until you reach a new ID. Then concatenate the strings and store them with the id in the dictionary. The following function takes an open file and gives each pair (id, sequence).
def read_fasta(fp):
name, seq = None, []
for line in fp:
line = line.rstrip()
if line.startswith(">"):
if name: yield (name, ''.join(seq))
name, seq = line, []
else:
seq.append(line)
if name: yield (name, ''.join(seq))
with open('ex.fasta') as fp:
for name, seq in read_fasta(fp):
print(name, seq)
Output:
('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA')
('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')
This was the answer on SO. I'll try to find it and give credit to the original poster.
source to share
You can use groupby:
from itertools import groupby
with open("in.fasta") as f:
groups = groupby(f, key=lambda x: not x.startswith(">"))
d = {}
for k,v in groups:
if not k:
key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
d[key] = val
print(d)
{'>34454': 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA', '>22567': 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'}
Or using a generator:
def grouped(fle):
from itertools import groupby
with open(fle) as f:
groups = groupby(f, key=lambda x: not x.startswith(">"))
for k, v in groups:
if not k:
yield list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1],""))
print(list(grouped("in.fasta")))
[('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'), ('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')]
source to share