Reading lines from HUGE text files in groups of 4

I've been facing a Python problem for a few days. I am a bioinformatician with little programming background, and I work with huge text files (about 25 GB) that I need to process.

I need to read the text file in groups of 4 lines at a time: the first 4 lines have to be read and processed, then the second group of 4 lines, and so on.

Obviously I cannot use the readlines() method, because it would load the whole file into memory, and I need to use each of the 4 lines for some line recognition.

I was thinking about using a for loop with range():

openfile = open(path, 'r')

for elem in range(0, len(openfile), 4):
    line1 = openfile.readline()
    line2 = openfile.readline()
    line3 = openfile.readline()
    line4 = openfile.readline()
    # (process lines...)


Unfortunately, this does not work, because an open file object has no length and cannot be indexed like a list or dictionary.

Can anyone help with the correct loop?

Thank you in advance

+3




5 answers


There is a method for lazily reading large files in Python here. You can use that approach and process four lines at a time. Note that you do not have to do four reads, process them, and then do four more reads; you can read a chunk of several hundred or several thousand lines from the file, process it four lines at a time, and then continue reading the rest of the file when you are done.
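For illustration, here is a minimal sketch of that idea; the helper name, block size, and file name below are placeholders of my own, not taken from the linked answer:

from itertools import islice

def read_line_blocks(file_obj, lines_per_block=4000):
    # Lazily pull a few thousand lines at a time; keeping the block size a
    # multiple of 4 means no 4-line record is ever split across two blocks.
    while True:
        block = list(islice(file_obj, lines_per_block))
        if not block:
            return
        yield block

path = "reads.txt"  # hypothetical file name
with open(path) as handle:
    for block in read_line_blocks(handle):
        for i in range(0, len(block), 4):
            record = block[i:i + 4]  # the four lines of one record
            # process the record here

Only one block of a few thousand lines is held in memory at a time, so this stays far below the 25 GB file size.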



+2




This has low memory overhead; it simply treats the open file as an iterator that yields one line at a time.

import itertools

def grouped(iterator, size):
    # yield fixed-size tuples of lines until the iterator is exhausted
    while group := tuple(itertools.islice(iterator, size)):
        yield group

Use it like this:



for line1, line2, line3, line4 in grouped(your_open_file, size=4):
    do_stuff_with_lines()


Note: the tuple unpacking above assumes that the file does not end with a partial group, i.e. that the number of lines is a multiple of four.
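If your file might end with a partial group after all, a small guard avoids the unpacking error; this is a sketch of my own, assuming the grouped() definition above:

for group in grouped(your_open_file, size=4):
    if len(group) < 4:
        break  # trailing partial record: report or handle it as needed
    line1, line2, line3, line4 = group
    do_stuff_with_lines()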

+5




You are reading a FASTQ file, right? You are most likely reinventing the wheel: you can just use Biopython, which has tools for working with common biology file formats. For example, see this tutorial for working with FASTQ files; it basically looks like this:

from Bio import SeqIO
for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    print(record.id)  # do something with record, using record.seq, record.id, etc.

Read more about Biopython's SeqRecord objects here.

Here is another tutorial on fast FASTQ processing with Biopython, including the option of dropping down to a lower-level parser to speed things up, for example:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
with open("untrimmed.fastq") as handle:
    for title, seq, qual in FastqGeneralIterator(handle):
        print(title)  # do things with the title, seq, qual values

There's also the HTSeq package, with more deep-sequencing-specific tools, which I use the most.
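For completeness, a minimal sketch of the same 4-line-record loop with HTSeq; the attribute names in the comment are from memory, so check the HTSeq documentation before relying on them:

import HTSeq

for read in HTSeq.FastqReader("untrimmed.fastq"):
    print(read.name)  # read.seq and read.qual hold the sequence and the quality scores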

By the way, if you don't already know about Biostar, you could take a look - it's a StackExchange-style site specifically for bioinformatics.

+3




You can use an infinite loop and exit when you reach the end of the file.

while True:
    line1 = f.readline()
    if not line1:
        break  # readline() returns an empty string at end of file

    line2 = f.readline()
    line3 = f.readline()
    line4 = f.readline()
    # process lines


+2




Here's a way to do it that I can't take credit for, but it's perfectly reasonable:

import itertools

# repeating the same open file object 4 times makes zip_longest pull
# 4 consecutive lines per iteration (a partial last record is padded with None)
for name, seq, comment, qual in itertools.zip_longest(*[openfile] * 4):
    print(name)
    print(seq)
    print(comment)
    print(qual)

0








