Read data efficiently in python (one line only)

Question

To prepare for an upcoming programming competition, I have been solving problems from the previous one. Each task looks like this: we get a set of input files (each containing a single line of numbers and strings, e.g. "2 15 test 23 ..."), and we have to write a program that returns some calculated values.

These files can be quite large: for example, 10 MB. My code is as follows:

with open(filename) as f:
    input_data = f.read().split()

It's pretty slow. I am mostly concerned about the split() call. Is there a faster way?

+3

python string python-3.x io

user38034 01 oct. 14 at 10:32

2 answers

wim · Answer 1 · 2014-10-01T10:51:27+0000

What you already have looks like the best way to read plain text from a one-line file.

10 MB of plain text is quite large. If you need more of a speedup, you might consider storing the data in a binary format instead of plain text. Or, if the data is very repetitive, you can store it compressed.
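A minimal sketch of both ideas, using the standard library's pickle and gzip modules. The function names and file paths are hypothetical, and whether either variant actually beats f.read().split() depends on the data, so it is worth timing on real input files:

```python
import gzip
import pickle


def save_tokens_pickle(tokens, path):
    # Store the already-split token list in binary form.
    with open(path, "wb") as f:
        pickle.dump(tokens, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_tokens_pickle(path):
    # Load the token list back; no splitting is needed at read time.
    with open(path, "rb") as f:
        return pickle.load(f)


def save_text_gzip(text, path):
    # Store the original one-line text compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)


def load_tokens_gzip(path):
    # Decompress, then split on whitespace as before.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().split()
```

Both variants only help if you control how the input files are produced, which may not be the case in a competition setting.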

goncalopp · Answer 2 · 2014-10-01T11:13:11+0000

If one of your input files contains independent tasks (i.e. you can work on a few tokens of the line at a time without knowing the tokens further ahead), you can do the reading and the processing in lockstep, simply by not reading the entire file at once.

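def read_groups(f, chunksize=4096):
    # chunksize: how many characters to read from the file at once.
    # entire_group_inside and next_group_index are task-specific helpers
    # that you have to supply yourself.
    buf = f.read(chunksize)
    while buf:
        if entire_group_inside(buf):  # is there enough data in buf to process?
            i = next_group_index(buf)  # index at which the next group of tokens starts
            group, buf = buf[:i], buf[i:]
            yield group
        else:
            more = f.read(chunksize)
            if not more:  # end of file: yield what is left instead of looping forever
                yield buf
                return
            buf += more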

with open(filename) as f:
    for data in read_groups(f):
        pass  # do something with each group

This has some advantages:

You don't have to read the entire file into memory (which, for 10 MB on a desktop machine, probably doesn't matter much).
If you do a lot of processing on each group of tokens, it can lead to better performance, because I/O and CPU work are interleaved. Modern OSs use sequential file prefetching to optimize linear file access, so in practice, if you alternate I/O and CPU work, the OS will end up performing your I/O in parallel. Even if your OS doesn't have this functionality, a modern disk will probably cache sequential block accesses.

If you don't have that much processing to do, your task is mostly I/O-bound, and there is not much you can do to speed it up, as wim said, apart from rethinking the format of your input data.
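As a concrete instance of the chunked-reading idea above, here is a minimal sketch for the simplest case, where each whitespace-separated token is its own group. The name read_tokens is hypothetical, and the sketch assumes the file is opened in text mode:

```python
def read_tokens(f, chunksize=4096):
    """Yield whitespace-separated tokens from f without reading it all at once."""
    tail = ""  # possibly incomplete token carried over from the previous chunk
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        buf = tail + chunk
        parts = buf.split()
        if parts and not buf[-1].isspace():
            # The last token may continue in the next chunk, so hold it back.
            tail = parts.pop()
        else:
            tail = ""
        yield from parts
    if tail:
        # End of file: whatever was held back is a complete token.
        yield tail
```

It is used the same way as read_groups: open the file and iterate over read_tokens(f). For a plain "split everything" workload this is unlikely to beat f.read().split(), since the per-token work moves from C into Python; it only pays off when each token triggers enough processing to overlap with the I/O.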
