Python: seeking to EOL when chunking a file doesn't work

I have this method:

import os
import multiprocessing as mp

def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores  # truncated integer division

    f = open(path, 'rb')  # binary mode so the relative seek also works on Python 3
    while True:
        start = f.tell()
        f.seek(chunksize, 1)  # go to the next chunk boundary
        s = f.readline()      # read on so the chunk ends at the end of a line
        yield start, f.tell() - start
        if not s:
            break


It is supposed to split the file into chunks and yield the start of each chunk (in bytes) and the chunk size.

Basically, each chunk should end at the end of a line (which is why f.readline() is called), but I find my chunks don't end at an EOL at all.

The purpose of the method is to produce fragments that can each be passed to a csv.reader instance (via StringIO) for further processing.
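
Roughly, I intend to consume the chunks like this (a sketch; the binary read and UTF-8 decode are just for illustration):

import csv
from io import StringIO

path = "data.csv"  # illustrative path
with open(path, 'rb') as f:
    for start, size in get_chunksize(path):
        f.seek(start)
        chunk = f.read(size).decode()  # assumes UTF-8/ASCII text
        for row in csv.reader(StringIO(chunk)):
            ...  # further processing of each row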

I haven't been able to spot anything clearly wrong with the function ... any ideas why it isn't seeking to the EOL?

I came up with this rather clumsy alternative:

import csv
import multiprocessing as mp
import os
from io import StringIO

def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores  # truncated integer division

    f = open(path)

    while True:
        # readlines(hint) returns complete lines totalling roughly chunksize bytes
        part = f.readlines(chunksize)
        # note: the emptiness check runs after the yield, so one empty
        # final chunk is produced before the loop exits
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break


This splits the file into chunks, with a csv reader for each chunk, but the last chunk is always empty (??) and joining the lines back together is rather awkward.



1 answer


if not s:
    break

Instead of checking s to see whether you have reached the end of the file, check the file position directly:

if size == f.tell(): break
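
Applied to your generator, the whole thing looks roughly like this (a sketch assuming Python 3 and a binary-mode file; I use >= rather than == to guard against the final seek overshooting the end of the file):

import os
import multiprocessing as mp

def get_chunksize(path):
    size = os.path.getsize(path)
    chunksize = size // mp.cpu_count()
    with open(path, 'rb') as f:
        while True:
            start = f.tell()
            f.seek(chunksize, 1)  # jump ahead one chunk...
            f.readline()          # ...then read on to the end of that line
            yield start, f.tell() - start
            if f.tell() >= size:  # >= guards against overshooting EOF
                break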



This should fix it. However, I wouldn't depend on the CSV file having one record per line. I have worked with several CSV files that have newlines inside quoted fields:

first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...


Note that the second entry (bob) spans 3 lines. csv.reader handles this transparently. If the idea is to do some CPU-intensive work on the CSV, I would create a pool of workers, each with a queue holding up to n records, and have a single csv.reader hand records to the workers round-robin, skipping any worker whose queue is full.
Hope this helps - enjoy.
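
A minimal sketch of that round-robin dispatch, assuming one multiprocessing.Queue per worker (the names dispatch, worker, and bufsize are mine, not something from the answer):

import csv
import multiprocessing as mp
import queue  # only for the queue.Full exception

def worker(q):
    # Consume rows until the None sentinel arrives.
    while True:
        row = q.get()
        if row is None:
            break
        # ... CPU-intensive processing of `row` goes here ...

def dispatch(path, bufsize=100):
    cores = mp.cpu_count()
    queues = [mp.Queue(maxsize=bufsize) for _ in range(cores)]
    procs = [mp.Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    with open(path, newline='') as f:
        i = 0
        for row in csv.reader(f):
            # Round-robin over the workers, skipping any whose queue is full.
            while True:
                try:
                    queues[i % cores].put_nowait(row)
                    i += 1
                    break
                except queue.Full:
                    i += 1
    for q in queues:
        q.put(None)  # sentinel: no more input for this worker
    for p in procs:
        p.join()

A plain blocking put() would be simpler if skipping full queues isn't essential.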
