Python: EOL search in file doesn't work
I have this method:
```python
def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer

    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1)  # Go to the next chunk
        s = f.readline()      # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break
```
It is supposed to split the file into chunks and return the start of the chunk (in bytes) and the block size.
Basically the end of each chunk should match the end of a line (which is what the `readline()` call is for), but I find my chunks don't look for an EOL at all.
The purpose of the method is to read chunks that can then be passed to a `csv.reader` instance for further processing.
I haven't been able to spot anything clearly wrong with the function... any ideas why it isn't moving to the EOL?
I came up with this rather clumsy alternative:
```python
def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer

    f = open(path)
    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break
```
This splits the file into chunks, with a csv reader for each chunk, but the last chunk is always empty (??), and joining the lines back together is rather awkward.
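The empty final chunk comes from checking `part` only after it has been yielded; moving the check before the `yield` avoids it. A minimal sketch, assuming Python 3 (`io.StringIO` and floor division in place of Python 2's truncating `/`):

```python
import os
import csv
import multiprocessing as mp
from io import StringIO

def line_chunker(path):
    """Yield a csv.reader per chunk; roughly one chunk per core."""
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores  # floor division, matching the original intent

    with open(path) as f:
        while True:
            # readlines(hint) reads whole lines totalling about `hint` bytes
            part = f.readlines(chunksize)
            if not part:       # test BEFORE yielding, so no empty last chunk
                break
            yield csv.reader(StringIO("".join(part)))
```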
Instead of looking at

```python
if not s:
    break
```

to see if you are at the end of the file, you should check whether you have reached the end of the file using:

```python
if size == f.tell():
    break
```
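Putting that check into the original generator, a sketch (assuming Python 3: binary mode keeps `tell()`/`seek()` byte-accurate, `//` replaces the truncating integer division; the `min()` clamp guards against `seek()` running past EOF on the last chunk):

```python
import os
import multiprocessing as mp

def get_chunksize(path):
    """Yield (start, length) pairs, roughly one per core, each ending at an EOL."""
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores  # floor division

    with open(path, "rb") as f:
        while True:
            start = f.tell()
            f.seek(chunksize, 1)        # jump ahead one nominal chunk
            f.readline()                # extend the chunk to the next EOL
            end = min(f.tell(), size)   # seek() may overshoot EOF on the last chunk
            yield start, end - start
            if end == size:             # stop once the file is exhausted
                break
```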
This should fix it. Also, I wouldn't depend on the CSV file having one entry per line. I have worked with several CSV files that have newlines within them:
```
first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...
```
Note that the second entry (bob) spans 3 lines; `csv.reader` can handle this. If the idea is to do some CPU-intensive work on the csv, I would create an array of streams, each with a buffer of n entries, and ask the `csv.reader` to pass each row to a stream in round-robin order, skipping a stream whose buffer is full.
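A minimal sketch of that idea, using in-memory lists in place of the buffered streams (the `round_robin_rows` helper is hypothetical, not from the original answer): a single `csv.reader` does the parsing, so quoted fields containing newlines stay intact, and rows are dealt out round-robin.

```python
import csv
from io import StringIO
from itertools import cycle

def round_robin_rows(reader, n_workers):
    """Deal csv rows across n_workers buffers in round-robin order.

    One csv.reader parses everything, so multi-line quoted fields are
    never split across workers the way raw byte chunking could split them.
    """
    buffers = [[] for _ in range(n_workers)]
    targets = cycle(buffers)
    for row in reader:
        next(targets).append(row)
    return buffers

# The multi-line "bob" entry from the example above parses as ONE row:
data = ('first,last,message\n'
        'sue,ee,hello\n'
        'bob,builder,"hello,\n'
        'this is some text\n'
        'that I entered"\n'
        'jim,bob,I\'m not so creative...\n')
buffers = round_robin_rows(csv.reader(StringIO(data)), 2)
```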
Hope this helps - enjoy.