Best way to split a huge file in Python

I need to split a very large file (3 GB) ten times, like this: the first split separates the first 10% of the lines from the rest of the file, the second separates the second 10% of the lines from the rest, and so on. (This is for cross-validation.)

I did it naively by loading the lines of the file into a list, looping through the list and writing each line to the right output file by its index. It takes too long as it writes 3GB of data every time.
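For reference, here is a minimal sketch of that naive approach (the function name and output file names are my own, just for illustration); each pass re-writes the full 3 GB:

def naive_split(path, folds=10):
    with open(path) as f:
        lines = f.readlines()          # loads the whole 3 GB file into memory
    fold_size = len(lines) // folds
    for k in range(folds):
        start, end = k * fold_size, (k + 1) * fold_size
        with open("fold_%d.txt" % k, "w") as test, open("rest_%d.txt" % k, "w") as rest:
            for i, line in enumerate(lines):
                # write each line to the held-out file or the remainder file
                (test if start <= i < end else rest).write(line)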

Is there a better way to do this?

Note: adding # to the beginning of a line is effectively the same as removing it (the line is then treated as a comment). Would it be wiser to just add and remove # at the start of the relevant lines instead of rewriting the files?

EXAMPLE: if the file is [1,2,3,4,5,6,7,8,9,10] then I want to split it like this:

[1] and [2,3,4,5,6,7,8,9,10]
[2] and [1,3,4,5,6,7,8,9,10]
[3] and [1,2,4,5,6,7,8,9,10]


etc.

1 answer


I suggest trying to reduce the number of files you create. Even though 30 GB of output isn't too much for modern drives, writing it still takes a significant amount of time.

For example:



  • Assuming you want 10% of the lines rather than 10% of the size, you can build an index of the byte offset at which each line starts, and then access the (single, original) text file through that index (see the sketch after this list)

  • You can also convert the original file to a fixed-size-record file, so that every line occupies the same number of bytes. You can then jump to any record directly with seek().
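A rough sketch of the index idea from the first bullet (function names are my own, not a specific library API): scan the file once to record where every line starts, then use seek() to read any line without copying anything.

def build_line_index(path):
    # Record the byte offset at which every line starts (done once).
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, line_no):
    # Jump straight to the requested line instead of scanning the file.
    with open(path, "rb") as f:
        f.seek(offsets[line_no])
        return f.readline().decode()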

Both of these approaches can be "hidden" behind a file-like object in Python. That way you can treat the single file as several "virtual" files, each exposing only the part (or parts) you want.
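One possible shape for such a wrapper, building on the offset index above (class and method names are assumptions, not from the answer):

class VirtualSplit:
    # fold(k) yields the k-th 10% of the lines, rest(k) yields everything else,
    # both read directly from the single original file via the offset index.
    def __init__(self, path, offsets, folds=10):
        self.path = path
        self.offsets = offsets
        self.fold_size = len(offsets) // folds

    def _lines(self, indices):
        with open(self.path, "rb") as f:
            for i in indices:
                f.seek(self.offsets[i])
                yield f.readline().decode()

    def fold(self, k):
        # the held-out 10% slice
        start = k * self.fold_size
        return self._lines(range(start, start + self.fold_size))

    def rest(self, k):
        # everything except the k-th slice
        start = k * self.fold_size
        held_out = range(start, start + self.fold_size)
        return self._lines(i for i in range(len(self.offsets))
                           if i not in held_out)

You would then iterate, for example, over VirtualSplit(path, offsets).rest(3) exactly as you would over a file, without ever writing the 20 split files to disk.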
