Collecting, manipulating and merging a dataset in Pandas / Python
I have a large dataset containing strings. I want to open it with read_fwf, using widths like:
widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)
This lets me tag the data, but the process crashes (it works with nrows=20,000). So I decided to read it in chunks (of, say, 20,000 lines):
cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    ...  # some code using chunk
My question is: what should I do inside the loop to write the chunks back out to a single CSV file after processing each one (marking a row, deleting or modifying a column)? Or is there another way?
I'm going to assume that reading the whole file with
tp = pandas.read_fwf(file, widths=widths, header=None)
fails because the file is too large to be read at once, and that you are hitting a MemoryError.
In this case, if you can process the data in chunks and append the results to a CSV file, you can use chunk.to_csv to write the CSV incrementally:
filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')
Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to one file.
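One caveat with plain mode='a' is that to_csv writes the header (and the index) for every chunk by default, so each appended chunk repeats them. A common pattern is to write the header only for the first chunk. Here is a minimal runnable sketch of the whole loop; the column widths, column names, file names, and sample data are all hypothetical, stand-ins for your real dataset:

```python
import pandas as pd

# Hypothetical layout: three fixed-width columns of widths 3, 5, 4.
widths = [3, 5, 4]
infile = "data.fwf"    # hypothetical input file name
outfile = "processed.csv"
cs = 20000             # chunk size in rows

# Create a tiny fixed-width sample file so the sketch is self-contained.
with open(infile, "w") as f:
    f.write("001alpha12.5\n"
            "002beta 33.0\n"
            "003gamma7.25\n")

for i, chunk in enumerate(pd.read_fwf(infile, widths=widths,
                                      header=None, chunksize=cs)):
    # Example processing: give the columns names.
    chunk.columns = ["id", "name", "value"]
    # Overwrite on the first chunk, append afterwards; write the
    # header only once so it is not repeated for every chunk.
    chunk.to_csv(outfile,
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)
```

Using mode="w" for the first chunk also means re-running the script starts with a fresh output file instead of appending to stale data from a previous run.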