Collecting, manipulating and merging a dataset in Pandas / Python

I have a large dataset containing strings. I just want to read it with read_fwf, using widths like this:

widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)


This would be a convenient way to label the data, but the system crashes on the full file (it works with nrows=20000). So I decided to read it in chunks (of, say, 20,000 lines), like this:

cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    <some code using chunk>


My question is: what should I do inside the loop to merge (concatenate?) the chunks back into a single CSV file after processing each chunk (marking rows, dropping or modifying columns)? Or is there another way?



1 answer

I'm going to assume that, since reading the whole file with

tp = pandas.read_fwf(file, widths=widths, header=None)

fails but reading in chunks works, the file is too large to be read all at once and you are running into a MemoryError.

In that case, if you can process the data in chunks, you can concatenate the results into a CSV by calling chunk.to_csv on each chunk as you go:

filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')


Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to the same file.
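One detail worth watching when appending this way: by default to_csv writes the column labels and the index on every call, so the output would contain a header row per chunk. Below is a minimal, self-contained sketch of one way to handle that. The function name convert_fwf_to_csv, the demo file, its two-column layout, and the "drop rows tagged bad" processing step are all made up for illustration and are not part of the original question.

```python
import os
import tempfile

import pandas as pd


def convert_fwf_to_csv(src, dst, widths, chunksize):
    """Read a fixed-width file in chunks and append each processed chunk to one CSV."""
    # Start from a clean output file: mode="a" would otherwise append
    # to whatever a previous run left behind.
    if os.path.exists(dst):
        os.remove(dst)

    reader = pd.read_fwf(src, widths=widths, header=None, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Hypothetical processing step: drop rows whose first column is "bad".
        chunk = chunk[chunk[0] != "bad"]
        # Write the header only with the first chunk, and skip the index.
        chunk.to_csv(dst, mode="a", header=(i == 0), index=False)


# Build a tiny fixed-width demo file: a 3-character tag and a 4-character number.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "demo.txt")
dst = os.path.join(workdir, "demo.csv")
with open(src, "w") as f:
    for n in range(10):
        tag = "bad" if n % 5 == 0 else "ok "
        f.write(f"{tag}{n:4d}\n")

convert_fwf_to_csv(src, dst, widths=[3, 4], chunksize=3)
result = pd.read_csv(dst)
print(result)
```

Only one chunk is held in memory at a time, so peak memory use depends on chunksize rather than on the size of the input file.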


