Collecting, manipulating and merging a dataset in Pandas / Python

I have a large dataset containing strings. I just want to open it with read_fwf, using widths like this:

widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)


This lets me mark up the data, but the system crashes (it works with nrows=20000). So I decided to read it in chunks (e.g. 20,000 rows each), like this:

cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    <some code using chunk>


My question is: what should I do in the loop to merge (concatenate?) the chunks back into a CSV file after processing each chunk (marking a row, deleting or modifying a column)? Or is there another way?


I'm going to assume that, since reading the whole file at once fails but reading in chunks works, the file is too large to be read in one go and you are running into a MemoryError.

In that case, if you can process the data in chunks, you can use chunk.to_csv to write the results to a single CSV chunk by chunk:

filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')


Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to the same file.
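One thing to watch with the append approach: by default to_csv writes a header row and the row index on every call, so the combined file would get a header line per chunk. Here is a minimal sketch of one way around that, writing the header only for the first chunk. The function name convert_fwf_to_csv, the sample data, the widths [3, 4, 2] and the "drop the last column" step are all made up for illustration; put your own processing there.

```python
import os
import tempfile

import pandas as pd


def convert_fwf_to_csv(src, dst, widths, chunksize=20000):
    """Read a fixed-width file in chunks and write them all to one CSV."""
    reader = pd.read_fwf(src, widths=widths, header=None, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Hypothetical processing step: drop the last column.
        chunk = chunk.drop(columns=chunk.columns[-1])
        chunk.to_csv(
            dst,
            mode="w" if i == 0 else "a",  # overwrite once, then append
            header=(i == 0),              # write the header row only once
            index=False,                  # skip the per-chunk row index
        )


# Tiny demo on a temporary file (made-up data, widths 3, 4 and 2).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.txt")
dst = os.path.join(tmpdir, "data.csv")
with open(src, "w") as f:
    f.write("abc1234xy\ndef5678zz\nghi9012qq\n")

convert_fwf_to_csv(src, dst, widths=[3, 4, 2], chunksize=2)
result = pd.read_csv(dst)
```

Only one chunk is held in memory at a time, so this should stay within the same memory budget as the loop above. Using mode="w" for the first chunk also means rerunning the script does not keep appending to an old output file.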



