Pandas Concatenate multiple CSV and output into one large file

So, currently I have a directory, call it /mydir, which contains 36 CSV files, each of which is 2.1 GB. The files are all the same size, and I want to read them into pandas, concatenate them side by side (so the number of lines stays the same), and then write the resulting frame out as one big CSV. The code I have works for combining several of them, but eventually it fails with a memory error. I was wondering if there is a more efficient way to do this than what I have.

import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv(os.path.join('/mydir', file), dtype='float')], axis=1)
df.to_csv('/mydir/file.csv')
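As an aside, even before the memory question, growing a frame with concat inside the loop copies all the accumulated data on every iteration. A minimal sketch of the list-then-single-concat idiom (same paths as above); note it still holds all 36 files in memory at once, so on its own it will not fix the MemoryError:

import os
import pandas as pd

# read every file first, then concatenate once
frames = [pd.read_csv(os.path.join('/mydir', f), dtype='float')
          for f in sorted(os.listdir('/mydir'))]
pd.concat(frames, axis=1).to_csv('/mydir/file.csv')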


It was suggested that I split the work into smaller pieces: combine the files in groups of 6, then combine those intermediate results one at a time, as sketched below. But I don't know whether that is a valid way to avoid the memory error.
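For what it's worth, here is a rough sketch of what that grouped approach could look like (the group size of 6 and the part_N.csv intermediate names are illustrative, not from the suggestion). Each first-pass group only needs 6 files in memory, but the second pass still has to hold all the columns at once, so it may not actually avoid the MemoryError:

import os
import pandas as pd

# skip intermediates left over from a previous run
files = sorted(f for f in os.listdir('/mydir') if not f.startswith('part_'))

# first pass: concatenate each group of 6 files into an intermediate CSV
for i in range(0, len(files), 6):
    group = [pd.read_csv(os.path.join('/mydir', f), dtype='float')
             for f in files[i:i + 6]]
    pd.concat(group, axis=1).to_csv('/mydir/part_%d.csv' % (i // 6), index=False)

# second pass: combine the intermediates -- this still loads every column
n_parts = (len(files) + 5) // 6
parts = [pd.read_csv('/mydir/part_%d.csv' % j) for j in range(n_parts)]
pd.concat(parts, axis=1).to_csv('/mydir/file.csv', index=False)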

EDIT: directory view:

-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv



2 answers


Chunk Them All!



from glob import glob
import os
import pandas as pd

# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')

# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)

# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)

# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)

# Fun stuff! zip pulls the next chunk from each file's
# generator in lockstep, so we concatenate one set of
# chunks at a time and append it to the output
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
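One caveat worth flagging: zip stops at the shortest iterator, so if the files do not all have the same number of rows, trailing rows are silently dropped, and a leftover out.csv from an earlier run will be appended to rather than replaced. A small sanity check after the run (the count_rows helper is illustrative, not part of the answer above, and it assumes each file has a single header line):

def count_rows(path):
    # count data rows, excluding the header line
    with open(path) as fh:
        return sum(1 for _ in fh) - 1

# every input should have the same row count, and the output should match
assert len({count_rows(f) for f in files}) == 1
assert count_rows('out.csv') == count_rows(files[0])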



Assuming, as in MaxU's answer, that all the files have the same number of lines, and assuming that minor CSV details such as quoting are handled the same way across all the files, you don't need Pandas for this. Plain readline calls on regular file objects give you lines that you can concatenate and write out. Suppose further that you can supply the number of rows. Something like this code:



numrows = 999  # whatever; probably passed as a function argument or on the cmdline
outfile = open('myout.csv', 'w')
# the original list was elided ("file01.csv" .. "file36.csv");
# assuming that naming pattern, build it programmatically
infile_names = ['file%02d.csv' % i for i in range(1, 37)]

# open all the input files
infiles = []
for fname in infile_names:
    infiles.append(open(fname))

for i in range(numrows):
    # read a line from each input file and add it to the output string
    out_csv = ''
    for infile2read in infiles:
        out_csv += infile2read.readline().strip() + ','
    # strings are immutable, so rebuild instead of assigning to out_csv[-1]
    out_csv = out_csv[:-1] + '\n'  # replace the final comma with a newline

    # write this row's data out to the output file
    outfile.write(out_csv)

# close the files
for f in infiles:
    f.close()
outfile.close()
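A variant of the same idea, for what it's worth: file objects are themselves line iterators, so zip over them walks all the inputs in lockstep and stops at the shortest file, which removes the need to know numrows up front. A sketch under the same assumptions as above:

from contextlib import ExitStack

# ExitStack closes every opened file automatically, even on error
with ExitStack() as stack:
    infiles = [stack.enter_context(open(f)) for f in infile_names]
    with open('myout.csv', 'w') as outfile:
        # zip advances every input file one line at a time
        for rows in zip(*infiles):
            outfile.write(','.join(line.strip() for line in rows) + '\n')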
