Python - reduce import and parsing time for large CSV files

My first post:

Before I start, I should point out that I am relatively new to OOP, although I have worked with databases and statistics in SAS, R, etc., so my question may not be phrased quite right: please let me know if anything needs clarifying.

My question is:

I am trying to import and parse large CSV files (~6 million lines, with more likely to come). The two limitations I ran into were runtime and memory (32-bit Python implementation). Below is a simplified version of my nth newbie attempt to import and parse in a reasonable amount of time. How can this process be sped up? Because of the memory constraints, I split the file as I import it and compute intermediate summaries, using pandas for the summarizing:

Parsing and summing:

import pandas as pd

def ParseInts(inString):
    # Convert to int; return None for blanks or non-numeric values
    try:
        return int(inString)
    except ValueError:
        return None

def TextToYearMo(inString):
    # Convert a 'YYYY-MM-DD'-style string to a yyyymm integer,
    # falling back to single-digit months ('YYYY-M-DD')
    try:
        return 100*int(inString[0:4]) + int(inString[5:7])
    except ValueError:
        return 100*int(inString[0:4]) + int(inString[5:6])

def ParseAllElements(elmValue, elmPos):
    # Parse one field based on its original column position
    if elmPos in [0, 2, 5]:
        return elmValue                          # keep as text
    elif elmPos == 3:
        return TextToYearMo(elmValue)            # date -> yyyymm
    elif elmPos == 18:
        return ParseInts(elmValue.strip('\n'))   # last column carries the newline
    else:
        return ParseInts(elmValue)

def MakeAndSumList(inList):
    # Build a DataFrame from the parsed rows, then collapse it to a
    # partial summary: sum the numeric columns by the first five keys
    df = pd.DataFrame(inList, columns = ['x1','x2','x3','x4','x5',
                                         'x6','x7','x8','x9','x10',
                                         'x11','x12','x13','x14'])
    return df.groupby(['x1','x2','x3','x4','x5']).sum().reset_index()
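
For illustration, assuming the date column holds values like 2014-07-15 (or 2014-7-15 with a single-digit month), the parsers behave as follows:

>>> TextToYearMo('2014-07-15')
201407
>>> TextToYearMo('2014-7-15')
201407
>>> ParseInts('42'), ParseInts('')
(42, None)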

      

Function calls:

import math
import pandas as pd
# pse is the module that holds the parsing/summing functions shown above

def ParsedSummary(longString, delimtr, rowNum):
    keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]

    #Do some other stuff that takes very little time

    # Note: longString.split(delimtr) is re-evaluated for every kept column
    return [pse.ParseAllElements(longString.split(delimtr)[i], i) for i in keepColumns]

def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        # Materialize the file as a set of (row number, raw line) pairs
        listEnumFile = set(enumerate(f))
        lineCount = len(listEnumFile)

        # Work through the file in ten slices to stay within memory limits
        maxSplit = math.floor(lineCount / 10) + 1

        Summary = pd.DataFrame({}, columns = ['x1','x2','x3','x4','x5',
                                              'x6','x7','x8','x9','x10',
                                              'x11','x12','x13','x14'])
        for counter in range(0, 10):
            startRow     = int(counter * maxSplit)
            endRow       = int((counter + 1) * maxSplit)
            includedRows = set(range(startRow, endRow))

            # Parse only the rows in the current slice, then fold their
            # partial summary into the running total
            listOfRows = [ParsedSummary(row, delimtr, rownum)
                            for rownum, row in listEnumFile if rownum in includedRows]
            Summary = pd.concat([Summary, pse.MakeAndSumList(listOfRows)])
    return Summary
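
For completeness, the whole process is kicked off with a single call (the file name here is just a placeholder):

Summary = CSVToList('my_big_file.csv', delimtr=',')
print(Summary.head())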

      

(Again, this is my first question, so I apologize if I've simplified too much or, more likely, too little; I just don't see how to speed this up.)

To compare run times:

Using Access, I can import, parse, summarize, and merge multiple files in this size range in under 5 minutes (though I'm right up against the 2 GB limit). I am hoping for comparable results in Python; I currently estimate ~30 minutes of runtime per file. Note: I only threw something together in Access because I don't have the admin rights to install anything else.

Edit: Updated the code above. I was able to shave off five minutes (estimated runtime now ~25 minutes) by converting some conditional logic to try/except. Also, the runtime estimate does not include the pandas portion; I forgot I had commented it out while testing, but its impact appears to be negligible.
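
To illustrate that change (the pre-edit conditional version below is a rough reconstruction, not the exact original code):

# Before: test the string before converting
def ParseIntsChecked(inString):
    if inString.strip().isdigit():
        return int(inString)
    return None

# After: just attempt the conversion and catch the failure, which avoids
# scanning the string twice in the common (valid) case
def ParseInts(inString):
    try:
        return int(inString)
    except ValueError:
        return None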



1 answer


If you want to optimize for performance, don't roll your own CSV reader in Python. A standard csv module already exists. Possibly pandas or numpy have faster CSV-reading routines; I'm not sure.

From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file :



In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow, and NumPy's from_file and load are impressively fast.
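
Building on that advice, here is a minimal sketch of what a chunked pandas.read_csv version of the question's pipeline could look like (the function name SummarizeCSV, the chunk size, and the column handling are illustrative assumptions, not tested against the actual files):

import pandas as pd

keepColumns = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
colNames    = ['x1','x2','x3','x4','x5','x6','x7',
               'x8','x9','x10','x11','x12','x13','x14']

def SummarizeCSV(fileName, chunkRows=500000):
    partials = []
    # Let pandas' C parser handle reading and splitting; chunksize keeps
    # memory bounded on a 32-bit interpreter
    for chunk in pd.read_csv(fileName, header=None, usecols=keepColumns,
                             chunksize=chunkRows):
        chunk = chunk[keepColumns]        # put columns in the question's order
        chunk.columns = colNames
        # (the date -> yyyymm conversion from TextToYearMo is omitted here;
        #  x2 is grouped on the raw date string instead)
        partials.append(chunk.groupby(['x1','x2','x3','x4','x5'])
                             .sum().reset_index())
    # Collapse the per-chunk summaries into the final result
    return (pd.concat(partials)
              .groupby(['x1','x2','x3','x4','x5'])
              .sum().reset_index())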
