Optimization tips for reading / parsing large numbers of JSON.gz files

I have an interesting problem. As someone who's new to working with data at even a modest scale, I'd love some advice from the veterans here.

I have 6000 JSON.gz files totaling about 5 GB compressed and 20 GB uncompressed. I open each file with the gzip module and read it line by line; then, using json.loads(), I parse each line's complex JSON structure. I insert all the lines from a file into PyTables in one go before moving on to the next file.

All of this takes about 3 hours. Bulk inserting into PyTables didn't really help the speed at all. Much of the time is spent extracting values from the parsed JSON, because the structure is really awkward. Some attributes are as simple as 'attrname': attrvalue, but some are complex and time consuming, for example:

'attrarray': [{'name': abc, 'value': 12}, {'value': 12}, {'name': xyz, 'value': 12}, ...]

... where I need to collect the value from every object in attrarray that has one of the relevant names, and ignore those that don't. So I have to iterate over the list and check each JSON object inside. (I'd be glad if you could point out a faster, smarter way if one exists.)

So, I suppose the actual parsing part doesn't have much room for speedup. Where I think there might be room for acceleration is the part that actually reads the files.

So I ran some tests (I don't have the numbers with me right now), and even after removing most of the parsing from my program, simply iterating over the files line by line took a significant amount of time on its own.
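For what it's worth, the stripped-down read-only test looked roughly like the sketch below (a reconstruction from memory; the glob pattern is just a placeholder for wherever the files live):

import glob
import gzip
import time

filenamelist = glob.glob('data/*.json.gz')   # placeholder location for the 6000 files

start = time.perf_counter()
linecount = 0
for filename in filenamelist:
    with gzip.open(filename, 'rt') as f:
        for line in f:            # decompress and iterate only; no json.loads, no PyTables
            linecount += 1
print(linecount, 'lines read in', time.perf_counter() - start, 'seconds')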

So my question is: is there some part of this process that you think I'm doing suboptimally?

import gzip
import json

for filename in filenamelist:              # list of .json.gz paths
    toInsert = []
    with gzip.open(filename, 'rt') as f:   # 'rt' so each line is text for json.loads
        for line in f:
            parsedline = json.loads(line)
            attr1 = parsedline['attr1']
            attr2 = parsedline['attr2']
            # ...
            attr10 = parsedline['attr10']
            arr = parsedline['attrarray']
            for el in arr:
                try:
                    if el['name'] == 'abc':
                        attrABC = el['value']
                    elif el['name'] == 'xyz':
                        attrXYZ = el['value']
                    # ...
                except KeyError:
                    pass
            toInsert.append([attr1, attr2, ..., attr10, attrABC, attrXYZ, ...])

    table.append(toInsert)                 # PyTables table, created elsewhere

      


1 answer


One clear piece of low-hanging fruit

If you will be accessing the same compressed files over and over again (it's not entirely clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on the fly every time you read them.

Decompression is CPU-intensive, and Python's gzip module is not as fast as C utilities like zcat / gunzip.

Probably the fastest approach is to gunzip all of these files, save the results somewhere, and then read the uncompressed files in your script.
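For example, here is a rough sketch of a one-time decompression pass that shells out to the C tool (the directory names are just placeholders); afterwards your loop can read the plain .json files with an ordinary open():

import subprocess
from pathlib import Path

src_dir = Path('compressed')     # placeholder: where the .json.gz files live
dst_dir = Path('uncompressed')   # placeholder: where the unpacked copies go
dst_dir.mkdir(exist_ok=True)

for gz in src_dir.glob('*.json.gz'):
    out = dst_dir / gz.stem      # 'foo.json.gz' -> 'foo.json'
    with open(out, 'wb') as sink:
        # gunzip -c streams the decompressed bytes to stdout; the C tool does the CPU work
        subprocess.run(['gunzip', '-c', str(gz)], stdout=sink, check=True)

If disk space is a concern, you could instead pipe zcat's stdout straight into your script with subprocess.Popen and read lines from the pipe, which still moves the decompression work out of Python's gzip module.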



Other problems

The rest of this is not really an answer, but it's too long for a comment. To make this faster, you need to think about a few other questions:

  • What are you trying to accomplish with all this data?
  • Do you really need to load all of it at once?
    • If you can segment the data into smaller chunks, you can reduce the program's latency, if not the total time. For example, you may know that you only need a few specific lines from certain files for whatever analysis you're doing... great! Load only those specific lines.
    • If you need to access the data in arbitrary and unpredictable ways, you should load it into another system (an RDBMS?) that stores it in a format more amenable to the kinds of analyses you're doing.

If that last bullet applies, one option is to load each JSON "document" into a PostgreSQL 9.3 database (its JSON support is excellent and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as they are loaded.
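As a sketch of what that could look like with psycopg2 (the table, column, file, and connection details here are all made up, and this assumes a json column on 9.3+):

import gzip
import psycopg2

conn = psycopg2.connect('dbname=mydb')   # placeholder connection string
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, doc json)')

with gzip.open('somefile.json.gz', 'rt') as f:   # placeholder file name
    for line in f:
        # Store the whole document; keys can be pulled out later in SQL, e.g. doc->>'attr1'
        cur.execute('INSERT INTO docs (doc) VALUES (%s::json)', (line,))

conn.commit()
cur.close()
conn.close()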
