Optimization tips for reading / parsing large numbers of JSON.gz files

I have an interesting problem. As someone who's new to working with data at even a modest scale, I'd love some advice from the veterans here.

I have 6000 JSON.gz files totaling about 5 GB compressed and 20 GB uncompressed. I open each file with the gzip module and read it line by line; then, using json.loads(), I parse each line's complex JSON structure. I insert all the lines from a file into PyTables in one go before moving on to the next file.

All of this takes about 3 hours. Bulk inserting into PyTables didn't really help the speed at all. Much of the time is spent extracting values from the parsed JSON, because the structure is really awkward. Some attributes are as simple as 'attrname': attrvalue, but some are complex and time consuming, for example:

'attrarray': [{'name': abc, 'value': 12}, {'value': 12}, {'name': xyz, 'value': 12}, ...]

... where I need to collect the value from every object in attrarray that has one of the relevant names, and ignore those that don't. So I have to iterate over the list and check each JSON object inside. (I'd be glad if you could point out a faster, smarter way if one exists.)

So, I suppose the actual parsing part doesn't have much room for speedup. Where I think there might be room for acceleration is the part that actually reads the files.

So I ran some tests (I don't have the numbers with me right now), and even after removing most of the parsing from my program, simply iterating over the files line by line took a significant amount of time on its own.
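For what it's worth, the stripped-down read-only test looked roughly like the sketch below (a reconstruction from memory; the glob pattern is just a placeholder for wherever the files live):

import glob
import gzip
import time

filenamelist = glob.glob('data/*.json.gz')   # placeholder location for the 6000 files

start = time.perf_counter()
linecount = 0
for filename in filenamelist:
    with gzip.open(filename, 'rt') as f:
        for line in f:            # decompress and iterate only; no json.loads, no PyTables
            linecount += 1
print(linecount, 'lines read in', time.perf_counter() - start, 'seconds')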

So my question is: is there some part of this process that you think I'm doing suboptimally?

import gzip
import json

for filename in filenamelist:              # list of .json.gz paths
    toInsert = []
    with gzip.open(filename, 'rt') as f:   # 'rt' so each line is text for json.loads
        for line in f:
            parsedline = json.loads(line)
            attr1 = parsedline['attr1']
            attr2 = parsedline['attr2']
            # ...
            attr10 = parsedline['attr10']
            arr = parsedline['attrarray']
            for el in arr:
                try:
                    if el['name'] == 'abc':
                        attrABC = el['value']
                    elif el['name'] == 'xyz':
                        attrXYZ = el['value']
                    # ...
                except KeyError:
                    pass
            toInsert.append([attr1, attr2, ..., attr10, attrABC, attrXYZ, ...])

    table.append(toInsert)                 # PyTables table, created elsewhere

      


1 answer


One clear piece of low-hanging fruit

If you will be accessing the same compressed files over and over again (it's not entirely clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on the fly every time you read them.

Decompression is CPU-intensive, and Python's gzip module is not as fast as C utilities like zcat / gunzip.

Probably the fastest approach is to gunzip all of these files, save the results somewhere, and then read the uncompressed files in your script.
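For example, here is a rough sketch of a one-time decompression pass that shells out to the C tool (the directory names are just placeholders); afterwards your loop can read the plain .json files with an ordinary open():

import subprocess
from pathlib import Path

src_dir = Path('compressed')     # placeholder: where the .json.gz files live
dst_dir = Path('uncompressed')   # placeholder: where the unpacked copies go
dst_dir.mkdir(exist_ok=True)

for gz in src_dir.glob('*.json.gz'):
    out = dst_dir / gz.stem      # 'foo.json.gz' -> 'foo.json'
    with open(out, 'wb') as sink:
        # gunzip -c streams the decompressed bytes to stdout; the C tool does the CPU work
        subprocess.run(['gunzip', '-c', str(gz)], stdout=sink, check=True)

If disk space is a concern, you could instead pipe zcat's stdout straight into your script with subprocess.Popen and read lines from the pipe, which still moves the decompression work out of Python's gzip module.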



Other problems

The rest of this is not really an answer, but it's too long for a comment. To make this faster, you need to think about a few other questions:

  • What are you trying to accomplish with all this data?
  • Do you really need to load all of it at once?
    • If you can segment the data into smaller chunks, you can reduce the program's latency, if not the total time. For example, you may know that you only need a few specific lines from certain files for whatever analysis you're doing... great! Load only those specific lines.
    • If you need to access the data in arbitrary and unpredictable ways, you should load it into another system (an RDBMS?) that stores it in a format more amenable to the kinds of analyses you're doing.

If that last bullet applies, one option is to load each JSON "document" into a PostgreSQL 9.3 database (its JSON support is excellent and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as they are loaded.
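As a sketch of what that could look like with psycopg2 (the table, column, file, and connection details here are all made up, and this assumes a json column on 9.3+):

import gzip
import psycopg2

conn = psycopg2.connect('dbname=mydb')   # placeholder connection string
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, doc json)')

with gzip.open('somefile.json.gz', 'rt') as f:   # placeholder file name
    for line in f:
        # Store the whole document; keys can be pulled out later in SQL, e.g. doc->>'attr1'
        cur.execute('INSERT INTO docs (doc) VALUES (%s::json)', (line,))

conn.commit()
cur.close()
conn.close()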
