Improve NDB Query Performance

Question

I'm looking for advice on how I can improve this in terms of speed:

My data model:

from google.appengine.ext import ndb

class Events(ndb.Model):
    eventid = ndb.StringProperty(required=True)
    participants = ndb.StringProperty(repeated=True)

How I am trying to get the data:

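def GetEventDataNotCached(eventslist):
    # Start one keys-only query per event id so the RPCs run in parallel.
    futures = []
    for eventid in eventslist:
        if eventid is not None:
            ke = database.Events.query(database.Events.eventid == eventid)
            futures.append(ke.get_async(keys_only=True))

    # get_result() returns None when no entity matches; skip those ids.
    eventskeys = []
    for future in futures:
        eventkey = future.get_result()
        if eventkey is not None:
            eventskeys.append(eventkey)

    # Fetch all matching entities in one batch.
    data = ndb.get_multi(eventskeys)
    return data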

So I fetch the keys asynchronously and pass them to get_multi. Is there any way to make this faster? I'm still not happy with the performance.

Each repeated property can hold up to several hundred entries. There are several tens of thousands of entities in the Events model. The event list I want to fetch contains only a few dozen events.

optimization python google-app-engine app-engine-ndb

Sebastian Küpers 13 Feb 13 at 19:48

2 answers

JasonC · Answer 1 · 2013-02-15T17:37:44+0000

I have found that the deserialization overhead of protocol buffers containing long lists (i.e. large repeated=True properties) is very high.

Have you looked at this in Appstats? Do you see a large white gap with no RPC activity after your get_multi()? That is the deserialization overhead.

The only way I've been able to overcome this is to remove the long lists and manage them in a separate model (i.e. avoid long repeated property lists altogether), but of course this might not be possible for your use case.

So the big question is: do you really need all the participants when you get the list of events, or can you defer that lookup in some way? For example, it might be cheaper and faster to get all the events synchronously, then kick off asynchronous fetches for the participants of each event (from a different model) and combine the results in memory. Perhaps you only need the 25 most recently registered participants or something similar, in which case you can limit the cost of your sub-queries.
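A rough sketch of that idea follows. It assumes a hypothetical Participant model (not part of the question) that stores one entity per participant with a key reference back to its event and a registration timestamp. The model class is passed in as a parameter so the function itself does not need the App Engine SDK to be importable; the names fetch_recent_participants, event and registered are illustrative assumptions.

```python
# Hypothetical model (an assumption, not from the question), shown as a
# comment so the sketch does not require the App Engine SDK to import:
#
#   class Participant(ndb.Model):
#       event = ndb.KeyProperty(kind='Events')
#       name = ndb.StringProperty()
#       registered = ndb.DateTimeProperty(auto_now_add=True)


def fetch_recent_participants(participant_model, event_keys, limit=25):
    """Return {event_key: [participant, ...]} with at most `limit` each."""
    # Start one asynchronous query per event so the RPCs overlap.
    pending = []
    for key in event_keys:
        query = participant_model.query(participant_model.event == key)
        query = query.order(-participant_model.registered)
        pending.append((key, query.fetch_async(limit)))

    # Collect the results only after every query has been started.
    return dict((key, future.get_result()) for key, future in pending)
```

With the real model this would be called as fetch_recent_participants(Participant, [e.key for e in events]). A query that combines an equality filter with a sort order on another property would normally also need a composite index in index.yaml.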

tesdal · Answer 2 · 2013-02-14T00:02:23+0000

An improvement in simplicity and speed, but not in cost:

data = database.Events.query(database.Events.eventid.IN(eventslist)).fetch(100)

The next step is to use eventid as the id in the key, with entities created as

event = database.Events(id=eventid, ...)

in which case you can fetch them with

data = ndb.get_multi([ndb.Key(database.Events, eventid) for eventid in eventslist])

which is faster and len(eventslist) * 6 times cheaper.
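As a minimal sketch of the key-based lookup, assuming the entities were created with eventid as the key id: the helper below builds the keys, fetches them in one batch, and drops ids that have no entity, since get_multi returns None in those positions. The function name get_events_by_id is hypothetical, and the ndb module and model class are passed in as arguments only so that the sketch can run without the App Engine SDK installed.

```python
def get_events_by_id(ndb, model, event_ids):
    """Return {eventid: entity} for the ids that exist in the datastore."""
    # Build keys directly from the ids; no query is needed for this step.
    keys = [ndb.Key(model, eid) for eid in event_ids if eid is not None]
    # get_multi returns results in key order, with None for missing entities.
    entities = ndb.get_multi(keys)
    return dict((key.id(), entity)
                for key, entity in zip(keys, entities)
                if entity is not None)
```

In the question's setup this would be called as get_events_by_id(ndb, database.Events, eventslist) after from google.appengine.ext import ndb.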
