Efficient search using the ancestor-datastore path
We have a mulch lease and I need to search and retrieve in a huge appengine datastore based on an indexed range of attributes and a customer id. Does using Ancestor Paths take advantage of its effectiveness? Alternatively The same can be done using an additional filter
eg. get the top 100 salaries through objectification
Key<Clients> clientIdKey = Key.create(Clients.class, 500)
ofy().load().type(Salaries.class).ancestor(clientIdKey).order("-salary").limit(100).list()
Alternatively just
ofy().load().type(Salaries.class).filter("clientId = ", 500 ).order("-salary").limit(100).list()
My guess is that in the former case, all objects owned by any other client will be ignored, but in the later case, it will be a full scan, which will be more expensive. Is this assumption correct?
Also, is the salary index stored globally or is it split according to the ancestor so that the index update only happens inside the same ancestor? This will cut down on the time it takes to update the index and is a good solution when we never query different clients.
source to share
The first thing I need to point out is that the datastore is not doing table scans. With a few exceptions (most notably zigzag merges), GAE queries are tracked only on indexes - so the questions usually boil down to "which index is more efficient to maintain?"
Let's start by discussing the second case (note that I have highlighted Salary, which I assume is your intention):
ofy().load().type(Salary.class).filter("clientId = ", 500 ).order("-salary").limit(100).list()
This requires an index with multiple properties on Salary { clientId, salary } DESC
. GAE will traverse the index to the beginning Salary/clientId/500
and then read each index entry one at a time. It will do this on an index table in an arbitrary datacenter - and since those index tables are replicated asynchronously, you will get the final result.
For a subject to participate in the ambiguity index, each of the individual individual properties must be indexed. If Salary had no other indexed properties, recording one salary would be worth:
- 1 entry for entity operation
- 2 writes for index
clientId
(asc and desc) - 2 writes for index
salary
(asc and desc) - 1 write for index with multiple properties
{ clientId, salary } DESC
Now let's look at the first case:
ofy().load().type(Salary.class).ancestor(clientIdKey).order("-salary").limit(100).list()
Your datastore-indexes.xml requires a different index with multiple properties. This time, you need an index on Salary { ancestor, salary } DESC
. In addition, GAE's default behavior is to read from datacenter quorum to make it a strong sequential operation. This should be somewhat slower (albeit not more expensive) than the other method, however you can explicitly specify the final consistency to get the same "any datacenter" behavior: ofy().consistency(Consistency.EVENTUAL).load()...
It's nice that you have the option of strong consistency.
Another bonus of the ancestor approach is that you don't need to maintain a single property index on clientId
. Here's what happens when you write this salary (aside from other indexed fields):
- 1 entry for entity operation
- 2 writes for index
salary
(asc and desc) - 1 write for index with multiple properties
{ ancestor, salary } DESC
This can make your system significantly cheaper. The largest cost of indexes with multiple properties is often the cost of all (otherwise irrelevant) single-property bidirectional indexes, which you should support simply as a flag to GAE.
Regarding your last question, it might help explain what the GAE index tables look like. There are three BigTables for indexes that are used in all applications. The first two are index tables with one property (one for upstream, one for downstream). Their content looks something like this:
{appId}/{entityKind}/{propertyName}/{propertyValue}/{entityKey}
By performing a range scan (one of the primitive BigTable operations), you can determine which entities match your query. This is also because keyword queries are fast / cheap; you can immediately return the key without further searching.
An index table with multiple properties looks like (again, not accurate):
{appId}/{entityKind}/{prop1name}/{prop1value}/{prop2name}/{prop2value}/.../{entityKey}\
It might be easier to render with some values ββfor an index with multiple properties on Salary { clientId, salary } DESC
:
yourapp/Salary/clientId/500/salary/99000/aghzfnZvb3N0MHILCxIFRXZlbnQYAQw yourapp/Salary/clientId/500/salary/98000/aghttydiisgAJJ3JGS0ij44JFAjasdw
Again, you can see how by performing a range scan, GAE can find objects that match your queries.
I hope this helps to figure it out.
source to share