AWS SDK CloudSearch pagination
I am using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done using either the cursor or the start parameter. But when you have more than 10,000 hits, you cannot use start.
When using start, I can specify ['start' => 1000, 'size' => 100] to get to page 10.
How do I get to the 1000th page (or any other arbitrary page) using cursor? Is there perhaps a way to calculate this parameter?
I would LOVE for there to be a better way, but here goes ...
One thing I found with cursors is that they return the same value for repeated searches over the same dataset, so don't think of them as sessions. As long as your data isn't updated, you can effectively cache aspects of your pagination for use by multiple users.
I came up with this solution and tested it with 75,000+ entries.
1) Determine whether your start offset will be below the 10K limit; if so, use a search without a cursor. Otherwise, when searching past 10K, first search with cursor set to initial, size set to 10000, and return set to _no_fields. This gives us a starting offset, and returning no fields reduces the amount of data we have to consume; we don't need those IDs anyway.
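To make step 1 concrete, the initial seek request might look roughly like this. This is only a sketch: the query key and the client wiring are assumptions, since the answer below uses a custom client.
// Sketch: the initial seek request when the target offset is past the 10K limit.
// $cloudSearchClient is assumed to expose the same Search() call used in the full example below.
$seekRequest = [
    'q'      => 'your search terms', // assumed query key; match whatever your client expects
    'cursor' => 'initial',           // start a cursor-based result set
    'size'   => 10000,               // largest allowed block for the first hop
    'return' => '_no_fields',        // skip document fields; we only need the cursor
];
$seekResult = $cloudSearchClient->Search($seekRequest);
$nextCursor = $seekResult->get('hits')['cursor']; // cursor now positioned 10K into the results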
2) Figure out your target offset and work out how many iterations it will take to position the cursor just in front of your target page of results. I then iterate and cache the results, using my query as the cache hash.
For my iteration, I started with 10K blocks and then scaled down to 5K and then 1K blocks as I started to "get closer" to the target offset; this means that subsequent pagination uses a previous cursor that is a little closer to the final chunk.
For example, this is what it might look like:
- Fetch 10,000 records (start cursor)
- Fetch 5,000 records
- Fetch 5,000 records
- Fetch 5,000 records
- Fetch 5,000 records
- Fetch 1,000 records
- Fetch 1,000 records
This gets me to the block containing the 32,000 offset. If I then need to get to 33,000, I can use my cached results to obtain the cursor that returned the previous 1,000 and start again from that offset (a sketch of this chunk-size schedule follows the list below) ...
- Fetch 10,000 records (cached)
- Fetch 5,000 records (cached)
- Fetch 5,000 records (cached)
- Fetch 5,000 records (cached)
- Fetch 5,000 records (cached)
- Fetch 1,000 records (cached)
- Fetch 1,000 records (cached)
- Fetch 1,000 records (using the cursor from the cache)
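To illustrate the chunk-size schedule described above, here is a rough helper of my own (not part of the original answer) that computes the sequence of fetch sizes needed to reach a given target offset, using the same 10K / 5K / 1K step-down as the full example further down.
// Sketch: compute the seek block sizes needed to position a cursor at $target.
// Mirrors the step-down logic in the full example below (10K, then 5K, then 1K or less).
function seekBlockSizes($target)
{
    $sizes  = [];
    $offset = 0;
    $amount = 10000; // first hop is always 10K, since start-based paging covers anything below that
    while ($amount > 0) {
        $sizes[] = $amount;
        $offset += $amount;
        if ($offset >= $target) {
            $amount = 0;                                   // reached the target offset
        } elseif ($amount >= 10000 && ($target - $offset) > 5000) {
            $amount = 5000;                                // drop from 10K to 5K blocks
        } elseif (($offset + $amount) > $target) {
            $amount = $target - $offset;                   // last partial hop
            if ($amount > 5000) {
                $amount = 5000;
            } elseif ($amount > 1000) {
                $amount = 1000;
            }
        }
    }
    return $sizes;
}
// e.g. seekBlockSizes(32000) => [10000, 5000, 5000, 5000, 5000, 1000, 1000]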
3) Now that you are in the "neighborhood" of your target result, you can start specifying page sizes just ahead of your destination, and then perform the final search to get the actual page of results.
4) If you add or remove documents from your index, you will need a mechanism to invalidate your previous cached results. I did this by storing the timestamp of when the index was last updated and using that as part of the cache key generation procedure.
The caching aspect is important: you should build a caching mechanism that uses the request array as the cache key, so that the key can be easily created and referenced.
For an unseeded cache, this approach is SLOW, but if you can warm up the cache and only expire it when there are changes to the indexed documents (and then warm it up again), your users won't be able to tell.
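As a rough illustration of this, the getCache() / setCache() helpers used in the code below could be backed by something as simple as APCu, with the key built from the full request array plus the timestamp of the last index update. The helper and function names here (including getLastIndexUpdateTimestamp) are assumptions, not part of the original answer.
// Sketch: cache helpers keyed on the request array plus the last index-update time.
// getLastIndexUpdateTimestamp() is assumed to return when documents were last pushed to CloudSearch.
function cacheKey(array $request)
{
    // Any change to the indexed documents changes the key, invalidating old entries.
    return 'cs_' . md5(json_encode($request) . '|' . getLastIndexUpdateTimestamp());
}
function getCache(array $request)
{
    $value = apcu_fetch(cacheKey($request), $success);
    return $success ? $value : null;
}
function setCache(array $request, $result)
{
    apcu_store(cacheKey($request), $result, 3600); // expire after an hour as a safety net
}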
This code idea works with 20 items per page; I would love to work on it and see how I could make it smarter / more efficient, but the concept is there ...
// Build $request here and set $request['start'] to the offset you want to reach
// Craft getCache() and setCache() functions or methods for cache handling.
// Have $cloudSearchClient as your client
if(isset($request['start']) === true and $request['start'] >= 10000)
{
    $originalRequest = $request;
    $cursorSeekTarget = $request['start'];
    $cursorSeekAmount = 10000; // the first block should be 10K, since cursor seeking is not needed below this
    $cursorSeekOffset = 0;
    $request['return'] = '_no_fields';
    $request['cursor'] = 'initial';
    unset($request['start'], $request['facet']);

    // While there is outstanding work to be done...
    while( $cursorSeekAmount > 0 )
    {
        $request['size'] = $cursorSeekAmount;

        // first hit the local cache
        if(empty($result = getCache($request)) === true)
        {
            $result = $cloudSearchClient->Search($request);
            // store the results in the cache
            setCache($request, $result);
        }
        if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false)
        {
            // prepare the next request with the cursor
            $request['cursor'] = $hits['cursor'];
        }

        $cursorSeekOffset = $cursorSeekOffset + $request['size'];
        if($cursorSeekOffset >= $cursorSeekTarget)
        {
            $cursorSeekAmount = 0; // Finished, no more work
        }
        // the first request needs to get 10K, but after that only get 5K
        elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
        {
            $cursorSeekAmount = 5000;
        }
        elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
        {
            $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;

            // if we still need to seek more than 5K records, limit it back to 5K
            if($cursorSeekAmount > 5000)
            {
                $cursorSeekAmount = 5000;
            }
            // if we still need to seek more than 1K records, limit it back to 1K
            elseif($cursorSeekAmount > 1000)
            {
                $cursorSeekAmount = 1000;
            }
        }
    }

    // Restore aspects of the original request (the actual 20 items)
    $request['size'] = 20;
    if(isset($originalRequest['facet']) === true)
    {
        $request['facet'] = $originalRequest['facet'];
    }
    unset($request['return']); // get the default return fields
    if(empty($result = getCache($request)) === true)
    {
        $result = $cloudSearchClient->Search($request);
        setCache($request, $result);
    }
}
else
{
    // No cursor required
    $result = $cloudSearchClient->Search( $request );
}
Note that this was done using a custom AWS client, not an official SDK class, but the query and search structures should be consistent.
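For reference, with the official AWS SDK for PHP the equivalent call would go through CloudSearchDomainClient; roughly like the sketch below, where the endpoint and query values are placeholders and the query-parser choice is an assumption.
use Aws\CloudSearchDomain\CloudSearchDomainClient;

// Sketch using the official SDK (v3); the endpoint is your domain's search endpoint.
$client = new CloudSearchDomainClient([
    'endpoint' => 'https://search-YOUR-DOMAIN.us-east-1.cloudsearch.amazonaws.com',
    'region'   => 'us-east-1',
    'version'  => '2013-01-01',
]);

$result = $client->search([
    'query'       => 'matchall',     // structured-parser query matching all documents
    'queryParser' => 'structured',
    'cursor'      => 'initial',
    'size'        => 10000,
    'return'      => '_no_fields',
]);

$hits   = $result['hits'];
$cursor = $hits['cursor']; // feed this into the next search() call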