Elasticsearch crawl and scroll - add to new index
Elasticsearch and newbie command-line programming.
I have Elasticsearch installed locally on my machine and want to pull documents from a server running a different ES version using the scan and scroll API, then add them to my index. I am having trouble figuring out how to string these API calls together.
Right now in the testing phase, I just pull a few documents from the server using the following code (which works):
http MY-OLD-ES.com:9200/INDEX/TYPE/_search?size=1000 | jq -c '.hits.hits[]' | while read x; do
    id=$(echo "$x" | jq -r ._id)
    index=$(echo "$x" | jq -r ._index)
    type=$(echo "$x" | jq -r ._type)
    doc=$(echo "$x" | jq ._source)
    http put "localhost:9200/junk-$index/$type/$id" <<<"$doc"
done
Any clues on how scanning and scrolling works? (Noob here, and a little confused.) So far, I know that I can scroll and get a scroll ID, but I don't know what to do with the scroll ID. If I call
http get 'http://MY-OLD-ES.com:9200/my_index/_search?scroll=1m&search_type=scan&size=10'
(quoted so the shell doesn't interpret the ? and &), I get the scroll ID back. Can its results be submitted and processed the same way? Also, I believe I would need a while loop to keep asking for the next batch. How am I supposed to do this?
Thanks!
The scan and scroll documentation explains this pretty clearly. After you receive the scroll_id (a long base64-encoded string), you pass it in the request body. With curl, the request will look something like this:
curl -XGET 'http://MY-OLD-ES.com:9200/_search/scroll?scroll=1m' -d '
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0
NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy
UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW
xfaGl0czoxOw==
'
Note that while the first request, which opened the scroll, went to /my_index/_search, subsequent requests that read data go to /_search/scroll. Every time you call it with the ?scroll=1m query string, the timeout before the scroll is automatically closed is reset.
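For example, the scroll ID can be pulled out of the response with jq, just as the one-liner in the question extracts fields from hits (the response body below is a made-up sample, not output from a real cluster):

```shell
# Canned example of the kind of response the open-scroll request returns
# (sample data only; a real _scroll_id is much longer).
response='{"_scroll_id":"c2Nhbjs1OzE7dG90YWxfaGl0czoxOw==","hits":{"total":5,"hits":[]}}'

# Extract the scroll ID so it can be sent as the body of the next request.
scroll_id=$(printf '%s' "$response" | jq -r ._scroll_id)
echo "$scroll_id"    # prints c2Nhbjs1OzE7dG90YWxfaGl0czoxOw==
```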
There are two more things to be aware of:

- The size passed when opening the scroll is applied to each shard, so each query returns up to size multiplied by the number of shards in your index.
- Each request to /_search/scroll returns a new scroll_id, which you must pass on the next call to get the next batch of results. You can't just keep calling with the same scroll_id.
The scroll is complete when a scroll request returns no hits.
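Putting it all together, the whole copy could be sketched as the loop below. This is only a sketch under the question's assumptions: the old cluster address and the junk- prefix come from the question, curl and jq must be installed, and error handling is omitted.

```shell
# Sketch: scan-and-scroll from the old cluster into the local one.
# Assumes curl and jq; hosts and the "junk-" prefix follow the question.
reindex_via_scroll() {
    old="$1"    # e.g. http://MY-OLD-ES.com:9200
    new="$2"    # e.g. http://localhost:9200
    index="$3"

    # Open the scroll. In scan mode the first response carries no hits,
    # only the initial scroll_id.
    scroll_id=$(curl -s "$old/$index/_search?scroll=1m&search_type=scan&size=10" \
        | jq -r ._scroll_id)

    while true; do
        # Fetch the next batch, passing the current scroll_id as the body.
        page=$(curl -s -XGET "$old/_search/scroll?scroll=1m" -d "$scroll_id")

        # Every response returns a fresh scroll_id; keep it for the next call.
        scroll_id=$(printf '%s' "$page" | jq -r ._scroll_id)

        # Finished once a scroll request comes back with no hits.
        [ "$(printf '%s' "$page" | jq '.hits.hits | length')" -eq 0 ] && break

        # Reindex each hit locally, as in the one-liner from the question.
        printf '%s\n' "$page" | jq -c '.hits.hits[]' | while read -r x; do
            id=$(printf '%s' "$x" | jq -r ._id)
            type=$(printf '%s' "$x" | jq -r ._type)
            src=$(printf '%s' "$x" | jq -r ._index)
            printf '%s' "$x" | jq ._source \
                | curl -s -XPUT "$new/junk-$src/$type/$id" -d @- >/dev/null
        done
    done
}

# Usage: reindex_via_scroll http://MY-OLD-ES.com:9200 http://localhost:9200 my_index
```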