Data model for fields that change frequently in ElasticSearch

What is the best way to handle fields that change frequently within a document in ElasticSearch? Per their docs on partial updates ...

Internally, however, the update API simply manages the same retrieve-change-reindex process we've already described.

Specifically, what should be done if indexing a document is likely to be expensive given the number of fields indexed and the size of some of the text fields that need to be parsed?

Take SO's view and vote counts on questions and answers as a concrete example. It would be expensive to re-index the body text just to update these values.
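To make the cost concrete, here is a toy sketch in Python. It is not Elasticsearch code: `ToyIndex`, `analyze`, and the field names are all made up for illustration. It only mimics the retrieve-change-reindex cycle from the quote above and counts how many tokens get analyzed when nothing but a vote count changes.

```python
import re


def analyze(text):
    """Toy stand-in for an analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


class ToyIndex:
    """Hypothetical in-memory index that tracks how much analysis work it does."""

    def __init__(self):
        self.docs = {}
        self.tokens_analyzed = 0

    def index(self, doc_id, doc):
        # Indexing a document analyzes every text field in it.
        for value in doc.values():
            if isinstance(value, str):
                self.tokens_analyzed += len(analyze(value))
        self.docs[doc_id] = dict(doc)

    def partial_update(self, doc_id, changes):
        # Retrieve the stored document, apply the change, reindex the whole thing.
        doc = dict(self.docs[doc_id])
        doc.update(changes)
        self.index(doc_id, doc)
```

In this toy model, calling `partial_update("q1", {"votes": 4})` analyzes the title and body all over again, even though neither changed.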

+3




3 answers


Perhaps you shouldn't update that often. Perhaps things like votes / views should be updated in ES periodically, whereas more important fields like answers / questions should be pushed immediately. Think about what matters most and see if you can get away with some degree of imprecision.
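A minimal sketch of the "update periodically" idea, assuming you can tolerate counters that lag behind by one flush interval. `CounterBuffer` and its method names are hypothetical. The buffer only accumulates increments in memory; how often you drain it, and how you send the result to ES, is up to you.

```python
from collections import Counter


class CounterBuffer:
    """Accumulates counter increments so they can be pushed to ES in batches."""

    def __init__(self):
        self.pending = Counter()

    def record(self, doc_id, field, delta=1):
        # Cheap in-memory bump; nothing is sent to the search index here.
        self.pending[(doc_id, field)] += delta

    def drain(self):
        """Return {doc_id: {field: total_delta}} and clear the buffer."""
        per_doc = {}
        for (doc_id, field), delta in self.pending.items():
            per_doc.setdefault(doc_id, {})[field] = delta
        self.pending.clear()
        return per_doc
```

A scheduled job could call `drain()` every few minutes and issue one update per document, so a thousand views of the same question cost one re-index instead of a thousand.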

ElasticSearch is great for text search, but I wouldn't use ES to back SO as a whole (or similar applications). It can be a useful tool for searching answers / questions on SO, or for internal uses (e.g. analyzing logs / events). But perhaps the actual data maintenance would be better handled by a different solution? Maybe most of the work should be backed by Cassandra? You get the idea ...



If you want to use ES as the solution to your needs and you MUST update frequently, you can definitely consider the parent / child model already mentioned. Of course, this method will require more memory / disk space and will take more CPU / time when querying totals. An alternative would be to have the parent store the searchable fields and let the child hold the metadata (where the child fields are not analyzed). This lets you apply updates frequently without paying for the expensive re-index, since there is nothing to analyze.
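As a rough illustration of that split, here is how it might look with the `join` field type that newer Elasticsearch versions use for parent / child (older versions, which this answer likely had in mind, used a `_parent` mapping instead). The relation names `post` and `counters`, the field names, and the helper function are assumptions for the sketch; check the mapping against the version you run.

```python
# Mapping sketch: the parent ("post") carries the analyzed text, the child
# ("counters") carries numbers that are stored but not searchable.
POSTS_MAPPING = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"post": "counters"}},
            "title": {"type": "text"},
            "body": {"type": "text"},
            "views": {"type": "integer", "index": False},
            "votes": {"type": "integer", "index": False},
        }
    }
}


def counters_child(parent_id, views, votes):
    """Build the routing value and body for a child document (hypothetical helper)."""
    body = {
        "relation": {"name": "counters", "parent": parent_id},
        "views": views,
        "votes": votes,
    }
    # A child must live on the same shard as its parent, so route by parent id.
    return parent_id, body
```

Updating a vote count then rewrites only the small child document; the parent's title and body are left alone.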

You can also think about what I mentioned above and see if you can get away with some leniency. This can be done in many ways as well. You can treat requests differently by type of change, or change the refresh / flush interval, or consider deduplicating updates if you submit them in bulk. These have their drawbacks too ...
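For the deduplication idea, a small sketch: collapse queued partial updates so each document appears once, then render them as newline-delimited JSON in the action-line-plus-`doc`-line shape that, as far as I know, the `_bulk` endpoint expects for `update` actions. The function names and the `posts` index name in the usage below are placeholders.

```python
import json


def dedupe_updates(updates):
    """Merge (doc_id, fields) pairs so later values win and each id appears once."""
    merged = {}
    for doc_id, fields in updates:
        merged.setdefault(doc_id, {}).update(fields)
    return merged


def to_bulk_body(index, merged):
    """Render merged partial updates as newline-delimited JSON for a bulk request."""
    lines = []
    for doc_id, fields in merged.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": fields}))
    return "\n".join(lines) + "\n"
```

For example, `to_bulk_body("posts", dedupe_updates(queue))` turns three queued updates touching two documents into two update actions. The drawback is the usual one: intermediate values are lost, and readers see the new numbers only after the batch is sent and the index refreshes.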

+2




I think the best way to deal with this is to split the document (you can use a parent-child relationship or just store a parent id) and make the document as small as possible (move the volatile parts to new types).

This might be a way to fulfill your requirement. Taking SO as the example, several types can be used; consider a post with its views and votes.



  • Create a type each for post, view and vote.
  • When a post is created, index a document into the post type (post id, title, description, tags). For each view of that post, index a document into the view type (with the post id). When someone votes, index a document into the vote type (number of votes, identifying information, and whatever else you need, such as a positive or negative flag).
  • To get the views for a post, filter the view type on the post id and take the number of matching documents.
  • To get the votes, use a stats aggregation on the number of votes, or a terms aggregation followed by a stats aggregation to separate positive and negative votes.
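The two lookups in the list above could be expressed as request bodies roughly like this. It is a sketch under assumptions: the field names `post_id`, `positive`, and `votes` are invented to match the description, and since newer Elasticsearch versions no longer support multiple types per index, the view and vote documents would likely live in their own indices there.

```python
def view_count_query(post_id):
    """Body for a count request against the view documents of one post."""
    return {"query": {"term": {"post_id": post_id}}}


def vote_stats_query(post_id):
    """Body for a search that splits votes by direction and summarizes each side."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": {"term": {"post_id": post_id}},
        "aggs": {
            "by_direction": {
                "terms": {"field": "positive"},
                "aggs": {"vote_stats": {"stats": {"field": "votes"}}},
            }
        },
    }
```

The first body would go to a count request on the view documents, the second to a search on the vote documents; the `sum` inside each `vote_stats` bucket gives the up and down totals.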

This is the approach I think is better, though there may be other opinions.

Thanks

+1




What I do is use a database like mongo or mysql to store the properties that are updated frequently, and use elastic search to store the documents for text search.

Example: I want to store data about a book and its contents, and I also want the total number of views. Updating and re-indexing the document every time a user views it is complete overkill.
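A minimal sketch of that split, using Python's built-in `sqlite3` as a stand-in for mongo or mysql so it runs without any setup. The table layout, function names, and the shape of the search hits (dicts with an `id` key) are assumptions for illustration; the search itself is left out.

```python
import sqlite3


def open_counter_store():
    """Create an in-memory table holding one view counter per book."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE views (book_id TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    return conn


def record_view(conn, book_id):
    # Make sure a row exists, then bump it; the search index is never touched.
    conn.execute("INSERT OR IGNORE INTO views (book_id, n) VALUES (?, 0)", (book_id,))
    conn.execute("UPDATE views SET n = n + 1 WHERE book_id = ?", (book_id,))
    conn.commit()


def attach_view_counts(conn, hits):
    """Add a 'views' key to each search hit, looked up by the hit's id."""
    ids = [hit["id"] for hit in hits]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT book_id, n FROM views WHERE book_id IN ({placeholders})", ids
    )
    counts = dict(rows)
    return [{**hit, "views": counts.get(hit["id"], 0)} for hit in hits]
```

The trade-off is that you can no longer sort or filter by view count inside the search query itself, unless you also copy the counts into the index from time to time.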

0

