Concatenation token filter in Elasticsearch
I am trying to index some tags after they have been cleaned up and other token filters applied. These tags can be multiple words.
What I am failing to do is apply a final token filter that collapses the whole token stream back into a single token.
So, I would like multi-word tags to go through stop-word removal, but then be concatenated back into a single token before being stored in the index (essentially what the keyword tokenizer does, but as a token filter).
I don't see a way to do this with the token filters Elasticsearch provides: if I tokenize on whitespace and then stem, every subsequent token filter receives those individual tokens, not the entire token stream, right?
For example, I need the tag
fox jumping over the fence
to be stored in the index as a single token,
fox jumping over the fence
and not as the separate tokens
fox, jump, over, the, fence
Is there a way to do this without pre-processing the string in my application and then indexing it as a not_analyzed field?
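To illustrate the problem, here is a minimal index-settings sketch (the index and analyzer names are placeholders I made up) using only built-in components. Because tokenization happens first, the stop filter only ever sees one token at a time:

```json
PUT /tags
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tag_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

Running "fox jumping over the fence" through this analyzer emits the separate tokens fox, jumping, fence (with "over" and "the" removed as stop words); there is no built-in filter that glues the surviving tokens back into one.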
After doing a little research, I found this thread:
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html
which had the exact solution I was looking for.
I created a simple Elasticsearch plugin that provides just a concatenation token filter; you can find it at:
https://github.com/francesconero/elasticsearch-concatenate-token-filter
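Assuming the plugin is installed, the filter can be slotted in at the end of the chain so it runs after stop-word removal. The filter type and parameter names below are my best reading of the plugin's README and may differ between versions, so check the repository before using them:

```json
PUT /tags
{
  "settings": {
    "analysis": {
      "filter": {
        "concatenate_filter": {
          "type": "concatenate",
          "token_separator": " "
        }
      },
      "analyzer": {
        "tag_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop", "concatenate_filter"]
        }
      }
    }
  }
}
```

With this chain, "fox jumping over the fence" would be lowercased, stripped of stop words, and then re-joined into the single token "fox jumping fence" before being stored.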