How to split words and numbers using tokenization
Is it possible to set up custom tokenization rules for a field so that words containing both letters and numbers are split into separate tokens? For example, I would like the string "50pc" to be split into two tokens, "50" and "pc".
I could create an override for each digit character so that it is tokenized on its own, but that would give me three words, "5", "0" and "pc", which is not what I want.
Is it possible to do this using tokenization, or do I need to preprocess the data?
Tokenizer overrides are meant for breaking up forms like 10x4 into 10 and 4. Splitting without a boundary character seems impossible at first. But you can override the tokenizer per field. So here's an (untested) idea.
- Create a field that uses admin:database-add-field-tokenizer-override to classify the digit characters as remove or punctuation. Configure the field's root, includes and excludes as you like.
- Create another field that does the same for the alphabetic characters.
- Leave the ordinary word tokenization untouched for normal text queries.
With this configuration, you can still use cts:word-query to match 50pc, and use cts:field-word-query to match 50 or pc.
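To make the two-field idea concrete, here is a rough simulation in Python. It is not MarkLogic code and does not call any MarkLogic API: the function field_tokens is a hypothetical stand-in that only mimics what classifying a set of characters as remove would do to the tokens of a field, under the assumption that removed characters simply disappear from each word.

```python
import string


def field_tokens(text, removed_chars):
    """Mimic a field whose override classifies removed_chars as 'remove'.

    Hypothetical helper for illustration only, not a MarkLogic function.
    """
    table = str.maketrans("", "", removed_chars)
    # Split on whitespace, strip the removed characters from each word,
    # and drop words that end up empty.
    stripped = (word.translate(table) for word in text.split())
    return [token for token in stripped if token]


text = "order 50pc today"

# Field 1: digits removed, so only the alphabetic part of "50pc" survives.
print(field_tokens(text, string.digits))         # ['order', 'pc', 'today']

# Field 2: letters removed, so only the numeric part of "50pc" survives.
print(field_tokens(text, string.ascii_letters))  # ['50']
```

In this sketch a query for pc would go to the first field and a query for 50 to the second, while the unmodified text still contains 50pc as a single word.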
However, pre-processing may be the best way to handle 50pc. This way you can include the units in the markup, like <pieces xmlns="http://example.com/2014/units" value="50">50pc</pieces>, or something along those lines. This can give you a lot of flexibility in the long run.
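A minimal pre-processing sketch in Python, assuming the data passes through a script before it is loaded. The names split_quantity, to_markup and the UNIT_ELEMENTS mapping are hypothetical; only the namespace and the pieces element come from the example above.

```python
import re
import xml.etree.ElementTree as ET

UNITS_NS = "http://example.com/2014/units"

# Hypothetical mapping from unit abbreviation to element name.
UNIT_ELEMENTS = {"pc": "pieces"}

# A run of digits immediately followed by a run of ASCII letters.
QUANTITY = re.compile(r"(\d+)([A-Za-z]+)")


def split_quantity(token):
    """Return (number, unit) for tokens like '50pc', or None otherwise."""
    match = QUANTITY.fullmatch(token)
    return (match.group(1), match.group(2)) if match else None


def to_markup(token):
    """Wrap a quantity token in a namespaced element, or return None."""
    parts = split_quantity(token)
    if parts is None or parts[1] not in UNIT_ELEMENTS:
        return None
    value, unit = parts
    element = ET.Element(f"{{{UNITS_NS}}}{UNIT_ELEMENTS[unit]}", {"value": value})
    element.text = token  # keep the original text so word queries still match
    return element


print(split_quantity("50pc"))  # ('50', 'pc')
element = to_markup("50pc")
print(element.get("value"), element.text)  # 50 50pc
```

With the number in an attribute and the unit in the element name, each part can be indexed and queried on its own without touching the tokenizer.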
The short answer is: no, custom tokenization doesn't give you that much flexibility right now.
Consider whether getting the three tokens "5", "0", "pc" is really a problem. It depends on your application, your data, and the kinds of queries you run. It will make a difference for wildcards, and for longer numbers you may get more complex queries, or need positions to make them accurate unfiltered, because for field queries the numbers turn into phrases. You will also end up with longer term lists than you otherwise would, and this can cause problems in some cases.
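To see why longer numbers turn into phrases, here is a small Python illustration of per-digit tokenization. It is only a simulation under the assumption that every digit becomes its own token and letter runs stay together; per_digit_tokens is a hypothetical helper, not part of any MarkLogic API.

```python
def per_digit_tokens(word):
    """Split a word so each digit is its own token and letter runs stay whole."""
    tokens = []
    letters = ""
    for ch in word:
        if ch.isdigit():
            if letters:
                tokens.append(letters)
                letters = ""
            tokens.append(ch)
        else:
            letters += ch
    if letters:
        tokens.append(letters)
    return tokens


print(per_digit_tokens("50pc"))          # ['5', '0', 'pc']
print(len(per_digit_tokens("1234567")))  # 7
```

A query for a seven-digit number would then have to match seven adjacent tokens in order, which is where the phrase handling and positions come in.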