How to split words and numbers using tokenization

Is it possible to set up custom tokenization rules for a field that split tokens containing both letters and numbers into separate tokens? For example, I would like the string "50pc" to be split into the two tokens "50" and "pc".

I could create an override for each digit so that it becomes its own token, but that would give me three tokens "5", "0" and "pc", which is not what I want.

Is it possible to do this using tokenization, or do I need to preprocess the data?
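
For context, this is what the default tokenizer does today (a quick check from Query Console):

    cts:tokenize("50pc")
    (: comes back as a single word token, "50pc" :)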

+3




3 answers


Tokenizer overrides are meant for breaking up forms like 10x4 into 10 and 4. Splitting without a boundary character seems impossible at first. But ... you can override what a character means to the tokenizer. So here's an (untested) idea.
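
A rough sketch of that idea using the Admin API (untested; the field name "units", the per-digit "symbol" override class, and the database name are assumptions):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Create a field "units" and override every digit inside that field
       so each digit is its own token in the field, while the regular
       word index still sees "50pc" as one word. The field would also
       need include paths or elements for your content, omitted here. :)
    let $config := admin:get-configuration()
    let $dbid   := xdmp:database("Documents")
    let $config := admin:database-add-field($config, $dbid,
                     admin:database-field("units", fn:false()))
    let $overrides :=
      for $digit in ("0","1","2","3","4","5","6","7","8","9")
      return admin:database-tokenizer-override($digit, "symbol")
    let $config := admin:database-add-field-tokenizer-override(
                     $config, $dbid, "units", $overrides)
    return admin:save-configuration($config)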



With this configuration, you can still use cts:word-query to match 50pc, and also use cts:field-word-query to match 50 or pc.
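
For example, against a document containing 50pc (the field name "units" is carried over from the sketch above):

    (: The plain word index still holds "50pc" as one token: :)
    cts:search(fn:doc(), cts:word-query("50pc")),

    (: Inside the field the digits and "pc" are separate tokens, so a
       field query for "50" becomes the phrase "5 0" and a query for
       "pc" matches its own token: :)
    cts:search(fn:doc(), cts:field-word-query("units", "50")),
    cts:search(fn:doc(), cts:field-word-query("units", "pc"))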

However, pre-processing may be the best way to handle 50pc. That way you can include the units in the markup, like <pieces xmlns="http://example.com/2014/units" value="50">50pc</pieces> - or something along those lines. This can give you a lot of flexibility in the long run.
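
A sketch of that kind of pre-processing at load time (the regex and the unit-to-element mapping are just illustrations):

    xquery version "1.0-ml";
    (: Split a raw value such as "50pc" into a number and a unit, and
       wrap it in markup like the example above. :)
    declare function local:unit-markup($raw as xs:string) as element()
    {
      let $value := fn:replace($raw, "^(\d+)\D*$", "$1")
      let $unit  := fn:replace($raw, "^\d+(\D*)$", "$1")
      let $name  := if ($unit eq "pc") then "pieces" else $unit
      return
        element { fn:QName("http://example.com/2014/units", $name) } {
          attribute value { $value },
          $raw
        }
    };

    local:unit-markup("50pc")
    (: => <pieces xmlns="http://example.com/2014/units" value="50">50pc</pieces> :)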

+2




The short answer is: no, custom tokenization doesn't give you that much flexibility right now.



Consider whether the three tokens "5", "0" and "pc" are really a problem. It depends on your application, your data, and the kinds of queries you run. It will make a difference for wildcards, and for longer numbers you may end up with more complex queries, or need positions to resolve them accurately unfiltered, because for field queries the numbers become phrases. You will also end up with larger term lists than you would otherwise, which can cause problems in some cases.
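
As a rough illustration of the phrase point, tokenizing query text with the field's rules (field name "units" assumed, per-digit overrides as in the earlier sketch) shows the digits splitting apart:

    cts:tokenize("50", "en", "units")
    (: each digit comes back as its own token, so
       cts:field-word-query("units", "50") behaves like the phrase "5 0" :)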

+2




Not sure, but this may be the answer (this looks like java.io.StreamTokenizer):

    tokenizer.wordChars('_', '_'); // treat '_' as a word character so it does not split tokens

Please clarify further if this is not what you meant.

0








