How to split words and numbers using tokenization

Question

Is it possible to set up custom tokenization rules for a field so that words containing both letters and digits are split into separate tokens? For example, I would like the string "50pc" to be split into the two tokens "50" and "pc".

I could create an override for each digit so that it is tokenized on its own, but that would give me three tokens, "5", "0" and "pc", which is not what I want.

Is it possible to do this using tokenization, or do I need to preprocess the data?

+3

marklogic

Eric Russell 13 nov. 14 at 17:26

3 answers

The short answer is: no, custom tokenization doesn't give you that much flexibility right now.

Consider whether the three tokens "5", "0" and "pc" are really a problem for you. That depends on your application, your data, and the kinds of queries you run. It will make a difference for wildcards. For longer numbers you may get more complex queries, or need positions to keep unfiltered results accurate, because in field queries the numbers turn into phrases. You will also end up with longer term lists than you otherwise would, which can cause problems in some cases.

+2

mholstege 13 nov. 14 at 19:51

Not sure, but this might be the answer:

tokenizer.wordChars('_', '_');

Please clarify the question if it is not.

0

user3560264 13 nov. 14 at 17:33

mblakele · Accepted Answer · 2014-11-13T19:06:08+0000

Tokenizer overrides are meant for breaking up forms like 10x4 into 10 and 4. Splitting without a boundary character seems impossible at first. But you can override how the tokenizer classifies characters, so here is an (untested) idea.

Create a field that uses admin:database-add-field-tokenizer-override to classify digit characters as remove or punctuation. Configure the field's root, includes and excludes as you like.
Create another field that does the same for the alphabetic characters.
Leave the normal word-query configuration as it is.

With this configuration, you can still use cts:word-query to match 50pc, and you can also use cts:field-word-query to match 50 or pc.
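To make the two-field idea above easier to picture, here is a toy model in Python. It is not MarkLogic code and does not call any MarkLogic API. It assumes that a character classed as remove is simply dropped before tokens are formed, and field_tokens is a hypothetical helper invented for this illustration.

```python
import re

def field_tokens(text, removed):
    """Toy model of one field: drop every character for which
    removed(ch) is true, then split what is left into word tokens."""
    kept = "".join(ch for ch in text if not removed(ch))
    return re.findall(r"\w+", kept)

# Field 1: digits classed as "remove", so only the letters survive.
letters_field = field_tokens("50pc", str.isdigit)

# Field 2: letters classed as "remove", so only the digits survive.
digits_field = field_tokens("50pc", str.isalpha)

# Default word query: no overrides, so the token stays whole.
default_field = field_tokens("50pc", lambda ch: False)

print(letters_field, digits_field, default_field)
```

In this model each field sees only one half of 50pc, which is why a field query could match pc or 50 while the ordinary word query still matches 50pc. Whether the real tokenizer behaves this way is exactly what the untested idea would need to confirm.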

However, pre-processing may be the best way to handle 50pc. That way you can capture the units in the markup, for example <pieces xmlns="http://example.com/2014/units" value="50">50pc</pieces> - or something along those lines. This can give you a lot of flexibility in the long run.
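A minimal sketch of such a pre-processing step, written in Python and run before the data is loaded. The namespace and the pieces element come from the example above. The UNIT_ELEMENTS mapping and the function names are assumptions made for illustration, and the pattern only handles a run of digits followed directly by a run of ASCII letters.

```python
import re
from xml.sax.saxutils import escape, quoteattr

UNITS_NS = "http://example.com/2014/units"  # namespace from the example above
UNIT_ELEMENTS = {"pc": "pieces"}            # hypothetical unit-to-element mapping

# Digits immediately followed by letters, e.g. "50pc".
QUANTITY = re.compile(r"(\d+)([A-Za-z]+)")

def split_quantity(token):
    """Return (number, unit) for tokens like '50pc', or None otherwise."""
    match = QUANTITY.fullmatch(token)
    return (match.group(1), match.group(2)) if match else None

def to_unit_element(token):
    """Wrap a known quantity token in a unit element.
    Anything else is returned as XML-escaped text."""
    parts = split_quantity(token)
    if parts is None or parts[1].lower() not in UNIT_ELEMENTS:
        return escape(token)
    value, unit = parts
    name = UNIT_ELEMENTS[unit.lower()]
    return "<{0} xmlns={1} value={2}>{3}</{0}>".format(
        name, quoteattr(UNITS_NS), quoteattr(value), escape(token)
    )

print(split_quantity("50pc"))
print(to_unit_element("50pc"))
```

With the number stored in an attribute, it can be indexed and queried separately from the original text, so the tokenizer never has to split 50pc at all.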
