How do I perform a "lower case filter" along with "char_filter"?

As far as I read in the ES documentation:

  • "Character filters are used to strip a string before it is symbolized."
  • "After tokenization, the resulting token stream is passed through any defined token filters"

(source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html )

From these two statements, I understand that analysis happens in the following order (see the sketch after this list):

  • char_filter;
  • tokenizer;
  • token filters.
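
To check that order, here is a minimal sketch using the _analyze API with an ad-hoc analyzer (this assumes the JSON request-body form of _analyze; older Elasticsearch versions only accept query-string parameters):

POST _analyze
{
    "char_filter" : [
        { "type" : "mapping", "mappings" : [ "ph=>f" ] }
    ],
    "tokenizer" : "whitespace",
    "filter" : [ "lowercase" ],
    "text" : "philipp"
}

The mapping runs on the raw input first, the whitespace tokenizer splits the result, and only then is the lowercase token filter applied, so this returns the single token "filipp".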

Problem:

I may have a char_filter that maps several characters at once.

Example: ph → f.

However, "PH" will not be converted to "F", because "PH" is not part of the mapping.

So analyzing philipp produces filipp, while analyzing Philipp produces philipp.

To handle upper and lower case (and get the same result in both), the char_filter needs 2^(number of characters in the mapped sequence) mappings.

Example: ph → f; Ph → F; pH → f; PH → F.

It wouldn't be a problem if I only had these 4 mappings, but with more and longer sequences the number of case variants grows quickly, and the char_filter tends to become a big mess.

Index example:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase"
                        ],
                        "char_filter" : [
                            "misc_simplifications"
                        ]
                    }
                },
                "char_filter" : {
                    "misc_simplifications" : {
                        "type" : "mapping",
                        "mappings" : [
                            "ph=>f","Ph=>F","pH=>f","PH=>F"
                        ]
                    }
                }
            }
        }
    }
}
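
To reproduce the asymmetry with this index, I can run its analyzer by hand (assuming the index was created as my_index; depending on the Elasticsearch version the query-string form _analyze?analyzer=default_index&text=... may be needed instead of a JSON body):

GET my_index/_analyze
{
    "analyzer" : "default_index",
    "text" : "Philipp philipp"
}

This returns the tokens philipp and filipp, so the two spellings do not end up as the same term.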


Philosophical question:

I understand that I can treat "ph" and "Ph" the same way, while "pH" can mean something completely different. Still, is there a way to convert characters to lowercase before the char_filter phase? Does that even make sense?

This big mapping gives me the feeling that I am doing something wrong, or that there might be an easier (more elegant) solution.

1 answer


You are correct about the sequence of steps.

However, the main purpose of a char filter is to clean up the input so it is easier to tokenize, for example by removing XML tags or replacing a separator with a space character.

So I would define misc_simplifications as a token filter instead, applied after the lowercase filter:



{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase",
                            "misc_simplifications"
                        ]
                    }
                },
                "filter" : {
                    "misc_simplifications" : {
                        "type" : "pattern_replace",
                        "pattern" : "ph",
                        "replacement" : "f"
                    }
                }
            }
        }
    }
}

Note: I used a pattern_replace filter instead of a mapping filter. You can also adjust the regex so that it only replaces "ph" at the beginning of a token, as sketched below.
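
A possible variant anchored to the start of the token (just a sketch; the rest of the settings stay the same):

"filter" : {
    "misc_simplifications" : {
        "type" : "pattern_replace",
        "pattern" : "^ph",
        "replacement" : "f"
    }
}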

Also, your mappings look like phonetic substitutions. I'm not sure about your exact requirements, but a phonetic token filter might help you.
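
A minimal sketch of that approach, assuming the analysis-phonetic plugin is installed (the filter name my_phonetic and the choice of encoder are just examples):

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase",
                            "my_phonetic"
                        ]
                    }
                },
                "filter" : {
                    "my_phonetic" : {
                        "type" : "phonetic",
                        "encoder" : "double_metaphone",
                        "replace" : false
                    }
                }
            }
        }
    }
}

With "replace" set to false, the original token is kept alongside its phonetic form, so exact matches still work.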
