Elasticsearch pattern_capture filter emits tokens that do not match the pattern

I have a case where I need to extract the domain part of email addresses found in a body of text. I use the uax_url_email tokenizer so that each email address is emitted as a single token, and a pattern_capture filter with the pattern "@(.+)" to capture the domain. But uax_url_email also emits the words that are not email addresses, and the pattern_capture filter does not remove them. Any suggestions?

"custom_analyzer": {
  "tokenizer": "uax_url_email",
  "filter": [
    "email_domain_filter"
  ]
}

"filter": {
  "email_domain_filter": {
    "type": "pattern_capture",
    "preserve_original": false,
    "patterns": [
      "@(.+)"
    ]
  }
}

      

Input string: " my email id is xyz@gmail.com "

Output tokens: my, email, id, is, gmail.com

But I only need gmail.com.


1 answer


"If none of the patterns match, or if preserveOriginal is true, the original token will be preserved."

https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
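To make the quoted rule concrete, here is a minimal Python sketch that models only the documented behaviour (capture groups replace the token, and a token with no captures is kept as it is). It is not Lucene's implementation, and the function name pattern_capture and its arguments are made up for illustration.

```python
import re

def pattern_capture(tokens, patterns, preserve_original=False):
    """Simplified model of the documented pattern_capture behaviour.

    Hypothetical helper for illustration only; it ignores offsets,
    positions, and other details of the real Lucene filter.
    """
    compiled = [re.compile(p) for p in patterns]
    out = []
    for token in tokens:
        # Collect every non-empty capture group from every pattern.
        captures = [
            group
            for rx in compiled
            for match in rx.finditer(token)
            for group in match.groups()
            if group
        ]
        # Documented rule: keep the original token when nothing was
        # captured, or when preserve_original is set.
        if preserve_original or not captures:
            out.append(token)
        out.extend(captures)
    return out

# Tokens as the uax_url_email tokenizer would emit them for the input above.
tokens = ["my", "email", "id", "is", "xyz@gmail.com"]
print(pattern_capture(tokens, ["@(.+)"]))
```

In this model the non-email words pass through unchanged because the pattern never matches them, which is the same output reported in the question.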



Try adding a pattern that matches the other tokens but does not contain a capture group (e.g. ".*").
