Elasticsearch pattern_capture filter emits tokens that do not match the pattern
I have a case where I need to extract the domain part of email addresses found in the body text. I use the uax_url_email tokenizer so that each email address is emitted as a single token, and a pattern_capture filter with the pattern "@(.+)" to output the domain. However, uax_url_email also returns the words that are not email addresses, and the pattern_capture filter does not filter those out. Any suggestions?
"custom_analyzer": {
    "tokenizer": "uax_url_email",
    "filter": [
        "email_domain_filter"
    ]
}
"filter": {
    "email_domain_filter": {
        "type": "pattern_capture",
        "preserve_original": false,
        "patterns": [
            "@(.+)"
        ]
    }
}
Input string: "my email id is xyz@gmail.com"
Output tokens: my, email, id, is, gmail.com
But I only need gmail.com
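One possible direction, offered as an untested sketch rather than a confirmed fix: pattern_capture passes a token through unchanged when none of its patterns match, even with preserve_original set to false, which would explain why my, email, id and is survive. Since the uax_url_email tokenizer tags email tokens with the type <EMAIL>, a keep_types filter placed before email_domain_filter could drop everything else. The filter name email_only_filter and the surrounding settings wrapper below are assumptions made for illustration; only custom_analyzer and email_domain_filter come from the configuration above.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "email_only_filter",
            "email_domain_filter"
          ]
        }
      },
      "filter": {
        "email_only_filter": {
          "type": "keep_types",
          "types": ["<EMAIL>"]
        },
        "email_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": ["@(.+)"]
        }
      }
    }
  }
}
```

The order of the filter chain matters here: keep_types has to run first so that only the email token reaches pattern_capture. Running the analyzer through the _analyze API on the input string above is a quick way to check whether the output is reduced to gmail.com.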