Handling accented Unicode characters with python regex module
I have two functions that work fine with ASCII strings and use the re module:
import re

def findWord(w):
    return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall

def replace_keyword(w, c, x):
    return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)
However, they cannot handle UTF-8 encoded strings. On further searching, I found that the regex module is better suited for Unicode strings, so I have spent the last couple of hours trying to port the functions to regex, but nothing seems to work. This is what I have at the moment:
import regex

def findWord(w):
    return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall

def replace_keyword(w, c, x):
    return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)
However, when I pass an accented (not normalized) UTF-8 string, I keep getting the error "ordinal not in range".
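For context, an "ordinal not in range" error usually points to an implicit ASCII encode or decode, which typically happens when byte strings and Unicode strings get mixed. A minimal sketch of decoding and normalizing at the boundary follows. It is written for Python 3, and the helper name to_text is hypothetical, not part of re or regex:

```python
import unicodedata

def to_text(value, encoding='utf-8'):
    """Return value as an NFC-normalized Unicode string.

    Hypothetical helper: decodes bytes explicitly, so no implicit ASCII
    conversion can happen later, and composes accents so that 'e' plus a
    combining acute accent becomes the single character 'é'.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return unicodedata.normalize('NFC', value)
```

Normalizing to NFC matters for patterns that use \b, because a combining accent is not a word character and can otherwise introduce a word boundary in the middle of a word.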
EDIT: The suggested possible duplicate, "Regular expression to match non-english characters?", doesn't solve my problem. First, I want to use the Python re / regex module. Secondly, I want the find and replace functions to work with Python.
EDIT: I am using Python 2.
EDIT: If you feel you can help me get these two functions working with Python 3, please let me know. Hopefully I can call Python 3 from my Python 2 script just to run these two functions.
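For reference, here is a sketch of how the two functions might look in Python 3, where str is already Unicode and the standard re module treats accented letters as word characters by default. It assumes the inputs are already decoded str values. The re.escape call is an addition not present in the code above; it keeps characters in the keyword from being read as regex syntax:

```python
import re

def findWord(w):
    # re.escape guards against regex metacharacters in the keyword
    pattern = r'\b{0}.*?\b'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).findall

def replace_keyword(w, c, x):
    # Wrap every word starting with w (case-insensitive) in a <mark> tag
    pattern = r'\b({0}\S*)'.format(re.escape(w))
    replacement = r'<mark style="background-color:{0}">\1</mark>'.format(c)
    return re.sub(pattern, replacement, x, flags=re.IGNORECASE)

print(findWord('café')('Un Café au lait'))
print(replace_keyword('café', 'yellow', 'Un Café au lait'))
```

The matched text keeps its original casing, because \1 inserts what was actually found in x rather than the keyword as typed.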
I think I am getting somewhere. I'm trying to get this working without the re or regex modules, using plain Python:
found_keywords = []
for word in keyword_list:
    if word.lower() in article_text.lower():
        found_keywords.append(word)
for word in found_keywords:  # highlight the found keyword in the text
    article_text = article_text.lower().replace(word.lower(), '<mark style="background-color:%s">%s</mark>' % (yellow_color, word))
Now I just need to replace the found keywords case-insensitively and I should be good to go.
Please help me with this last step: replacing keywords case-insensitively, without using re or regex, so that it works for accented strings.
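One possible way to do that last step, sketched for Python 3 Unicode strings: scan the original text, compare a lowercased slice against the lowercased keyword, and copy the slice from the original so its casing is preserved. The function name highlight is hypothetical. The sketch assumes the keyword and its match have the same length, which can fail for the rare characters whose lowercase form has a different length, and it does not check word boundaries:

```python
def highlight(text, word, color):
    """Wrap each case-insensitive occurrence of word in a <mark> tag."""
    needle = word.lower()
    n = len(word)
    out = []
    i = 0
    while i < len(text):
        chunk = text[i:i + n]
        if n and chunk.lower() == needle:
            # Emit the original slice, so 'Café' stays 'Café'
            out.append('<mark style="background-color:%s">%s</mark>' % (color, chunk))
            i += n
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)
```

Unlike calling article_text.lower().replace(...), this leaves the rest of the article in its original case.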