Handling accented Unicode characters with the Python regex module

I have two functions that work fine with ASCII strings and use the re module:

import re

def findWord(w):
    return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall


def replace_keyword(w, c, x):
    return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)

      

However, they do not work with UTF-8 encoded strings. On further searching, I found that the regex module is better suited to Unicode strings, so I have spent the last couple of hours trying to port the functions to regex, but nothing seems to work. This is what I have at the moment:

import regex

def findWord(w):
    return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall

def replace_keyword(w, c, x):
    return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)

      

However, when using an accented (not normalized) UTF-8 string, I keep getting the error "ordinal not in range".
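For context, "ordinal not in range(128)" is the tail of the UnicodeDecodeError / UnicodeEncodeError message that Python 2 raises when it implicitly converts between byte strings and unicode using the ASCII codec. That is probably what happens here when a UTF-8 byte string is mixed with a pattern or formatted into one. The following is only a sketch under that assumption: it decodes any byte input to text first, escapes the keyword with re.escape, and keeps the original patterns. The helper to_text and the snake_case name find_word are hypothetical names introduced for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # makes string literals unicode on Python 2
import re


def to_text(s, encoding="utf-8"):
    """Hypothetical helper: return text for either bytes or text input.

    Assumes byte input is UTF-8 encoded unless told otherwise.
    """
    return s.decode(encoding) if isinstance(s, bytes) else s


def find_word(w):
    # re.escape keeps characters in the keyword from being read as regex syntax.
    pattern = re.compile(
        r"\b{0}.*?\b".format(re.escape(to_text(w))),
        flags=re.IGNORECASE | re.UNICODE,
    )
    # Return a callable that decodes its input before searching.
    return lambda text: pattern.findall(to_text(text))


def replace_keyword(w, c, x):
    pattern = r"\b({0}\S*)".format(re.escape(to_text(w)))
    replacement = '<mark style="background-color:{0}">\\1</mark>'.format(c)
    return re.sub(pattern, replacement, to_text(x),
                  flags=re.IGNORECASE | re.UNICODE)
```

Both functions return text (unicode on Python 2), so the result may need an explicit .encode("utf-8") before it is written out as bytes. If the keyword and the article can use different Unicode forms (composed versus decomposed accents), normalizing both with unicodedata.normalize("NFC", ...) before matching is one option.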

EDIT: A suggested possible duplicate question, "Regular expression to match non-english characters?", doesn't solve my problem. First, I want to use the Python re / regex module. Secondly, I want the find and replace functions to work with Python.

EDIT: I am using Python 2.

EDIT: If you feel you can help me get these two functions working with Python 3, please let me know. Hopefully I can call Python 3 from my Python 2 script just for these two functions.



1 answer


I think I am getting somewhere. I'm trying to get this working without the re or regex modules, using just plain Python:

found_keywords = []
for word in keyword_list:
    if word.lower() in article_text.lower():
         found_keywords.append(word)

for word in found_keywords:  # highlight the found keyword in the text
    article_text = article_text.lower().replace(word.lower(), '<mark style="background-color:%s">%s</mark>' % (yellow_color, word))

      



Now I just need to make the replacement case-insensitive, without lowercasing the whole article, and I should be good to go.

Please help me with this last step: replacing the keywords case-insensitively without using re or regex, so that it works for accented strings.
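One possible way to do that last step, offered as a sketch rather than a definitive answer: search in a lowercased copy of the text, but cut the pieces out of the original text so its casing survives. The function name highlight_keyword is hypothetical. The sketch assumes the inputs are already text (unicode on Python 2, so byte strings would need .decode("utf-8") first) and that lowercasing does not change the length of the text, which holds for most but not all characters.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # makes string literals unicode on Python 2


def highlight_keyword(text, word, color):
    """Hypothetical helper: wrap each case-insensitive match of word in <mark>."""
    lowered_text = text.lower()
    lowered_word = word.lower()
    # Bail out if there is nothing to search for, or if lowercasing changed
    # the length of the text (indices would no longer line up).
    if not lowered_word or len(lowered_text) != len(text):
        return text
    pieces = []
    start = 0
    while True:
        hit = lowered_text.find(lowered_word, start)
        if hit == -1:
            break
        end = hit + len(lowered_word)
        pieces.append(text[start:hit])  # untouched text before the match
        pieces.append('<mark style="background-color:{0}">{1}</mark>'.format(
            color, text[hit:end]))      # the match, with its original casing
        start = end
    pieces.append(text[start:])         # remainder after the last match
    return "".join(pieces)
```

Like the `in` check in the loop above, this matches substrings rather than whole words, so a keyword inside a longer word is highlighted too.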
