Regex operator for Python

I would like to create a regex statement in Python 2.7.8 that will replace characters. It will work like this ...

ó -> o
ú -> u
é -> e
á -> a
í -> i
ù,ú  -> u

      

These are the only Unicode characters that I would like to change. I don't want to change Unicode characters like ë and ä. Thus, the word thójlà will become thojla. I'm sure there is a way to do this without creating a separate regex for each character, as shown below.

word = re.sub(ur'ó', ur'o', word)
word = re.sub(ur'ú', ur'u', word)
word = re.sub(ur'é', ur'e', word)
....

      

I tried to figure it out but no luck. Any help is appreciated!



3 answers


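Try str.translate with str.maketrans. Note that str.maketrans is the Python 3 spelling; in Python 2.7, call translate on a unicode string with a dict that maps code points to replacements.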

...

print('thójlà'.translate(str.maketrans('óúéáíùú', 'oueaiuu')))
# thojlà

      

This way, only the replacements you list are made; every other character is left untouched.

If you have many lines to change, build the table once and assign it to a variable, for example:



table = str.maketrans('óúéáíùú', 'oueaiuu')

      

and then each line can be translated as

s.translate(table)
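
The snippets above rely on str.maketrans, which is a Python 3 API. Since the question targets Python 2.7.8, here is a minimal sketch of the same idea that should run on both versions, assuming the input is a unicode string: translate then accepts a dict that maps code points to replacement strings. The names ACCENT_TABLE and strip_listed_accents are illustrative only, and the table follows the list in the question, so à is left alone unless you add it.

```python
# -*- coding: utf-8 -*-

# Map each code point to its replacement; characters not listed stay untouched.
# The u'' prefixes keep the literals Unicode on Python 2.7 and are accepted
# on Python 3.3+ as well.
ACCENT_TABLE = {
    ord(u'á'): u'a',
    ord(u'é'): u'e',
    ord(u'í'): u'i',
    ord(u'ó'): u'o',
    ord(u'ù'): u'u',
    ord(u'ú'): u'u',
}


def strip_listed_accents(text):
    """Return text with only the characters in ACCENT_TABLE replaced."""
    return text.translate(ACCENT_TABLE)


# à, ë and ä are not in the table, so they pass through unchanged.
result = strip_listed_accents(u'thójlà ëä')
```

Adding `ord(u'à'): u'a'` to the table would turn thójlà into thojla, as the question expects.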

      



With the string method replace(), you can do something like:

>>> x = "thójlà"
>>> x
'thójlà'
>>> x = x.replace('ó', 'o')
>>> x
'thojlà'
>>> x = x.replace('à', 'a')
>>> x
'thojla'

      

Generalized way:



# -*- coding: utf-8 -*-

replace_dict = {
    'á':'a',
    'à':'a',
    'é':'e',
    'í':'i',
    'ó':'o',
    'ù':'u',
    'ú':'u'
}

str1 = "thójlà"

for key in replace_dict:
    str1 = str1.replace(key, replace_dict[key])

print(str1) #prints 'thojla'

      

A third way, useful if your list of character mappings gets large, is to group the accented variants under each replacement letter:

# -*- coding: utf-8 -*-

replace_dict = {
    'a':['á','à'],
    'e':['é'],
    'i':['í'],
    'o':['ó'],
    'u':['ù','ú']
}

str1 = "thójlà"

for key, values in replace_dict.items():
    for character in values:
        str1 = str1.replace(character, key)

print(str1)
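
The question asked for a single regex rather than one re.sub call per character. A possible sketch of that, using the same mapping as the dictionaries above: compile one alternation from the keys and let a callback look up each match. The names REPLACEMENTS, PATTERN, and replace_accents are hypothetical, and the u'' prefixes are there on the assumption that the code may need to run on Python 2.7 as well as Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same mapping as the dictionaries above; anything not listed is left alone.
REPLACEMENTS = {
    u'á': u'a',
    u'à': u'a',
    u'é': u'e',
    u'í': u'i',
    u'ó': u'o',
    u'ù': u'u',
    u'ú': u'u',
}

# One pattern that matches any key; re.escape guards against keys that
# happen to be regex metacharacters.
PATTERN = re.compile(u'|'.join(re.escape(ch) for ch in REPLACEMENTS))


def replace_accents(text):
    """Replace every match in a single pass, looking it up in REPLACEMENTS."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)


converted = replace_accents(u'thójlà')
```

Compared with the loops above, this scans the string once instead of once per character, which may matter for long inputs.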

      



If you can use external packages, I think the easiest way would be unidecode. Be aware that it transliterates every non-ASCII character, so ë and ä would be changed as well. For example:

from unidecode import unidecode

print(unidecode('thójlà'))
# prints: thojla

      
