How to remove only accents but not umlauts from strings in Python

I am using the following code:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
              if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')

      

which returns

'ewalaieUO'

      

But I want it to return

'ewalaieÜÖ'

      

Is there an easier way than replacing characters one by one with str.replace(char_a, char_b)? How can I handle this effectively?



1 answer


So let's start with your test input:

In [1]: test
Out[1]: 'ewaláièÜÖ'

      

See what happens to it during normalization:

In [2]: [x for x in unicodedata.normalize('NFD', test)]
Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']
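To make the decomposition above concrete, here is a minimal sketch (the variable names are only illustrative) showing that NFD splits 'Ü' into a base letter plus a combining mark, and that NFC puts the two back together. This matters because any filter applied to the NFD form returns a decomposed string unless it is recomposed afterwards.

```python
import unicodedata

composed = '\u00dc'  # 'Ü' as a single code point
decomposed = unicodedata.normalize('NFD', composed)    # 'U' + COMBINING DIAERESIS
recomposed = unicodedata.normalize('NFC', decomposed)  # back to one code point

print(len(composed), len(decomposed), len(recomposed))  # 1 2 1
print(decomposed == composed)   # False: same glyph, different code points
print(recomposed == composed)   # True
```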

      

And here are the unicodedata categories for each normalized item:

In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']

      



As you can see, the category Mn contains not only "accents" but also "umlauts". So instead of unicodedata.category you can use unicodedata.name:

In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
Out[4]: ['LATIN SMALL LETTER E',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER A',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER I',
 'LATIN SMALL LETTER E',
 'COMBINING GRAVE ACCENT',
 'LATIN CAPITAL LETTER U',
 'COMBINING DIAERESIS',
 'LATIN CAPITAL LETTER O',
 'COMBINING DIAERESIS']

      

Here the accents are COMBINING ACUTE/GRAVE ACCENT and the umlauts are COMBINING DIAERESIS. So here is my suggestion on how to fix your code:

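def strip_accents(s):
    # name(c, '') avoids a ValueError for unnamed characters such as '\n'
    stripped = ''.join(c for c in unicodedata.normalize('NFD', s)
                       if not unicodedata.name(c, '').endswith('ACCENT'))
    # recompose so that 'U' + COMBINING DIAERESIS becomes a single 'Ü' again
    return unicodedata.normalize('NFC', stripped)

strip_accents(test)
'ewalaieÜÖ'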
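One caveat with matching on the name suffix: standalone characters such as the backtick (GRAVE ACCENT, U+0060) and the caret (CIRCUMFLEX ACCENT, U+005E) also have names ending in ACCENT, so they would be dropped too. A stricter variant, sketched below, strips only combining marks from an explicit set. The names MARKS_TO_STRIP and strip_selected_marks are hypothetical, and the set itself is an assumption about which marks count as "accents" for your data.

```python
import unicodedata

# Hypothetical set of marks to strip; extend it to match your own data.
MARKS_TO_STRIP = frozenset({
    'COMBINING ACUTE ACCENT',
    'COMBINING GRAVE ACCENT',
    'COMBINING CIRCUMFLEX ACCENT',
})

def strip_selected_marks(s, marks=MARKS_TO_STRIP):
    """Remove only the listed combining marks and keep everything else."""
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(
        c for c in decomposed
        # Only nonspacing marks (Mn) whose name is in the set are dropped,
        # so standalone '`' and '^' survive.
        if not (unicodedata.category(c) == 'Mn'
                and unicodedata.name(c, '') in marks)
    )
    # Recompose so that kept marks (e.g. the diaeresis) rejoin their letters.
    return unicodedata.normalize('NFC', kept)

print(strip_selected_marks('ewal\u00e1i\u00e8\u00dc\u00d6'))
print(strip_selected_marks('`a^b` \u00f1'))
```

Because the set is explicit, marks that are not listed (diaeresis, tilde, cedilla and so on) are left untouched, which may or may not be what you want.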

      

Also, as you can read in the unicodedata documentation, this module is just a wrapper around the Unicode Character Database, so please take a look at the list of names in that database to make sure this approach covers everything you need.
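As a quick way to do that check from Python itself, the sketch below lists the named characters in the Combining Diacritical Marks block (U+0300 to U+036F) and splits them by whether the 'ACCENT' suffix test would catch them. The helper name combining_mark_names is hypothetical, and restricting the scan to this one block is an assumption; other blocks contain combining marks as well.

```python
import unicodedata

def combining_mark_names(start=0x0300, end=0x036F):
    """Return the names of all named characters in the given code point range."""
    names = []
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), '')  # '' for unassigned code points
        if name:
            names.append(name)
    return names

names = combining_mark_names()
caught = [n for n in names if n.endswith('ACCENT')]      # would be stripped
missed = [n for n in names if not n.endswith('ACCENT')]  # would be kept

print('stripped by the suffix test:', caught)
print('kept by the suffix test:', missed)
```

Marks such as COMBINING TILDE and COMBINING CEDILLA end up in the second list, so letters like 'ñ' and 'ç' keep their marks under the suffix test.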
