How to remove only accents but not umlauts from strings in Python

Question

I am using the following code:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
              if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')

which returns

'ewalaieUO'

But I want it to return

'ewalaieÜÖ'

Is there an easier way than replacing each character with str.replace(char_a, char_b)? How can I handle this efficiently?

python string python-unicode

zinyosrim June 15. 17 at 20:28

1 answer

running.t · Accepted Answer · 2017-06-15T21:48:17+0000

So let's start with your test input:

In [1]: test
Out[1]: 'ewaláièÜÖ'

See what happens to it during normalization:

In [2]: [x for x in unicodedata.normalize('NFD', test)]
Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']

And here are the unicodedata categories for each normalized item:

In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']

As you can see, the category Mn contains not only "accents" but also "umlauts". So instead of unicodedata.category, you can use unicodedata.name:

In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
Out[4]: ['LATIN SMALL LETTER E',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER A',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER I',
 'LATIN SMALL LETTER E',
 'COMBINING GRAVE ACCENT',
 'LATIN CAPITAL LETTER U',
 'COMBINING DIAERESIS',
 'LATIN CAPITAL LETTER O',
 'COMBINING DIAERESIS']

Here the accents are named COMBINING ACUTE/GRAVE ACCENT and the umlauts are named COMBINING DIAERESIS. So here is my suggestion on how to fix your code:

def strip_accents(s):
    # name() raises ValueError for unnamed characters (e.g. '\n'),
    # so pass '' as the default.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.name(c, '').endswith('ACCENT'))

strip_accents(test)
'ewalaieÜÖ'
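
Two things to watch with a filter based only on the name: standalone characters such as '`' (GRAVE ACCENT) and '^' (CIRCUMFLEX ACCENT) also have names ending in ACCENT, and the returned string stays in decomposed (NFD) form, so 'Ü' is still two code points. Below is a minimal sketch of a variant, assuming that only nonspacing marks (category Mn) should be removed and that a recomposed (NFC) result is wanted. The name strip_accents_nfc is made up for this example.

```python
import unicodedata

def strip_accents_nfc(s):
    """Remove combining marks whose name ends in 'ACCENT', then recompose."""
    kept = []
    for c in unicodedata.normalize('NFD', s):
        is_mark = unicodedata.category(c) == 'Mn'
        # name() raises ValueError for unnamed characters (e.g. '\n')
        # unless a default is supplied.
        is_accent = unicodedata.name(c, '').endswith('ACCENT')
        if not (is_mark and is_accent):
            kept.append(c)
    # NFC merges 'U' + COMBINING DIAERESIS back into the single code point 'Ü'.
    return unicodedata.normalize('NFC', ''.join(kept))

print(strip_accents_nfc('ewaláièÜÖ'))
```

Marks whose names do not end in ACCENT (for example COMBINING TILDE or COMBINING CEDILLA) are left alone here, just as in the function above, so extend the condition if those should go too.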

Also, as you can read in the unicodedata documentation, this module is just a wrapper around the Unicode Character Database, so please take a look at the list of names in that database to make sure this approach covers everything you need.
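
One way to do that check without leaving Python is to scan every code point and list the nonspacing marks whose names end in ACCENT. This is only a sketch: the helper name combining_marks_by_suffix is hypothetical, and the exact list depends on the Unicode version bundled with your Python build.

```python
import sys
import unicodedata

def combining_marks_by_suffix(suffix='ACCENT'):
    """Return {name: character} for every Mn character whose name ends in `suffix`."""
    found = {}
    for codepoint in range(sys.maxunicode + 1):
        c = chr(codepoint)
        if unicodedata.category(c) != 'Mn':
            continue
        # Unnamed characters get '' instead of raising ValueError.
        name = unicodedata.name(c, '')
        if name.endswith(suffix):
            found[name] = c
    return found

accent_marks = combining_marks_by_suffix()
for name in sorted(accent_marks):
    print(f'U+{ord(accent_marks[name]):04X}  {name}')
```

Anything in the printed list is what the name-based filter would strip, and anything missing from it (such as COMBINING DIAERESIS) is what it would keep.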
