How to remove only accents but not umlauts from strings in Python
I am using the following code:
import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')
which returns
'ewalaieUO'
But I want it to return
'ewalaieÜÖ'
Is there an easier way than replacing characters one by one with str.replace(char_a, char_b)? How can I handle this effectively?
So let's start with your test input:
In [1]: test
Out[1]: 'ewaláièÜÖ'
See what happens to it during normalization:
In [2]: [x for x in unicodedata.normalize('NFD', test)]
Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']
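To make the decomposition concrete, here is a minimal sketch of the NFD/NFC round trip for a single character. The variable names are only for illustration. It shows that NFD turns one precomposed code point into a base letter plus a combining mark, and that NFC puts them back together, which matters because a filtered NFD string stays decomposed unless you recompose it.

```python
import unicodedata

s = '\u00dc'  # LATIN CAPITAL LETTER U WITH DIAERESIS, i.e. 'Ü' as one code point

decomposed = unicodedata.normalize('NFD', s)           # 'U' followed by U+0308
recomposed = unicodedata.normalize('NFC', decomposed)  # back to a single code point

print(len(s), len(decomposed), len(recomposed))  # 1 2 1
print(decomposed == s, recomposed == s)          # False True
```

The decomposed and precomposed forms usually render identically, but they do not compare equal.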
And here are the unicodedata categories for each normalized item:
In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']
As you can see, the category Mn contains not only "accents" but also "umlauts". So instead of unicodedata.category you can use unicodedata.name:
In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
Out[4]: ['LATIN SMALL LETTER E',
'LATIN SMALL LETTER W',
'LATIN SMALL LETTER A',
'LATIN SMALL LETTER L',
'LATIN SMALL LETTER A',
'COMBINING ACUTE ACCENT',
'LATIN SMALL LETTER I',
'LATIN SMALL LETTER E',
'COMBINING GRAVE ACCENT',
'LATIN CAPITAL LETTER U',
'COMBINING DIAERESIS',
'LATIN CAPITAL LETTER O',
'COMBINING DIAERESIS']
Here the accents are named COMBINING ACUTE/GRAVE ACCENT and the umlauts COMBINING DIAERESIS. So here is my suggestion on how to fix your code:
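def strip_accents(s):
    # Only drop combining marks (Mn) named '... ACCENT'; name(c, '') avoids
    # a ValueError for unnamed characters such as '\n'.
    stripped = ''.join(c for c in unicodedata.normalize('NFD', s)
                       if not (unicodedata.category(c) == 'Mn' and
                               unicodedata.name(c, '').endswith('ACCENT')))
    return unicodedata.normalize('NFC', stripped)  # recompose 'U' + diaeresis
strip_accents(test)
'ewalaieÜÖ'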
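If what you really want is "remove every combining mark except the umlaut", an allow-list may be easier to reason about than matching on names. The following is only a sketch under that assumption: strip_marks_except and its keep parameter are hypothetical names, not part of unicodedata. It drops every character in category Mn unless it is listed in keep, then recomposes with NFC.

```python
import unicodedata

def strip_marks_except(s, keep=('\u0308',)):
    """Remove combining marks (category Mn) except those listed in `keep`.

    Hypothetical helper for illustration; the default keeps only
    U+0308 COMBINING DIAERESIS, so umlauts survive.
    """
    decomposed = unicodedata.normalize('NFD', s)
    filtered = ''.join(
        c for c in decomposed
        if unicodedata.category(c) != 'Mn' or c in keep
    )
    # Recompose so that 'U' + U+0308 becomes a single precomposed 'Ü' again.
    return unicodedata.normalize('NFC', filtered)

print(strip_marks_except('ewaláièÜÖ'))  # ewalaieÜÖ
```

Unlike the name-based filter, this also removes marks such as the tilde or the cedilla, so 'ñ' becomes 'n'. Note that the whole result is returned in NFC, even if the input was in a different normalization form.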
Also, as you can read in the unicodedata documentation, this module is just a wrapper around the Unicode Character Database, so please take a look at the list of names in that database to make sure this approach covers everything you need.
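One rough way to check coverage without reading the whole database is to list the names in the Combining Diacritical Marks block (U+0300 to U+036F) and split them by whether they end in 'ACCENT'. This is a sketch only: the variable names are made up for illustration, and combining marks also exist outside this block.

```python
import unicodedata

# Combining Diacritical Marks block: U+0300 through U+036F.
marks = [chr(cp) for cp in range(0x0300, 0x0370)]
names = [unicodedata.name(m, '') for m in marks]

# Names that a filter based on .endswith('ACCENT') would remove ...
caught = [n for n in names if n.endswith('ACCENT')]
# ... and named marks that it would leave in place.
missed = [n for n in names if n and not n.endswith('ACCENT')]

print(caught)
print(missed[:10])
```

Marks such as COMBINING TILDE and COMBINING CEDILLA end up in the second list, so a name-based filter leaves 'ñ' and 'ç' untouched. Whether that is what you want depends on your data.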