Two seemingly identical unicode strings turn out to be different under repr(). How can I fix this?
I have two lists of unicode strings: one contains words taken from a text file, and the other contains sound file names from a directory with their extensions stripped. Some words appear in both lists. I tried to find matches using re.search(ur'(?iu)\b%s\b' % string1, string2), fnmatch, and even the simple comparison string1 == string2, all of which worked when I typed the first list in by hand for testing, but failed with the actual word list read from the text file.
To find out why, I tracked the Vietnamese word chào, which is present in both lists. isinstance(string, unicode) confirmed that both strings were unicode. However, repr() returned u'ch\xe0o' for one string and u'cha\u0300o' for the other. So it's pretty clear why they don't match.
So I seem to have found the cause, but I'm not sure how to fix it. I tried .decode('utf-8'), as I thought \xe0 might be UTF-8, but all it did was raise a Unicode encoding error. Also, if both strings are unicode and represent the same word, shouldn't they be equal? Running print('%s Vs. %s' % (string1, string2)) outputs chào Vs. chào
I'm lost here.
Many thanks for your help.
Some Unicode characters can be written in more than one way, as you discovered: either as a single precomposed code point or as a base code point followed by a combining code point. The character \u0300 is COMBINING GRAVE ACCENT, which adds a grave accent to the preceding character.
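To see the difference for yourself, you can list the code points in each string. This is only a minimal sketch: the helper name describe is made up for illustration, and the two literals are the repr() values from the question.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    # 'UNKNOWN' is a fallback label for code points that have no name.
    return [('U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN')) for ch in text]

precomposed = u'ch\xe0o'     # "à" stored as one code point
decomposed = u'cha\u0300o'   # "a" followed by a combining accent

for label, word in (('precomposed', precomposed), ('decomposed', decomposed)):
    print('%s: %d code points' % (label, len(word)))
    for code_point, name in describe(word):
        print('  %s  %s' % (code_point, name))
```

The first string should show LATIN SMALL LETTER A WITH GRAVE as a single entry, while the second shows LATIN SMALL LETTER A followed by COMBINING GRAVE ACCENT.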
The process of converting a string to a common representation is called normalization. You can use the unicodedata module for this:
import unicodedata

def n(s):
    return unicodedata.normalize('NFKC', s)

>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
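For the two lists in the question, one possible approach is to normalize both sides before comparing. The sketch below rests on assumptions: the names normalize_word and find_matches and the sample lists are invented for illustration, and lower() is only a rough stand-in for the case-insensitive regex flag used in the question.

```python
import unicodedata

def normalize_word(word, form='NFKC'):
    """Return word in the given normalization form, lowercased for comparison."""
    return unicodedata.normalize(form, word).lower()

def find_matches(words, file_stems):
    """Return the entries of words that also occur in file_stems after normalization."""
    stems = set(normalize_word(stem) for stem in file_stems)
    return [word for word in words if normalize_word(word) in stems]

# Hypothetical sample data: one list precomposed, the other decomposed.
words = [u'ch\xe0o', u'xin']
file_stems = [u'Cha\u0300o', u'tam biet']

matches = find_matches(words, file_stems)
print(matches == [u'ch\xe0o'])
```

Normalizing once when each list is loaded, rather than on every comparison, would probably be the cheaper option for large lists.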
The problem seems to be the ambiguous representation of grave accents in Unicode. There is LATIN SMALL LETTER A WITH GRAVE (U+00E0), and there is COMBINING GRAVE ACCENT (U+0300), which combined with "a" renders the same as the first. So the two representations are of the same character. There is actually a Unicode term for this: Unicode equivalence.
To handle this in Python, call unicodedata.normalize on each string before comparing. I tried the "NFC" form, which returns u'ch\xe0o' for both strings.
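As a rough illustration of what the normalization forms do to the word from the question, the sketch below compares NFC (composed) and NFD (decomposed) output. The variable names are arbitrary, and the ligature line at the end is an extra example that is not from the question; it is included only to show that "NFKC" additionally folds compatibility characters, which may or may not be what you want for word matching.

```python
import unicodedata

precomposed = u'ch\xe0o'     # repr() of the first string in the question
decomposed = u'cha\u0300o'   # repr() of the second string

# NFC composes base letter + accent; NFD splits them apart.
nfc = [unicodedata.normalize('NFC', s) for s in (precomposed, decomposed)]
nfd = [unicodedata.normalize('NFD', s) for s in (precomposed, decomposed)]

print(precomposed == decomposed)   # False: different code point sequences
print(nfc[0] == nfc[1])            # True: both become u'ch\xe0o'
print(nfd[0] == nfd[1])            # True: both become u'cha\u0300o'

# NFKC goes further than NFC: the "fi" ligature U+FB01 becomes plain "fi".
ligature = u'\ufb01'
print(unicodedata.normalize('NFC', ligature) == ligature)   # True, left untouched
print(unicodedata.normalize('NFKC', ligature) == u'fi')     # True, folded
```

Either form works for the comparison as long as both strings go through the same one.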