Why can't I normalize this random unicode string?
I need to compute the Levenshtein edit distance on unicode strings, which means that two lines containing identical content must be normalized first, so that equivalent code-point sequences don't inflate the distance.
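To illustrate why normalization matters here: visually identical text can be stored as different code-point sequences. A minimal sketch (Python 3 syntax, though the question's code is Python 2):

```python
import unicodedata

composed = "\u00e9"     # 'é' as one precomposed code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Raw comparison sees two different sequences, so a naive edit
# distance between them would be nonzero.
assert composed != decomposed

# After normalizing both to the same form (NFD here), they compare
# equal, and the edit distance correctly becomes 0.
assert unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
```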
This is how I generate random unicode strings for my tests:
import random

def random_unicode(length=10):
    ru = lambda: unichr(random.randint(0, 0x10ffff))
    return ''.join([ru() for _ in xrange(length)])
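As an aside, `unichr` and `xrange` are Python 2-only. A roughly equivalent generator for Python 3 might look like this (a sketch; note that `random.randint(0, 0x10FFFF)` can also produce unassigned code points and lone surrogates):

```python
import random

def random_unicode(length=10):
    # In Python 3, chr() covers the full range up to U+10FFFF,
    # and every str is a unicode string.
    return ''.join(chr(random.randint(0, 0x10FFFF)) for _ in range(length))
```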
And here's the simplest test case:
import unicodedata
uni = random_unicode()
unicodedata.normalize(uni, 'NFD')
And here's the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
I checked that uni is, indeed, a unicode object:
u'\U00020d93\U000fb2e6\U0005709a\U000bc31e\U00080262\U00034f00\U00059941\U0002dd09\U00074f6d\U0009ef7a'
Can someone enlighten me?
You have swapped the arguments to normalize. From the relevant documentation:
unicodedata.normalize(form, unistr)
Return the normal form *form* for the Unicode string *unistr*. Valid values for *form* are "NFC", "NFKC", "NFD", and "NFKD".
The first argument is the form and the second is the string to be normalized. This works as expected:
unicodedata.normalize('NFD', uni)
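With the arguments in the right order, the call succeeds even on arbitrary code points. A small check (Python 3 shown; the sample characters are taken from the question's own output):

```python
import unicodedata

# Two of the code points printed in the question.
uni = "\U00020d93\U0002dd09"

# Form first, string second.
normalized = unicodedata.normalize("NFD", uni)
assert isinstance(normalized, str)
```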