Why can't I normalize this random unicode string?

I need to estimate the Levenshtein edit distance on Unicode strings, which means that two strings containing identical content must be normalized first so that equivalent representations of the same text don't inflate the edit distance.
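To make the problem concrete (with a made-up pair of strings, not my real data): the precomposed and decomposed spellings of the same character compare unequal, so a naive edit distance between them is nonzero even though the text is identical.

a = u'\u00e9'    # U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed)
b = u'e\u0301'   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed)
print(a == b)    # False -- this mismatch is what normalization should remove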

This is how I generate random unicode strings for my tests:

import random

def random_unicode(length=10):
    # Pick random code points from the full Unicode range (needs a wide Python 2 build for code points above U+FFFF).
    ru = lambda: unichr(random.randint(0, 0x10ffff))
    return ''.join([ru() for _ in xrange(length)])

And here's the simplest test case:

import unicodedata
uni = random_unicode()
unicodedata.normalize(uni, 'NFD')

And here's the error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

I checked that uni is, indeed, a unicode object:

u'\U00020d93\U000fb2e6\U0005709a\U000bc31e\U00080262\U00034f00\U00059941\U0002dd09\U00074f6d\U0009ef7a'

Can someone enlighten me?

1 answer


You have swapped the arguments to normalize. From the relevant documentation:

unicodedata.normalize(form, unistr)

Return the normal form *form* for the Unicode string *unistr*. Valid values for *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.



The first argument is the normalization form and the second is the string to normalize. (That is also why the error mentions the ascii codec: the form argument must be a plain ASCII string such as 'NFD', so when your random unicode string lands in that position, Python 2 tries to encode it to ASCII and fails.) This works fine:

unicodedata.normalize('NFD', uni)
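A quick sanity check with the string from the question (a sketch; the exact output varies since the input is random):

import unicodedata
normalized = unicodedata.normalize('NFD', uni)   # form first, then the string to normalize
print(type(normalized))                          # <type 'unicode'> on Python 2
print(unicodedata.normalize('NFD', normalized) == normalized)   # True: normalization is idempotent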

      
