Why can't I normalize this random unicode string?
I need to compute the Levenshtein edit distance on unicode strings, which means that two lines containing identical content must be normalized first, so that equivalent code-point sequences don't inflate the distance.
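To illustrate why normalization matters here: visually identical text can be stored as different code-point sequences. A minimal sketch (Python 3 syntax, though the question's code is Python 2):

```python
import unicodedata

composed = "\u00e9"     # 'é' as one precomposed code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Raw comparison sees two different sequences, so a naive edit
# distance between them would be nonzero.
assert composed != decomposed

# After normalizing both to the same form (NFD here), they compare
# equal, and the edit distance correctly becomes 0.
assert unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
```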
This is how I generate random unicode strings for my tests:
import random

def random_unicode(length=10):
    ru = lambda: unichr(random.randint(0, 0x10ffff))
    return ''.join([ru() for _ in xrange(length)])
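As an aside, `unichr` and `xrange` are Python 2-only. A roughly equivalent generator for Python 3 might look like this (a sketch; note that `random.randint(0, 0x10FFFF)` can also produce unassigned code points and lone surrogates):

```python
import random

def random_unicode(length=10):
    # In Python 3, chr() covers the full range up to U+10FFFF,
    # and every str is a unicode string.
    return ''.join(chr(random.randint(0, 0x10FFFF)) for _ in range(length))
```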
And here's the simplest test case:
import unicodedata
uni = random_unicode()
unicodedata.normalize(uni, 'NFD')
And here's the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
I checked that uni is, indeed, a unicode object:
u'\U00020d93\U000fb2e6\U0005709a\U000bc31e\U00080262\U00034f00\U00059941\U0002dd09\U00074f6d\U0009ef7a'
Can someone enlighten me?
You have swapped the arguments to normalize. From the relevant documentation:
unicodedata.normalize(form, unistr)
Return the normal form *form* for the Unicode string *unistr*. Valid values for *form* are "NFC", "NFKC", "NFD", and "NFKD".
The first argument is the form and the second is the string to be normalized. This works as expected:
unicodedata.normalize('NFD', uni)
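With the arguments in the right order, the call succeeds even on arbitrary code points. A small check (Python 3 shown; the sample characters are taken from the question's own output):

```python
import unicodedata

# Two of the code points printed in the question.
uni = "\U00020d93\U0002dd09"

# Form first, string second.
normalized = unicodedata.normalize("NFD", uni)
assert isinstance(normalized, str)
```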