Converting unicode number to ascii string

I am looking for an easy way to convert a number in a Unicode string to an ASCII string in Python. For example, the input:

input = u'\u0663\u0669\u0668\u066b\u0664\u0667'

      

should give '398.47'.

I started with:

NUMERALS_TRANSLATION_TABLE = {0x660:ord("0"), 0x661:ord("1"), 0x662:ord("2"), 0x663:ord("3"), 0x664:ord("4"), 0x665:ord("5"), 0x666:ord("6"), 0x667:ord("7"), 0x668:ord("8"), 0x669:ord("9"), 0x66b:ord(".")}
input.translate(NUMERALS_TRANSLATION_TABLE)

      

This solution works, but I want to support all number-related characters in Unicode, not just the Arabic ones. I can translate the digits by iterating over the Unicode string and calling unicodedata.digit(input[i]) for each character. I don't like this solution because it doesn't handle '\u066b' or '\u2013'. I could handle those with translate as a fallback, but I'm not sure whether there are other such characters that I'm not yet aware of, so I'm looking for a better, more elegant solution.

Any suggestions are greatly appreciated.


2 answers


Using unicodedata.digit() to look up the digit value of "numeric" code points is the correct method:

>>> import unicodedata
>>> unicodedata.digit(u'\u0663')
3

      

It uses the standard Unicode character data to look up the numeric value of a given code point.
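As a rough illustration of that lookup, here is a minimal sketch written for Python 3 (so plain str literals stand in for the u'...' strings used elsewhere in this thread). The helper name digit_or_none is made up for this example; it relies on the optional second argument of unicodedata.digit(), which is returned instead of raising ValueError when the code point has no digit value:

```python
import unicodedata


def digit_or_none(ch):
    """Return the digit value of ch, or None if Unicode defines none.

    digit_or_none is a hypothetical helper name used only in this sketch.
    """
    # The second argument is the default returned for non-digit characters.
    return unicodedata.digit(ch, None)


# Digit three in three different scripts, then two non-digit characters.
samples = ['\u0663', '\u0969', '\uff13', '\u066b', 'x']
for ch in samples:
    print('U+%04X' % ord(ch), digit_or_none(ch))
```

The first three samples (Arabic-Indic, Devanagari and fullwidth digit three) all come back as 3, while the Arabic decimal separator and the letter x come back as None, which is why the separator needs separate handling.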

You can build a translation table using str.isdigit() to test for digits; it is true for every code point for which the standard defines a digit value. For decimal points, you can look for DECIMAL SEPARATOR in the character name; the standard doesn't track them separately through any other property:



NUMERALS_TRANSLATION_TABLE = {
    i: unicode(unicodedata.digit(unichr(i)))
    for i in range(2 ** 16) if unichr(i).isdigit()}
NUMERALS_TRANSLATION_TABLE.update(
    (i, u'.') for i in range(2 ** 16)
    if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))

      

This creates a table with 447 entries, including 2 decimal points: U+066B ARABIC DECIMAL SEPARATOR and U+2396 DECIMAL SEPARATOR KEY SYMBOL. The latter is really just a made-up symbol to print on the decimal separator key of a numeric keypad, for cases where the manufacturer doesn't want to commit to printing either a , or a . decimal separator on that key.

Demo:

>>> import unicodedata
>>> NUMERALS_TRANSLATION_TABLE = {
...     i: unicode(unicodedata.digit(unichr(i)))
...     for i in range(2 ** 16) if unichr(i).isdigit()}
>>> NUMERALS_TRANSLATION_TABLE.update(
...     (i, u'.') for i in range(2 ** 16)
...     if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))
>>> input = u'\u0663\u0669\u0668\u066b\u0664\u0667'
>>> input.translate(NUMERALS_TRANSLATION_TABLE)
u'398.47'
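The code above is Python 2 and only scans the first 2 ** 16 code points. For anyone on Python 3, the following is a possible port, not the answerer's own code: it assumes chr() and str in place of unichr() and unicode, and it walks the whole code point range up to sys.maxunicode so that digits outside the Basic Multilingual Plane are picked up as well. The function names build_numerals_table and to_ascii_number are invented for this sketch.

```python
import sys
import unicodedata


def build_numerals_table():
    """Map every Unicode digit to its ASCII digit and separators to '.'.

    Hypothetical helper for this sketch; scans all code points once.
    """
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.isdigit():
            # Any character with a digit value maps to '0'..'9'.
            table[cp] = str(unicodedata.digit(ch))
        elif 'DECIMAL SEPARATOR' in unicodedata.name(ch, ''):
            # Unnamed code points yield '' and are skipped.
            table[cp] = '.'
    return table


# Building the table is the slow part, so do it once at import time.
NUMERALS_TRANSLATION_TABLE = build_numerals_table()


def to_ascii_number(text):
    """Translate digits and decimal separators in text to ASCII."""
    return text.translate(NUMERALS_TRANSLATION_TABLE)


print(to_ascii_number('\u0663\u0669\u0668\u066b\u0664\u0667'))
```

Because the table is keyed on isdigit(), it also rewrites characters such as superscript and circled digits; filter on unicodedata.category(ch) == 'Nd' instead if only true decimal digits should be converted.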

      



>>> from unidecode import unidecode
>>> unidecode(u'\u0663\u0669\u0668\u066b\u0664\u0667')
'398.47'

      


