Converting a Unicode number to an ASCII string
I was looking for an easy way to convert a number from a Unicode string to an ASCII string in Python. For example, the input:
input = u'\u0663\u0669\u0668\u066b\u0664\u0667'
should give '398.47'.
I started with:
NUMERALS_TRANSLATION_TABLE = {0x660:ord("0"), 0x661:ord("1"), 0x662:ord("2"), 0x663:ord("3"), 0x664:ord("4"), 0x665:ord("5"), 0x666:ord("6"), 0x667:ord("7"), 0x668:ord("8"), 0x669:ord("9"), 0x66b:ord(".")}
input.translate(NUMERALS_TRANSLATION_TABLE)
This solution worked, but I want to support all numeric characters in Unicode, not just the Arabic ones. I can translate digits by traversing the Unicode string and running unicodedata.digit(input[i]) for each character. I don't like this solution because it doesn't handle '\u066b' or '\u2013'. I could handle them using translate as a fallback, but I'm not sure whether there are other such characters that I'm not aware of, so I am looking for a better, more elegant solution.
Any suggestions are greatly appreciated.
Using unicodedata.digit() to look up the digit value of "numeric" code points is the correct method:
>>> import unicodedata
>>> unicodedata.digit(u'\u0663')
3
It uses standard Unicode information to find the numeric value for a given code point.
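As a rough illustration of the per-character approach from the question, here is a minimal sketch written for Python 3 (an assumption; the rest of this answer uses Python 2). The helper name to_ascii_digits is made up for this example. It relies on the optional default argument of unicodedata.digit(), so characters without a digit value are passed through unchanged instead of raising ValueError:

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every character that has a Unicode digit value with its ASCII digit.

    Hypothetical helper for illustration; characters without a digit value
    (including decimal separators such as U+066B) are left untouched.
    """
    out = []
    for ch in text:
        value = unicodedata.digit(ch, None)  # None when ch has no digit value
        out.append(ch if value is None else str(value))
    return "".join(out)


print(to_ascii_digits("\u0663\u0669\u0668"))  # 398
```

Note that this handles digits only; the decimal separator still needs separate treatment.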
You can build a translation table using str.isdigit() to test for digits; this is true for all code points for which the standard defines a digit value. For decimal points, you can search for DECIMAL SEPARATOR in the code point name; the standard does not track them separately through any other property:
NUMERALS_TRANSLATION_TABLE = {
    i: unicode(unicodedata.digit(unichr(i)))
    for i in range(2 ** 16) if unichr(i).isdigit()}
NUMERALS_TRANSLATION_TABLE.update(
    (i, u'.') for i in range(2 ** 16)
    if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))
This creates a table with 447 entries, including 2 decimal points at U+066B ARABIC DECIMAL SEPARATOR and U+2396 DECIMAL SEPARATOR KEY SYMBOL; the latter is really just a symbol created to label the decimal separator key on a numeric keypad, for cases where the manufacturer doesn't want to commit to printing either , or . on that key.
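The table above is Python 2 code (unicode, unichr) and only scans the Basic Multilingual Plane. A possible Python 3 port, offered as a sketch rather than a drop-in replacement, could look like the following. The names build_numerals_table and NUMERALS_TABLE are made up for this example, and scanning up to sys.maxunicode is my own choice so that digits outside the BMP are included too. The number of entries will depend on the Unicode version bundled with your Python, so it will not necessarily match the count quoted above:

```python
import sys
import unicodedata


def build_numerals_table():
    """Map every code point with a digit value to its ASCII digit,
    and every code point named '... DECIMAL SEPARATOR ...' to '.'."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        digit = unicodedata.digit(ch, None)  # None when there is no digit value
        if digit is not None:
            table[cp] = str(digit)
        elif "DECIMAL SEPARATOR" in unicodedata.name(ch, ""):
            table[cp] = "."
    return table


# Build once and reuse; the scan covers the whole Unicode range.
NUMERALS_TABLE = build_numerals_table()

print("\u0663\u0669\u0668\u066b\u0664\u0667".translate(NUMERALS_TABLE))  # 398.47
```

In Python 3, str.translate() accepts a dict that maps code points to replacement strings, so the table can be passed to it directly.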
Demo:
>>> import unicodedata
>>> NUMERALS_TRANSLATION_TABLE = {
...     i: unicode(unicodedata.digit(unichr(i)))
...     for i in range(2 ** 16) if unichr(i).isdigit()}
>>> NUMERALS_TRANSLATION_TABLE.update(
...     (i, u'.') for i in range(2 ** 16)
...     if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))
>>> input = u'\u0663\u0669\u0668\u066b\u0664\u0667'
>>> input.translate(NUMERALS_TRANSLATION_TABLE)
u'398.47'