Converting file to Ascii throws exceptions
As a result of my previous question, I coded this:
def ConvertFileToAscii(args, filePath):
    try:
        # Firstly, make sure that the file is writable by all, otherwise we can't update it
        os.chmod(filePath, 0o666)
        with open(filePath, "rb") as file:
            contentOfFile = file.read()
        unicodeData = contentOfFile.decode("utf-8")
        asciiData = unicodeData.encode("ascii", "ignore")
        asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')
        temporaryFile = tempfile.NamedTemporaryFile(mode='wt', delete=False)
        temporaryFileName = temporaryFile.name
        with open(temporaryFileName, 'wb') as file:
            file.write(asciiData)
        if ((args.info) or (args.diagnostics)):
            print(filePath + ' converted to ASCII and stored in ' + temporaryFileName)
        return temporaryFileName
    #
    except KeyboardInterrupt:
        raise
    except Exception as e:
        print('!!!!!!!!!!!!!!!\nException while trying to convert ' + filePath + ' to ASCII')
        print(e)
        exc_type, exc_value, exc_traceback = sys.exc_info()
        print(traceback.format_exception(exc_type, exc_value, exc_traceback))
        if args.break_on_error:
            sys.exit('Break on error\n')
When I run it, I get exceptions like this:
['Traceback (most recent call last):\n', '  File "/home/ker4hi/tools/xmlExpand/xmlExpand.py", line 99, in ConvertFileToAscii\n    unicodeData = contentOfFile.decode("utf-8")\n', "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1081: invalid start byte"]
What am I doing wrong?
I really don't care about the data loss from converting to ASCII. If 0x9C is a U with a diacritic mark (an umlaut), I can live without it.
How can I convert files like this to contain only pure ASCII characters? Do I really need to open them as binary and check every byte?
I really don't care about the data loss from converting to ASCII .... How can I convert files like this to contain only pure ASCII characters?
One way is to use the 'replace' error handler of the decode method. The advantage of 'replace' over 'ignore' is that you get placeholders for the missing values, which helps prevent misinterpretation of the text.
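For instance (a minimal illustration; the byte string here is just an example, not the asker's data):

```python
# 0xf6 is not a valid ASCII byte.
raw = b'L\xf6wis'

# 'ignore' silently drops the bad byte -- nothing marks where data was lost.
print(raw.decode('ascii', 'ignore'))   # Lwis

# 'replace' substitutes U+FFFD, so the loss stays visible.
print(raw.decode('ascii', 'replace'))  # L\ufffdwis
```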
Make sure to decode as ASCII rather than UTF-8; otherwise you may lose adjacent ASCII characters when the decoder tries to resynchronize.
Finally, run encode('ascii') after the decoding step; otherwise you are left with a unicode string rather than a byte string.
>>> string_of_unknown_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_unknown_encoding.decode('ascii', 'replace')
>>> back_to_bytes = now_in_unicode.replace('\ufffd', '?').encode('ascii')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'
However, The Right Way™ is to start by taking care of the data loss and using the correct encoding (obviously your input is not UTF-8, otherwise the decode would not fail):
>>> string_of_known_latin1_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_known_latin1_encoding.decode('latin-1')
>>> back_to_bytes = now_in_unicode.encode('ascii', 'replace')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'
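As an aside, the unicodedata.normalize('NFKD', ...) call in the question's own code takes a third route: NFKD decomposition splits an accented letter into a base letter plus a combining mark, so a subsequent encode('ascii', 'ignore') keeps the base letter instead of dropping the character entirely (a sketch, assuming the text has already been decoded correctly):

```python
import unicodedata

text = 'L\u00f6wis'                                # 'Löwis'
decomposed = unicodedata.normalize('NFKD', text)   # 'L' + 'o' + combining diaeresis + 'wis'
print(decomposed.encode('ascii', 'ignore'))        # b'Lowis'
```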
You don't need to load the entire file into memory and call .decode() on it. open() has an encoding parameter (use io.open() in Python 2):
with open(filename, encoding='ascii', errors='ignore') as file:
ascii_char = file.read(1)
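Putting that together, the body of the question's ConvertFileToAscii could shrink to something like this (a sketch, not a drop-in replacement; copy_as_ascii is a made-up name, and errors='ignore' matches the asker's stated tolerance for silently dropping data):

```python
import tempfile

def copy_as_ascii(src_path):
    # Read with errors='ignore' so non-ASCII bytes are dropped on the way in;
    # what remains is guaranteed to encode cleanly as ASCII on the way out.
    with open(src_path, encoding='ascii', errors='ignore') as src, \
         tempfile.NamedTemporaryFile('w', encoding='ascii', delete=False) as dst:
        for chunk in iter(lambda: src.read(8192), ''):
            dst.write(chunk)
        return dst.name
```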
If you need an ASCII transliteration of Unicode text, consider unidecode.