Python decoding with errors="replace"

Using Python 2.7, I grab some HTML from a site as byte strings and immediately decode it to unicode. Since I need to find out later where the decode errors occurred, I thought it would be better to use errors="replace" to avoid exceptions from non-ASCII characters:

linkname = curlinkname.decode("utf-8", errors="replace")

In most cases, this replaces the problem character with a placeholder. However, when I run the code, I still get an exception from this line at one specific character (ū):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u016b' in position 1: character maps to <undefined>

What's happening?

1 answer


You need to install the chardet library first:

pip install chardet

Then use it to detect the encoding before decoding:

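import chardet
# detect() guesses the encoding from the raw bytes; 'encoding' may be None
code = chardet.detect(curlinkname)
encoding = code['encoding'] or "utf-8"  # fall back to UTF-8 if detection fails
linkname = curlinkname.decode(encoding, errors="replace")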
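A side note on the traceback itself: it is a UnicodeEncodeError from the 'charmap' codec, which suggests the failure happens when the already decoded unicode string is encoded again (for example when it is printed to a Windows console), not inside decode(). Here is a minimal sketch of that behaviour. It assumes the output encoding is cp1252 and uses a made-up sample byte string; neither comes from the question.

```python
# Hypothetical sample: UTF-8 bytes containing U+016B (the character from the traceback)
raw = b"k\xc5\xabka"

# Decoding valid UTF-8 succeeds; errors="replace" only matters for invalid bytes
text = raw.decode("utf-8", "replace")

# Encoding to a legacy code page that lacks U+016B is what raises the exception
try:
    text.encode("cp1252")  # cp1252 is an assumed console encoding
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Passing an error handler on the encode side substitutes '?' instead of raising
safe = text.encode("cp1252", "replace")
print(repr(safe))
```

If that is what is happening here, the error handler would need to go on the encode step (or the output would need to be UTF-8) rather than on decode().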

      
