Fixing corrupted encoding (using Python)

I have a bunch of text files containing Korean characters with incorrect encodings. In particular, it seems that the characters are EUC-KR encoded, but the files themselves were saved as UTF-8 with a BOM.

So far, I have managed to fix the file with the following:

  • Open the file with EditPlus (it shows the file's encoding as UTF8+BOM)

  • In EditPlus, save the file as ANSI

  • Finally, in Python:

    import codecs

    # Read the ANSI-saved file, decoding its bytes as EUC-KR
    with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
        contents = source_file.read()

    # Write the decoded text back out as UTF-8
    with open(html, 'w+b') as dest_file:
        dest_file.write(contents.encode('utf-8'))
    

I want to automate this, but I haven't been able to. I can open the source file in Python:

codecs.open(html, 'rb', encoding='utf-8-sig')


However, I haven't been able to figure out how to do step 2 (saving as ANSI) from Python.



1 answer


I assume that the text was first encoded as EUC-KR and the resulting bytes were then encoded again as UTF-8. If so, encoding to Latin-1 (which is, roughly, what Windows calls ANSI) is indeed the way to get back to the original EUC-KR byte sequence.
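
For illustration, this is roughly how a file ends up in that state; a minimal sketch, assuming the intermediate step really was a single-byte (Latin-1-style) misreading, with '미술' as a sample string:

text = u'미술'
euc_bytes = text.encode('euc-kr')   # the original EUC-KR bytes: '\xb9\xcc\xbc\xfa'
# The EUC-KR bytes get misread as Latin-1 text and saved again as UTF-8 with a BOM:
corrupted = b'\xef\xbb\xbf' + euc_bytes.decode('latin1').encode('utf-8')
# 'corrupted' now matches the 'broken' byte string shown in the demo below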

Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:

import io

# Read the file as UTF-8 (the BOM is stripped by 'utf-8-sig'), then undo the
# bogus UTF-8 layer: encode back to Latin-1 bytes and decode those as EUC-KR
with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

# Write the recovered text back out as proper UTF-8
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
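
Since there is a whole batch of files to fix, the same conversion can be run in a loop. A minimal sketch, assuming the files can be picked up with a glob pattern (adjust the pattern and path to your layout):

import glob
import io

for html in glob.glob('*.html'):
    # Same round trip as above, applied to each matching file
    with io.open(html, encoding='utf-8-sig') as infh:
        data = infh.read().encode('latin1').decode('euc-kr')
    with io.open(html, 'w', encoding='utf8') as outfh:
        outfh.write(data)

Rerunning this on a file that has already been converted should fail with a UnicodeEncodeError at the Latin-1 step (Korean characters cannot be encoded as Latin-1), so already-fixed files are not silently mangled a second time.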




I use the io.open() function instead of codecs because it is the more reliable method; io is the new Python 3 I/O library, which has also been ported to Python 2.

Demo:

>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
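
The demo above is a Python 2 session; on Python 3 the equivalent would use a bytes literal and print() as a function:

>>> broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print(broken.decode('utf-8-sig').encode('latin1').decode('euc-kr'))
미술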

