How to remove leading Unicode characters from a file?

I am processing several thousand XML files and have a few problematic ones.

In each case, the file begins with leading Unicode bytes such as C3 AF C2 BB C2 BF or EF BB BF (a UTF-8 byte-order mark), etc.
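Those two byte sequences can be stripped from the raw bytes before decoding. A minimal sketch (the constant and helper names are my own; EF BB BF is the UTF-8 BOM, and C3 AF C2 BB C2 BF is that same BOM after being UTF-8-encoded a second time):

```python
# Two kinds of leading bytes seen in the problem files:
BOM = b'\xef\xbb\xbf'                      # UTF-8 byte-order mark
DOUBLE_BOM = b'\xc3\xaf\xc2\xbb\xc2\xbf'   # BOM that was UTF-8-encoded twice

def strip_leading_boms(data: bytes) -> bytes:
    """Remove a leading BOM or double-encoded BOM from raw file bytes."""
    for prefix in (DOUBLE_BOM, BOM):
        if data.startswith(prefix):
            return data[len(prefix):]
    return data

print(strip_leading_boms(BOM + b'<root/>'))         # b'<root/>'
print(strip_leading_boms(DOUBLE_BOM + b'<root/>'))  # b'<root/>'
```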

In all cases, the file contains only ASCII characters (after the header bytes), so there is no risk of data loss converting it to ASCII.

I am not allowed to change the contents of files on disk, only use them as input to my script.

In the simplest case, I would be happy to convert such files to ASCII (all input files are parsed, some changes are made, and the results are written to the output directory, where a second script will process them).

How should I code this? When I try:

with open(filePath, "rb") as file:
    contentOfFile = file.read()

unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")

with open(filePath, 'wt')  as file:
    file.write(asciiData)


I am getting the error: must be str, not bytes.

I have also tried

    asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')


with the same result. How can I fix this?

Or is there another way to handle these files?


1 answer


...
asciiData = unicodeData.encode("ascii", "ignore")

asciiData is a bytes object, because it has been encoded. When opening the output file, you need to use binary mode instead of text mode:



with open(filePath, 'wb')  as file:  # <---
    file.write(asciiData)

