How to remove leading Unicode characters from a file?
I am processing several thousand xml files and have several problematic files.
In each case, they contain leading Unicode characters such as C3 AF C2 BB C2 BF and EF BB BF, etc.
In all cases, the file contains only ASCII characters (after the header bytes), so there is no risk of data loss converting it to ASCII.
I am not allowed to change the contents of files on disk, only use them as input to my script.
In the simplest case, I would be happy to convert such files to ASCII (all input files are parsed, some changes are made, and the results are written to the output directory, where a second script will process them).
How should I code this? When I try:
with open(filePath, "rb") as file:
    contentOfFile = file.read()
unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")
with open(filePath, 'wt') as file:
    file.write(asciiData)
I am getting the error "must be str, not bytes".
I have also tried
asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')
with the same result. How can I fix it?
Or is there any other way to convert the file?
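For reference, the error comes from the write step: encode() returns bytes, while a file opened in 'wt' mode only accepts str. A minimal sketch of one possible approach follows. It assumes the files really are pure ASCII after the header bytes, and that the C3 AF C2 BB C2 BF sequence is a UTF-8 byte order mark (EF BB BF) that was mis-read as Latin-1 and encoded to UTF-8 a second time. The names strip_leading_bom, convert_to_ascii, src_path and dst_path are hypothetical and used only for illustration. The input file is opened read-only and the result goes to a separate output path.

```python
import codecs

# EF BB BF mis-decoded as Latin-1 and re-encoded as UTF-8 (assumed origin).
DOUBLE_ENCODED_BOM = b"\xc3\xaf\xc2\xbb\xc2\xbf"
KNOWN_PREFIXES = (DOUBLE_ENCODED_BOM, codecs.BOM_UTF8)


def strip_leading_bom(raw: bytes) -> bytes:
    """Remove any run of known BOM-like prefixes from the start of raw."""
    stripped = True
    while stripped:
        stripped = False
        for prefix in KNOWN_PREFIXES:
            if raw.startswith(prefix):
                raw = raw[len(prefix):]
                stripped = True
    return raw


def convert_to_ascii(src_path, dst_path):
    """Read src_path unchanged and write a BOM-free copy to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    cleaned = strip_leading_bom(raw)
    # Raises UnicodeDecodeError if any non-ASCII byte remains, instead of
    # silently dropping data the way errors="ignore" would.
    cleaned.decode("ascii")
    # Binary mode accepts bytes, which avoids "must be str, not bytes".
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
```

Working on bytes keeps the line endings exactly as they are in the input. If the content has to be handled as text instead, the alternative is to decode to str and write that str to a file opened in text mode, without calling encode() first.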