Why does "ï»¿" appear in my data?
I downloaded the file 'pi_million_digits.txt' from here:
https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt
Then I used this code to open and read it:
filename = 'pi_million_digits.txt'
with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))
The result is correct, except that it is preceded by these strange characters: "ï»¿3.141..."
What is causing these strange characters? I am stripping lines, so I expect characters like this to be removed.
It looks like you are opening a file that is encoded as UTF-8 with a byte order mark (BOM), but decoding it as ISO-8859-1 (presumably because that is the default encoding on your OS).
If you open it as bytes and read the first line, you should see something like this:
>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
... where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, each of those three bytes becomes a separate character, which is what you are seeing:
>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
... and opening it as UTF-8 shows the actual U+FEFF BOM character:
>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
To remove the mark, use the special encoding utf-8-sig:
>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
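The three decodings above can be reproduced without the file. This is a minimal sketch that uses only the BOM bytes plus a made-up stand-in for the first line (the `payload` value is an assumption for illustration, not the real file content). It also shows why `strip()` does not help: it only removes whitespace, and neither "ï»¿" nor U+FEFF counts as whitespace.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8
payload = bom + b"3.14\n"  # hypothetical stand-in for the file's first line

as_latin1 = payload.decode("iso-8859-1")   # each BOM byte becomes one character
as_utf8 = payload.decode("utf-8")          # BOM is decoded as U+FEFF
as_utf8_sig = payload.decode("utf-8-sig")  # BOM is recognised and dropped

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_utf8_sig))

# strip() removes the trailing newline but leaves the BOM characters in place.
print(repr(as_latin1.strip()))
print(repr(as_utf8.strip()))
```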
The use of next() in the above examples is for demonstration purposes only. In your code, you just need to add an encoding argument to your open() call, e.g.:
with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
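Putting it together, the reading loop from the question could look something like the sketch below. The helper name `read_digits` and the small sample file are made up for illustration, so the example runs without downloading anything; the only real change compared with the question's code is the `encoding` argument.

```python
import os
import tempfile


def read_digits(filename):
    """Join all stripped lines of a text file, skipping a leading UTF-8 BOM."""
    with open(filename, encoding="utf-8-sig") as file_object:
        return "".join(line.strip() for line in file_object)


# Demo on a tiny hypothetical file that starts with a BOM, like the real one.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pi_sample.txt")
    with open(path, "wb") as sample:
        sample.write(b"\xef\xbb\xbf3.1415926535\n  8979323846\n")
    pi_string = read_digits(path)

print(pi_string)
print(len(pi_string))
```

utf-8-sig also reads UTF-8 files that have no BOM, so it is safe to use when you are not sure whether the mark is present.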