Why does "ï»¿" appear in my data?

I downloaded the file 'pi_million_digits.txt' from here:

https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt

Then I used this code to open and read it:

filename = 'pi_million_digits.txt'

with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))

      

However, the result is correct except that it is preceded by some strange characters: "ï»¿3.141...."

What is causing these strange characters? I am stripping lines, so I expect characters like this to be removed.



1 answer


It looks like you are opening a file that is encoded as UTF-8 with a byte order mark (BOM), but decoding it as ISO-8859-1 (presumably because that is the default encoding on your OS).

If you open it as bytes and read the first line, you should see something like this:

>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

      

... where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like you get:

>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

      

... and opening it as UTF-8 shows the actual U+FEFF BOM character:



>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

      

To remove the mark, use the special encoding utf-8-sig:

>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
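The same three results can be reproduced without the file by decoding the leading bytes directly. This is only a small sketch: the short byte string below is a stand-in for the start of the file, not its full contents. It also shows why strip() did not help: with no arguments it removes whitespace only, and neither the three ISO-8859-1 characters nor U+FEFF count as whitespace in Python.

```python
# Stand-in for the first bytes of the file: the UTF-8 BOM, then the first digits.
raw = b'\xef\xbb\xbf3.14'

as_latin1 = raw.decode('iso-8859-1')   # each BOM byte becomes its own character
as_utf8 = raw.decode('utf-8')          # the three bytes become one U+FEFF character
as_utf8_sig = raw.decode('utf-8-sig')  # the BOM is dropped while decoding

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_utf8_sig))

# strip() leaves both prefixed strings unchanged, because nothing at
# either end of them is whitespace.
print(as_latin1.strip() == as_latin1)
print(as_utf8.strip() == as_utf8)
```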

      

The use of next() in the above examples is for demonstration purposes only. In your code, you just need to add an encoding argument to your open() call, e.g.:

with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
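Put together, the whole script might look like the sketch below. So that it runs on its own, it first writes a tiny sample file with a BOM into a temporary directory; the name pi_sample.txt and its two lines of digits are made up for illustration and stand in for pi_million_digits.txt.

```python
import os
import tempfile

# Hypothetical stand-in for pi_million_digits.txt: a UTF-8 BOM followed by
# two short lines of digits.
sample = b'\xef\xbb\xbf3.1415926535\n  8979323846\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'pi_sample.txt')
    with open(filename, 'wb') as file_object:
        file_object.write(sample)

    # utf-8-sig decodes the file as UTF-8 and skips a leading BOM if present.
    with open(filename, encoding='utf-8-sig') as file_object:
        lines = file_object.readlines()

# Same stripping and joining as in the question, written as one expression.
pi_string = ''.join(line.strip() for line in lines)

print(pi_string)
print(len(pi_string))
```

utf-8-sig also reads UTF-8 files that have no BOM, so it is safe to use when you are not sure whether the mark is there.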

      
