Python Requests, CSV, Sha256 and BOM

Question

I am building a set of CSV files for athletes using requests and Python 2.7.

These files are generated by MSFT Report Server and come through as ISO-8859-1, according to requests.

Because I deal with thousands of these every night, I want to go through the files and compare each one to the previous hash for that athlete. If the hash matches, I won't bother saving the file to disk. These files are small, about 6K at most, so there is no buffering / streaming issue.

sha256 fails, however, due to the annoying BOM in these files. I've looked at 10 different "solutions" here and can't find one that strips the BOM through decode/encode so that I can do my sha256.

The workaround I might have to fall back on is to write the file to disk and then sha256 it there. But that seems like very bad form.

If I can remove the BOM at the beginning, I have a process that works with sha256 and saves me from writing unnecessary files.

I think the problem might be that I'm trying to do string operations on a file object. But since the object is still a stream of hex escapes (u'\xff...'), I thought these operations would work ...

Here are the details:

>>> r = requests.get('http://66.73.188.164/ReportServer?%2fCPTC%2fWomens1stHalfDetail&Team=&player=17424&rs:Format=CSV')
>>> r.status_code
200
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x18afb70>
>>> r.encoding
'ISO-8859-1'
>>> print r.headers['content-type']
text/plain
>>> r.text[0]
u'\xff'

My first attempt at converting fails: the text can't even be decoded using the encoding that requests reports!

>>> z = r.text
>>> z.decode('iso-8859-1').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

And in fact the type of z is different from what I expected, perhaps because of the system default (Mac; UTF-8)?

>>> type(z)
<type 'unicode'>
>>> z[0]
u'\xff'
>>> z[0:5]
u'\xff\xfem\x00a'

Various other attempts at decoding and encoding have also let me down; here is one of many such attempts.

>>> z.decode('utf-8-sig').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I'm pretty sure the answer is one line; I just don't see it. Any recommendations are most appreciated.

python csv byte-order-mark python-requests

Todd curry 05 oct. 14 at 13:37

2 answers

You should just be able to use r.content, for example:

r.content.decode('utf8')

Alternatively, you can override r.encoding by setting

r.encoding = 'utf8'

and then you can use r.text without worry.
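A minimal sketch of the r.content approach, with some assumptions stated up front. The '\xff\xfem\x00a' prefix shown in the question looks like a UTF-16 little-endian BOM followed by UTF-16 text, so this sketch decodes with 'utf-16' rather than 'utf8'; that choice is an inference from the question's output, not something the server confirmed. The sample bytes and the helper names utf8_digest and raw_digest are made up for illustration, and sample_content stands in for r.content so that no network call is needed.

```python
import hashlib

# Made-up stand-in for r.content: a UTF-16 LE BOM followed by UTF-16 LE text.
sample_content = b"\xff\xfe" + u"match,player\r\n1,17424\r\n".encode("utf-16-le")


def utf8_digest(content, source_encoding="utf-16"):
    """Decode the raw bytes, re-encode as UTF-8, and return the SHA-256 hex digest.

    The 'utf-16' codec consumes a leading BOM, so it does not end up in the hash.
    """
    text = content.decode(source_encoding)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def raw_digest(content):
    """Hash the bytes exactly as received, BOM included, with no decoding at all."""
    return hashlib.sha256(content).hexdigest()


digest = utf8_digest(sample_content)
```

If the only goal is to detect whether a report changed, raw_digest may be enough, since hashlib works on bytes and the BOM is the same on every download. The decode step matters only when the hash has to match a BOM-free UTF-8 copy of the data.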

Ian Stapleton Cordasco 06 oct. 14 at 1:54

lisu · Accepted Answer · 2014-10-05T13:47:04+0000

Perhaps you can try omitting the BOM and encoding only the rest of the file to get the sha256? As in:

z = r.text[2:]
z.decode ...

The same logic should apply to hashes of files already saved to disk, but that shouldn't be a problem.
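One way the BOM-skipping idea might look in full, as a sketch under stated assumptions. It assumes the BOM is the two-byte UTF-16 LE mark (b'\xff\xfe') seen in the question, and that r.text was produced by decoding the body as ISO-8859-1, which maps every byte to one character, so encoding back with 'iso-8859-1' recovers the original bytes. The helper names strip_bom, digest_without_bom and needs_saving are hypothetical, and the sample bytes are made up.

```python
import codecs
import hashlib

BOM = codecs.BOM_UTF16_LE  # b'\xff\xfe', the two bytes seen at the start of r.text


def strip_bom(raw):
    """Return the bytes without a leading UTF-16 LE BOM, if one is present."""
    return raw[len(BOM):] if raw.startswith(BOM) else raw


def digest_without_bom(raw):
    """SHA-256 hex digest of the payload, ignoring a leading BOM."""
    return hashlib.sha256(strip_bom(raw)).hexdigest()


def needs_saving(raw, previous_digest):
    """True when the new download differs from the previously stored digest."""
    return digest_without_bom(raw) != previous_digest


# Made-up stand-in for the response body, then for r.text as requests decoded it.
raw = BOM + u"m,a\r\n".encode("utf-16-le")
text = raw.decode("iso-8859-1")

# The r.text[2:] idea: drop the two BOM characters, then recover the raw bytes.
recovered = text[2:].encode("iso-8859-1")
```

For files already saved to disk, the same digest_without_bom function can be applied to the bytes read with open(path, 'rb').read(), so that old and new hashes are computed the same way.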
