How to handle IncompleteRead in Python
I am trying to get some data from a website, but it keeps giving me an incomplete read. The data I am trying to get is a huge collection of nested links. I did some research on the internet and found it could be due to a server error (the connection closing before the response reaches its expected size). I also found a workaround for this on this link.
However, I am not sure how to apply it to my case. Below is the code I am working on:
import urllib2
import urlparse

import mechanize
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (BS3)

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)

for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me with this. Thanks.
What I found out in my case: sending an HTTP/1.0 request instead fixed the problem. I do that by adding this:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I execute the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP 1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP 1.0 instead of the default HTTP/1.1: HTTP 1.1 allows chunked responses, but for some reason the web server doesn't handle them properly, so forcing HTTP 1.0 avoids the problem.
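Put together for the URL from the question, a minimal sketch of the whole trick might look like this (the only additions are the imports and a try/finally that restores HTTP 1.1 afterwards):

import httplib
import urllib2

url = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"

# Force plain HTTP/1.0 responses (no chunked transfer encoding)
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

try:
    page = urllib2.urlopen(url).read()
finally:
    # Switch back to the default HTTP/1.1 for later requests
    httplib.HTTPConnection._http_vsn = 11
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'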
What worked for me was to catch IncompleteRead as an exception and collect the data read in each iteration by putting it in a loop, like below. (Note: I'm using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)
import http.client
import json
import urllib.request

try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            # Keep whatever was received before the connection dropped
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break
    return json.loads(responseJSON)
except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())
You can use requests instead of urllib2. requests is built on urllib3, so it rarely runs into problems. Wrap it in a loop so it retries up to 3 times, and it will be much more robust. You can use it like this:
import inspect
import sys
import time

import requests

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg:
            break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(
                inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
    time.sleep(10 * (i - 1))
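Applied to the question's page (and without this answer's self.crawling attribute), a hedged sketch of the same retry idea could be wrapped up like this; fetch_with_retries is a hypothetical name:

import sys
import time

import requests

def fetch_with_retries(url, attempts=3):
    # Hypothetical helper: retry a few times before giving up
    for i in range(1, attempts + 1):
        try:
            return requests.get(url, timeout=30).text
        except Exception as e:
            sys.stderr.write('Attempt %d for "%s" failed: %s\n' % (i, url, e))
            if i == attempts:
                raise
            time.sleep(10 * i)  # simple back-off between attempts

html = fetch_with_retries("http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands")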
I've tried all of these solutions and none of them worked for me. What actually worked was to use http.client (Python 3) directly instead of urllib:
import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')
This works fine every time, whereas urllib raised an IncompleteRead exception every single time.
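For the site from the question, a hedged sketch along the same lines would split the URL into host and path (the User-Agent header is optional and only mirrors what the question's code sets):

import http.client

# Split the question's URL into host and path for http.client
conn = http.client.HTTPConnection('shop.o2.co.uk')
conn.request('GET', '/mobile_phones/Pay_Monthly/smartphone/all_brands',
             headers={'User-Agent': 'Mozilla/5.0'})
resp = conn.getresponse()
page = resp.read().decode('utf-8')
conn.close()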