How to handle IncompleteRead in Python
I am trying to get some data from a website. However, it gives me an IncompleteRead error. The data I am trying to get is a huge collection of nested links. I did some research on the internet and found that it could be due to a server error (the response ending before reaching the expected size). I also found a workaround for this on this link.
However, I am not sure how to use it for my case. Below is the code I am working on:
import urllib2
import urlparse
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)
for tag in links:
    # follow each product link and read the child page
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    # collect the tariff duration and data allowance values
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me with this. Thanks.
The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement that whole patch, you can always just throw in a try/except block around the place where you read your links. For example:
try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    page = e.partial
For Python 3:
try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
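Applied to the code in the question (Python 2), the same pattern can wrap the inner read inside the loop too; this is just a sketch of where the try/except would go, not a full rewrite:

import httplib

try:
    page_child = br.response().read()
except httplib.IncompleteRead, e:
    page_child = e.partial  # keep whatever data was received before the error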
I found out that in my case, sending an HTTP/1.0 request and adding this fixed the problem:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I do the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunked transfers, but for some reason this web server doesn't, so we make the request with HTTP/1.0.
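For reference, here is a minimal Python 3 sketch of the same workaround (http.client is the Python 3 name for httplib; the URL is a placeholder):

import http.client
import urllib.request

# Force HTTP/1.0 so the server does not use chunked transfer encoding.
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

try:
    page = urllib.request.urlopen('http://example.com/').read()
finally:
    # Restore the default HTTP/1.1 for connections that support it.
    http.client.HTTPConnection._http_vsn = 11
    http.client.HTTPConnection._http_vsn_str = 'HTTP/1.1'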
What worked for me was catching the IncompleteRead as an exception and harvesting the data you managed to read in each iteration by putting it into a loop like below. (Note: I am using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)
try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break
    return json.loads(responseJSON)
except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())
You can use requests instead of urllib2. requests is based on urllib3, so it rarely runs into problems. Put it in a loop to try it 3 times, and it will be much more robust. You can use it like this:
import requests
import sys
import inspect
import time

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg: break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
        time.sleep(10 * (i - 1))
I found that my virus detector/firewall was causing this problem: specifically, the "Online Shield" component of AVG.
I have tried all of these solutions and none of them worked for me. What actually worked was using http.client (Python 3) instead of urllib:
import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')
This works fine every time, whereas with urllib it raised an IncompleteRead exception every time.
I am just adding one more exception handler to get past this problem, like so:
try:
    r = requests.get(url, timeout=timeout)
except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
    logging.error("There is an error: %s" % e)
This mostly happens when the site you are reading data from is overloaded. To fix the problem, just make the request again. That helped me:
try:
    r = requests.get(url, timeout=timeout)
except requests.exceptions.ChunkedEncodingError:
    r = requests.get(url, timeout=timeout)
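If a single retry is not enough, the same idea can go in a small loop (a sketch; max_attempts is a placeholder, and url/timeout are as in the snippet above):

import requests

max_attempts = 3
for attempt in range(max_attempts):
    try:
        r = requests.get(url, timeout=timeout)
        break  # success, stop retrying
    except requests.exceptions.ChunkedEncodingError:
        if attempt == max_attempts - 1:
            raise  # give up after the last attempt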