How to handle IncompleteRead in Python
I am trying to get some data from a website, but it keeps giving me an incomplete read. The data I am trying to get is a huge collection of nested links. I did some research on the internet and found it could be due to a server error (the connection closing before the response reaches its expected size). I also found a workaround for this on this link.
However, I am not sure how to apply it to my case. Below is the code I am working on:
import urllib2
import urlparse

import mechanize
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (BS3)

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)

for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me with this. Thanks.
What I found out in my case: sending an HTTP/1.0 request instead fixed the problem. I do that by adding this:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I execute the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP 1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP 1.0 instead of the default HTTP/1.1: HTTP 1.1 allows chunked responses, but for some reason the web server doesn't handle them properly, so forcing HTTP 1.0 avoids the problem.
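Put together for the URL from the question, a minimal sketch of the whole trick might look like this (the only additions are the imports and a try/finally that restores HTTP 1.1 afterwards):

import httplib
import urllib2

url = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"

# Force plain HTTP/1.0 responses (no chunked transfer encoding)
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

try:
    page = urllib2.urlopen(url).read()
finally:
    # Switch back to the default HTTP/1.1 for later requests
    httplib.HTTPConnection._http_vsn = 11
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'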
What worked for me was to catch IncompleteRead as an exception and collect the data read in each iteration by putting it in a loop, like below. (Note: I'm using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)
import http.client
import json
import urllib.request

try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            # Keep whatever was received before the connection dropped
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break
    return json.loads(responseJSON)
except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())
You can use requests instead of urllib2. requests is built on urllib3, so it rarely runs into problems. Wrap it in a loop so it retries up to 3 times, and it will be much more robust. You can use it like this:
import inspect
import sys
import time

import requests

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg:
            break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(
                inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
    time.sleep(10 * (i - 1))
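Applied to the question's page (and without this answer's self.crawling attribute), a hedged sketch of the same retry idea could be wrapped up like this; fetch_with_retries is a hypothetical name:

import sys
import time

import requests

def fetch_with_retries(url, attempts=3):
    # Hypothetical helper: retry a few times before giving up
    for i in range(1, attempts + 1):
        try:
            return requests.get(url, timeout=30).text
        except Exception as e:
            sys.stderr.write('Attempt %d for "%s" failed: %s\n' % (i, url, e))
            if i == attempts:
                raise
            time.sleep(10 * i)  # simple back-off between attempts

html = fetch_with_retries("http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands")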
I've tried all of these solutions and none of them worked for me. What actually worked was to use http.client (Python 3) directly instead of urllib:
import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')
This works fine every time, whereas urllib raised an IncompleteRead exception every single time.
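For the site from the question, a hedged sketch along the same lines would split the URL into host and path (the User-Agent header is optional and only mirrors what the question's code sets):

import http.client

# Split the question's URL into host and path for http.client
conn = http.client.HTTPConnection('shop.o2.co.uk')
conn.request('GET', '/mobile_phones/Pay_Monthly/smartphone/all_brands',
             headers={'User-Agent': 'Mozilla/5.0'})
resp = conn.getresponse()
page = resp.read().decode('utf-8')
conn.close()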