How can I get around the pageview restrictions when clearing web data using Python?
I am using Python to cleanse the US population zip data from http: /www.city-data.com via this directory: http://www.city-data.com/zipDir.html . The specific pages I'm trying to clean up are individual zip code pages with URLs like this: http://www.city-data.com/zips/01001.html . All of the individual zip pages I need to access have the same URL format, so my script just does for zip_code in the following range:
- Creates the url specified zip code
- Trying to get a response from the URL
- If (2), check the HTTP of this url
- If HTTP is 200, fetches HTML and dumps the data to the list
- If HTTP is not 200, submit and count the error (Invalid postcode / URL)
- If there is no response from url due to error please skip this post code and error counter
- At the end of the script, print counter options and timestamp
The problem is when I run the script and it works fine for ~ 500 postcodes, then suddenly quits and returns repeated timeout errors. My suspicion is that the site server is restricting pageviews coming from my IP address, preventing me from filling in the number of clips I need to make (all 100,000 potential zip codes).
My question is this: is there a way to obfuscate the site server, for example by using some kind of proxy, so that it doesn't restrict my page views and I can clear all the data I need?
Thanks for the help! Here is the code:
##POSTAL CODE POPULATION SCRAPER##
import requests
import re
import datetime
def zip_population_scrape():
"""
This script will scrape population data for postal codes in range
from city-data.com.
"""
postal_code_data = [['zip','population']] #list for storing scraped data
#Counters for keeping track:
total_scraped = 0
total_invalid = 0
errors = 0
for postal_code in range(1001,5000):
#This if statement is necessary because the postal code can't start
#with 0 in order for the for statement to interate successfully
if postal_code <10000:
postal_code_string = str(0)+str(postal_code)
else:
postal_code_string = str(postal_code)
#all postal code URLs have the same format on this site
url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
#try to get current URL
try:
response = requests.get(url, timeout = 5)
http = response.status_code
#print current for logging purposes
print url +" - HTTP: " + str(http)
#if valid webpage:
if http == 200:
#save html as text
html = response.text
#extra print statement for status updates
print "HTML ready"
#try to find two substrings in HTML text
#add the substring in between them to list w/ postal code
try:
found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
#add to # scraped counter
total_scraped +=1
postal_code_data.append([postal_code_string,found])
#print statement for logging
print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
#if substrings not found, try searching for others
#and doing the same as above
except AttributeError:
found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
total_scraped +=1
postal_code_data.append([postal_code_string,found])
print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
#if http =404, zip is not valid. Add to counter and print log
elif http == 404:
total_invalid +=1
print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
#other http codes: add to error counter and print log
else:
errors +=1
print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
#if get url fails by connnection error, add to error count & pass
except requests.exceptions.ConnectionError:
errors +=1
print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
pass
#if get url fails by timeout error, add to error count & pass
except requests.exceptions.Timeout:
errors +=1
print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
pass
#print final log/counter data, along with timestamp finished
now= datetime.datetime.now()
print now.strftime("%Y-%m-%d %H:%M")
print str(total_scraped) + " total zips scraped."
print str(total_invalid) + " total unavailable zips."
print str(errors) + " total errors."
source to share
No one has answered this question yet
Check out similar questions: