How can I get around pageview restrictions when scraping web data using Python?

I am using Python to scrape US zip-code population data from http://www.city-data.com via this directory: http://www.city-data.com/zipDir.html . The specific pages I'm trying to scrape are the individual zip code pages, with URLs like this: http://www.city-data.com/zips/01001.html . All of the individual zip pages I need to access have the same URL format, so my script simply loops over a range of zip codes and does the following:

  • Builds the URL for the given zip code
  • Tries to get a response from that URL
  • If a response comes back, checks its HTTP status code
  • If the status is 200, grabs the HTML and appends the scraped data to a list
  • If the status is not 200, logs and counts the error (invalid zip code / URL)
  • If there is no response at all due to an error, skips that zip code and increments the error counter
  • At the end of the script, prints the counter totals and a timestamp

The problem is that when I run the script it works fine for ~500 zip codes, then suddenly starts returning repeated timeout errors. My suspicion is that the site's server is restricting pageviews coming from my IP address, preventing me from completing the number of scrapes I need to make (all ~100,000 potential zip codes).
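One idea I've had is to simply slow the script down and retry timeouts with a growing delay between attempts. A minimal sketch of what I mean (the retry count and delay values are guesses on my part, not something I've tested against this site):

    import time
    import requests

    #naive throttle/retry: wait between attempts, doubling the delay
    #each time a request times out (values here are guesses)
    def get_with_backoff(url, retries=3, delay=1.0):
        for attempt in range(retries):
            try:
                return requests.get(url, timeout=5)
            except requests.exceptions.Timeout:
                time.sleep(delay * (2 ** attempt))
        return None  #all attempts timed out

But even with a delay I'm worried the server will still cut me off eventually, which is why I'm asking about the next option.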

My question is this: is there a way to fool the site's server, for example by using a proxy of some kind, so that it doesn't restrict my pageviews and I can scrape all the data I need?
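For example, I understand requests can route traffic through a proxy via its proxies argument; a minimal sketch, where the proxy addresses are placeholders from the requests documentation, not real endpoints I have access to:

    import requests

    #placeholder proxy addresses; a real proxy host/port would go here
    proxies = {
        'http': 'http://10.10.1.10:3128',
        'https': 'http://10.10.1.10:1080',
    }

    response = requests.get('http://www.city-data.com/zips/01001.html',
                            proxies=proxies, timeout=5)
    print response.status_code

Presumably I would need a pool of real proxies to rotate through for this to actually help, which is part of what I'm asking about.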

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests
import re
import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range 
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0 
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #zip codes are five digits and may start with 0, but the loop
        #variable is an int, so zero-pad it back out to five characters
        postal_code_string = str(postal_code).zfill(5)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL 
        try: 
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print the current URL and status code for logging purposes
            print url + " - HTTP: " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:            

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above    
                except AttributeError:
                    #fall back to the 2010 figure; guard against a second
                    #AttributeError if neither phrase is present in the page
                    match = re.search('population in 2010:</b> (.*)<br>', html)
                    if match:
                        found = match.group(1)
                        total_scraped += 1
                        postal_code_data.append([postal_code_string, found])
                        print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http == 404, zip is not valid. Add to counter and print log
            elif http == 404: 
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if the GET fails with a connection error, add to error count & continue
        except requests.exceptions.ConnectionError:
            errors += 1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."

        #if the GET fails with a timeout, add to error count & continue
        except requests.exceptions.Timeout:
            errors += 1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."


    #print final log/counter data, along with a finish timestamp
    now = datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total invalid zips."
    print str(errors) + " total errors."

zip_population_scrape()
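As an aside, postal_code_data currently only lives in memory. A minimal sketch of what I'd add at the end of the function to persist it (the filename is arbitrary):

    import csv

    #dump the scraped [zip, population] rows to disk; 'wb' mode is
    #what the csv module expects on Python 2
    with open('zip_population.csv', 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(postal_code_data)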

      
