Python robotparser module will not load 'robots.txt'

I am writing a very simple web crawler and trying to parse 'robots.txt' files. I found the robotparser module in the standard library, which should do exactly that. I am using Python 2.7.2. Unfortunately, my code won't download the 'robots.txt' files correctly, and I can't figure out why.

Here is the relevant code snippet:

from urlparse import urlparse, urljoin
import robotparser

def get_all_links(page, url):
    links = []
    # Build the site root and the corresponding robots.txt URL
    page_url = urlparse(url)
    base = page_url[0] + '://' + page_url[1]
    robots_url = urljoin(base, '/robots.txt')
    # Download and parse robots.txt for this site
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    for link in page.find_all('a'):
        link_url = link.get('href')
        print "Found a link: ", link_url
        if not rp.can_fetch('*', link_url):
            print "Page off limits!"
            continue  # skip disallowed links
        links.append(link_url)
    return links

Here, page is the object parsed by BeautifulSoup, and url is the URL stored as a string. The parser reads in a blank 'robots.txt' file instead of the one at the specified URL, and returns True for every can_fetch() query. It looks like it is either not opening the URL or not reading the text file.

I also tried this in the interactive interpreter. This is what happens when I use the same syntax as above.

Python 2.7.2 (default, Aug 18 2011, 18:04:39) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> url = 'http://www.udacity-forums.com/robots.txt'
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url(url)
>>> rp.read()
>>> print rp

>>> 

The line print rp should print the contents of the 'robots.txt' file, but it comes back blank. Even more frustratingly, these examples work fine as written, but fail when I try my own URLs. I'm new to Python and I can't figure out what's going wrong. As far as I can tell, I'm using the module the same way as the documentation and examples. Thanks for any help!

UPDATE 1: Here are a few more lines from the interpreter, in case print rp was not a good way to check whether 'robots.txt' had been read in. The path, host and url attributes are correct, but the entries from 'robots.txt' still have not been read in.

>>> rp
<robotparser.RobotFileParser instance at 0x1004debd8>
>>> dir(rp)
['__doc__', '__init__', '__module__', '__str__', '_add_entry', 'allow_all', 'can_fetch', 'default_entry', 'disallow_all', 'entries', 'errcode', 'host', 'last_checked', 'modified', 'mtime', 'parse', 'path', 'read', 'set_url', 'url']
>>> rp.path
'/robots.txt'
>>> rp.host
'www.udacity-forums.com'
>>> rp.entries
[]
>>> rp.url
'http://www.udacity-forums.com/robots.txt'
>>> 

UPDATE 2: I solved this issue by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser can't handle certain additions to the 'robots.txt' specification, such as Sitemap, and has trouble with blank lines. It will read in files from, for example, Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I would still appreciate it if someone smarter than me could confirm or explain this!



2 answers


I solved this problem by using this external robots.txt parsing library. (But it doesn't answer the original question!) After spending some more time in the terminal, my best guess is that robotparser can't handle certain additions to the 'robots.txt' specification, such as Sitemap, and has trouble with blank lines. It will read in files from, for example, Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I would still appreciate it if someone smarter than me could confirm or explain this!
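
If the Sitemap/blank-line theory is right, a possible workaround is to download the file yourself and strip those lines before handing the text to robotparser. This is only a hedged sketch of that idea, not something tested in the answer; read_robots() is a helper name I made up.

# Hypothetical workaround sketch (Python 2): drop blank lines and Sitemap
# directives, which the theory above blames for the empty parse, then let
# robotparser handle the rest.
import urllib2
import robotparser

def read_robots(url):
    raw = urllib2.urlopen(url).read()
    lines = [line for line in raw.splitlines()
             if line.strip() and not line.lower().startswith('sitemap')]
    rp = robotparser.RobotFileParser()
    rp.set_url(url)
    rp.parse(lines)
    return rp

rp = read_robots('http://www.udacity-forums.com/robots.txt')
print rp.entries
print rp.can_fetch('*', '/')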





Another solution is to use the reppy module:

pip install reppy

Here are some examples:



In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it is allowed
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it is not allowed
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it is allowed
Out[7]: True

In [8]: x.refresh() # Re-fetch robots.txt, in case it has changed

In [9]: x.ttl
Out[9]: 3721.3556718826294

Voila!
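
For completeness, here is a rough sketch of how reppy could slot into the question's get_all_links() function, using only the fetch() and allowed() calls shown above. The agent string 'my-crawler' and the path handling are my own assumptions, not part of the original answer.

# Sketch only: swaps reppy in for robotparser in the question's crawler,
# using the reppy.fetch() / allowed() API demonstrated above.
from urlparse import urlparse, urljoin
import reppy

def get_all_links(page, url):
    links = []
    parts = urlparse(url)
    base = parts.scheme + '://' + parts.netloc
    # Fetch and parse robots.txt once per page, as in the examples above
    robots = reppy.fetch(urljoin(base, '/robots.txt'))
    for link in page.find_all('a'):
        link_url = link.get('href')
        # The examples above pass paths to allowed(), so reduce each link
        # to its path component first (an assumption on my part).
        path = urlparse(urljoin(base, link_url)).path or '/'
        if not robots.allowed(path, 'my-crawler'):
            print "Page off limits!"
            continue
        links.append(link_url)
    return links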
