Python robotparser module won't load "robots.txt"
I am writing a very simple web crawler and trying to parse 'robots.txt' files. I found the robotparser module in the standard library, which should do exactly that. I am using Python 2.7.2. Unfortunately, my code won't download the 'robots.txt' files correctly, and I can't figure out why.
Here is the relevant code snippet:
from urlparse import urlparse, urljoin
import robotparser

def get_all_links(page, url):
    links = []
    page_url = urlparse(url)
    base = page_url[0] + '://' + page_url[1]
    robots_url = urljoin(base, '/robots.txt')
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    for link in page.find_all('a'):
        link_url = link.get('href')
        print "Found a link: ", link_url
        if not rp.can_fetch('*', link_url):
            print "Page off limits!"
            pass
Here, page is the object parsed by BeautifulSoup, and url is the page's URL stored as a string. The parser reads in an empty 'robots.txt' file rather than the one at the specified URL, and returns True for every can_fetch() query. It looks like it is not opening the URL or reading the text file at all.
I also tried this in the interactive interpreter, using the same syntax as above. This is what happens:
Python 2.7.2 (default, Aug 18 2011, 18:04:39)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> url = 'http://www.udacity-forums.com/robots.txt'
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url(url)
>>> rp.read()
>>> print rp
>>>
The line print rp should print the contents of the 'robots.txt' file, but it prints nothing. Even more frustrating, the documentation examples work fine as written, but fail when I try my own URL. I'm new to Python and I can't figure out what's going wrong. As far as I can tell, I'm using the module the same way as the documentation and examples do. Thanks for any help!
UPDATE 1: Here are some more lines from the interpreter, in case print rp wasn't a good way of checking whether 'robots.txt' had been read. path, host, and url are set correctly, but the entries from 'robots.txt' still haven't been read.
>>> rp
<robotparser.RobotFileParser instance at 0x1004debd8>
>>> dir(rp)
['__doc__', '__init__', '__module__', '__str__', '_add_entry', 'allow_all', 'can_fetch', 'default_entry', 'disallow_all', 'entries', 'errcode', 'host', 'last_checked', 'modified', 'mtime', 'parse', 'path', 'read', 'set_url', 'url']
>>> rp.path
'/robots.txt'
>>> rp.host
'www.udacity-forums.com'
>>> rp.entries
[]
>>> rp.url
'http://www.udacity-forums.com/robots.txt'
>>>
UPDATE 2: I solved this issue by using this external library to parse 'robots.txt' files. (But I didn't answer the original question!) After spending some more time in the terminal, I think robotparser can't handle certain additions to the 'robots.txt' spec, such as Sitemap, and has trouble with blank lines. It will read in files like Stack Overflow's and Python.org's, but not Google's, YouTube's, or my original Udacity file, which includes Sitemap statements and blank lines. I would still appreciate it if someone smarter than me could confirm or explain this!
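The blank-line hypothesis can be probed directly with parse(), no network needed. A sketch, not from the original post, using Python 3's urllib.robotparser (whose parse() state machine closely mirrors the 2.x module) and a placeholder example.com URL: a blank line between a User-agent line and its rules resets the parser to its start state, so the rules that follow are silently dropped.

```python
from urllib.robotparser import RobotFileParser

def fetchable(lines, url):
    # Parse the given robots.txt lines and test url for user agent '*'.
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch('*', url)

url = 'http://example.com/private/page'

# Contiguous record: the Disallow rule is applied.
print(fetchable(['User-agent: *', 'Disallow: /private/'], url))      # False

# Blank line inside the record: the parser resets before seeing Disallow,
# so the rule is dropped and every URL is treated as allowed.
print(fetchable(['User-agent: *', '', 'Disallow: /private/'], url))  # True
```

This shows the parser is sensitive to where blank lines fall within a record, though it may not fully explain every failing site (blank lines *between* records are valid and handled).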
A solution is to use the reppy module:
pip install reppy
Here are some examples:
In [1]: import reppy
In [2]: x = reppy.fetch("http://google.com/robots.txt")
In [3]: x.atts
Out[3]:
{'agents': {'*': <reppy.agent at 0x1fd9610>},
'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
'http://www.google.com/hostednews/sitemap_index.xml',
'http://www.google.com/sitemaps_webmasters.xml',
'http://www.google.com/ventures/sitemap_ventures.xml',
'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}
In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it is allowed.
Out[4]: True
In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it is not allowed.
Out[5]: False
In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it is allowed.
Out[7]: True
In [8]: x.refresh() # Re-fetch robots.txt, in case it has changed.
In [9]: x.ttl
Out[9]: 3721.3556718826294
Voila!