Get links from two levels of sitemap.xml using Scrapy
I need to collect post URLs from a sitemap.xml file that acts as a sitemap index, i.e. it points to other sitemaps. My spider looks like this, and it works fine when pointed directly at one of the sitemaps listed in the main sitemap.
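from scrapy.spiders import SitemapSpider

# PostItem is the project's own Item class (its definition is not shown here).

class MySpider(SitemapSpider):
    name = "example"
    allowed_domains = ['www.example.com']
    sitemap_urls = ["http://sitemaps.example.com/post-sitemap1.xml"]
    # Raw string so that \d and \w reach the regex engine unchanged.
    sitemap_rules = [(r'\d{4}/\d{2}/\d{2}/\w+', 'parse_post')]

    def parse_post(self, response):
        item = PostItem()
        item['url'] = response.url
        return item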
How can I get the spider to follow the sitemaps listed in the main sitemap? The main sitemap is as follows:
<sitemapindex>
  <sitemap>
    <loc>http://sitemaps.example.com/sitemap_recent.xml</loc>
    <lastmod>2014-09-14T02:15:32-04:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://sitemaps.example.com/post-sitemap1.xml</loc>
    <lastmod>2014-09-14T02:15:32-04:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://sitemaps.example.com/post-sitemap2.xml</loc>
    <lastmod>2014-02-10T22:50:43-05:00</lastmod>
  </sitemap>
</sitemapindex>