Get links from two levels of sitemap.xml using Scrapy

I need to collect post URLs from a sitemap.xml file. The main sitemap.xml is a sitemap index that points to other sitemaps. My spider, shown below, works fine when pointed directly at one of the sitemaps listed in the main sitemap.

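from scrapy.spiders import SitemapSpider

# PostItem is the project's own Item class with a 'url' field (definition not shown).

class MySpider(SitemapSpider):
    name = "example"
    allowed_domains = ['www.example.com']

    sitemap_urls = ["http://sitemaps.example.com/post-sitemap1.xml"]
    # Raw string so the backslashes reach the regex engine unchanged.
    sitemap_rules = [(r'\d{4}/\d{2}/\d{2}/\w+', 'parse_post')]

    def parse_post(self, response):
        # Record the URL of every page that matched the rule above.
        item = PostItem()
        item['url'] = response.url
        return item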
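For reference, here is a small standalone check of what the pattern in sitemap_rules matches. It is only a sketch: the three URLs are made up for illustration, and it assumes the rule is applied as a regex search against each URL found in the sitemap, which is how I understand SitemapSpider to use it.

```python
import re

# Same pattern as in sitemap_rules, written as a raw string.
POST_URL = re.compile(r'\d{4}/\d{2}/\d{2}/\w+')

# Hypothetical URLs, invented for this example only.
candidates = [
    "http://www.example.com/2014/09/14/some_post",
    "http://www.example.com/about",
    "http://www.example.com/2014/09/",
]

# search() matches anywhere in the string, so the scheme and host are ignored.
matched = [url for url in candidates if POST_URL.search(url)]
print(matched)
```

Only the URL with a full year/month/day path followed by a slug passes, so archive pages and static pages would not reach parse_post.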

      

How can I get the spider to follow the sitemaps listed in the main sitemap? The main sitemap is as follows:

<sitemapindex>
    <sitemap>
        <loc>http://sitemaps.example.com/sitemap_recent.xml</loc>
        <lastmod>2014-09-14T02:15:32-04:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://sitemaps.example.com/post-sitemap1.xml</loc>
        <lastmod>2014-09-14T02:15:32-04:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://sitemaps.example.com/post-sitemap2.xml</loc>
        <lastmod>2014-02-10T22:50:43-05:00</lastmod>
    </sitemap>
</sitemapindex>
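To make the two-level structure concrete, here is a minimal standard-library sketch of the first level: it reads the index above and keeps only the child sitemaps whose URL matches a pattern. The function name child_sitemaps and the FOLLOW pattern are my own assumptions, not part of Scrapy. As far as I know, SitemapSpider performs this step itself when sitemap_urls points at the index, and its sitemap_follow attribute takes a list of regexes that plays the role of FOLLOW here. It may also be worth checking that allowed_domains covers sitemaps.example.com, since the child sitemaps live on a different host than www.example.com.

```python
import re
import xml.etree.ElementTree as ET

# The sitemap index from the question, cleaned up so it is well-formed.
INDEX_XML = """<sitemapindex>
    <sitemap>
        <loc>http://sitemaps.example.com/sitemap_recent.xml</loc>
        <lastmod>2014-09-14T02:15:32-04:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://sitemaps.example.com/post-sitemap1.xml</loc>
        <lastmod>2014-09-14T02:15:32-04:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://sitemaps.example.com/post-sitemap2.xml</loc>
        <lastmod>2014-02-10T22:50:43-05:00</lastmod>
    </sitemap>
</sitemapindex>"""

# Assumed filter: follow only the post sitemaps, skip sitemap_recent.xml.
FOLLOW = re.compile(r'/post-sitemap\d+\.xml$')


def child_sitemaps(index_xml, follow=FOLLOW):
    """Return the <loc> URLs in a sitemap index that match `follow`."""
    root = ET.fromstring(index_xml)
    urls = []
    for element in root.iter():
        # Drop any '{namespace}' prefix so namespaced files work too.
        tag = element.tag.rsplit('}', 1)[-1]
        if tag == 'loc' and element.text:
            url = element.text.strip()
            if follow.search(url):
                urls.append(url)
    return urls


print(child_sitemaps(INDEX_XML))
```

The second level would repeat the same idea on each returned sitemap, this time matching the page URLs against the pattern from sitemap_rules.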

      
