Using multiple start_urls in CrawlSpider

Question

I am using CrawlSpider to crawl a website. I have multiple start URLs, and each one has a "next" link pointing to another, similar page. I use rules to follow the "next" link:

rules = (
    Rule(SgmlLinkExtractor(allow=('/',),
                           restrict_xpaths=('//span[@class="next"]')),
         callback='parse_item',
         follow=True),
)

When there is only one URL in start_urls, everything works. However, when start_urls contains many URLs, I get "Ignoring response <404 url>: HTTP status code is not handled or not allowed".

How can I start from the first URL in start_urls, follow all of its "next" links, and only then move on to the second URL in start_urls?

Here is my code:

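# Import paths as in Scrapy 0.24 (an assumption; the original post omitted its imports).
import codecs

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector


class DoubanSpider(CrawlSpider):
    name = "doubanBook"
    allowed_domains = ["book.douban.com"]

    # category.txt holds one start URL per line.
    category = codecs.open("category.txt", "r", encoding="utf-8")

    start_urls = []
    for line in category:
        line = line.strip()
        start_urls.append(line)
    category.close()

    # Follow the "next" pagination link and parse every page it leads to.
    rules = (
        Rule(SgmlLinkExtractor(allow=('/',),
                               restrict_xpaths=('//span[@class="next"]')),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        out = open("alllink.txt", "a")
        sites = sel.xpath('//ul/li/div[@class="info"]/h2')
        for site in sites:
            href = site.xpath('a/@href').extract()[0]
            title = site.xpath('a/@title').extract()[0]
            out.write("***")
        out.close()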
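
A side note on how start_urls is built: the loop in the spider appends every line of category.txt as-is, so blank or repeated lines would end up in start_urls too. The following is only a minimal sketch of a stricter loader, not a diagnosis of the 404 responses. The helper name load_start_urls is hypothetical and is not part of Scrapy or of the original post.

```python
import codecs


def load_start_urls(path, encoding="utf-8"):
    """Return the URLs listed one per line in `path`.

    Hypothetical helper: skips blank lines and repeated URLs,
    and keeps the order in which URLs first appear.
    """
    urls = []
    seen = set()
    # The with-block closes the file even if reading fails.
    with codecs.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            url = line.strip()
            if not url or url in seen:
                continue
            seen.add(url)
            urls.append(url)
    return urls
```

Under that assumption, the class body could use start_urls = load_start_urls("category.txt") in place of the explicit loop.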

python scrapy

SilentCanon 01 nov. 14 at 4:55

No one has answered this question yet
