How to create rules for CrawlSpider using Scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from manga.items import MangaItem

class MangaHere(BaseSpider):
    name = "mangah"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            rating = site.select("p/span/text()").extract()
            if rating > 4.5:
                item = MangaItem()
                item["title"] = site.select("div/a/text()").extract()
                item["desc"] = site.select("p[2]/text()").extract()
                item["link"] = site.select("div/a/@href").extract()
                item["rate"] = site.select("p/span/text()").extract()
                items.append(item)
        return items
My goal is to crawl www.mangahere.com/seinen or whatever else on this site. I want to go through each page and collect the books that have a rating above 4.5. I started out with a BaseSpider and tried copying and reading the Scrapy tutorial, but it pretty much went over my head. I'm here to ask what I need to do to create my rules, and how. I also can't get my condition to work: the code either returns only the first item and stops regardless of the condition, or it grabs everything, again regardless of the condition. I know it's probably pretty flawed code, but I'm still trying to learn. Feel free to touch up the code or offer other advice.
Strictly speaking, this doesn't answer the question, since my code uses BaseSpider instead of CrawlSpider, but it fulfills the OP's requirement, so...

Notes:

- Since not all pagination links are available (you get the first nine and then the last two), I took a somewhat hacktastic approach. Using the first response in the parse callback, I search for a link with the class "next" (there is only one, so take a look at which link that is) and then find its immediately preceding sibling. This gives me a handle on the total number of pages in the seinen category (currently 45).
- Next, we yield a Request object for the first page, to be handled by the parse_item callback.
- Then, given that we have determined there are 45 pages in total, we yield a whole series of Request objects for "./seinen/2.htm" all the way up to "./seinen/45.htm".
- Since rating is a list and its values are floats (which I should have realized from the 4.5 condition), the way to fix the error is to iterate over the list of ratings and cast each element to a float (see the short illustration after this list).
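To make the list-vs-float problem concrete, here is a tiny illustration. It assumes Python 2, which the xrange and urlparse usage in the code below already implies, and the rating value shown is made up:

# Python 2 session -- the literal rating value is just an example
rating = [u'4.83']              # what .extract() returns: a list of unicode strings
print rating > 4.5              # True for every page, because Python 2 orders
                                # mismatched types by type name, not by value
print float(rating[0]) > 4.5    # compares the actual number: True only above 4.5

That is why the original condition appears to "grab everything regardless of the condition": the comparison never looks at the rating value at all.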
Anyway, take a look at the following code and see if it makes sense. In theory, you should be able to easily extend this code to scrape multiple categories, although that remains as an exercise for the OP. :)
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from tutorial.items import MangaItem
from urlparse import urlparse

class MangaHere(BaseSpider):
    name = "mangah2"
    start_urls = ["http://www.mangahere.com/seinen/"]
    allowed_domains = ["mangahere.com"]

    def parse(self, response):
        # get the index depth, i.e. the total number of pages for the category
        hxs = HtmlXPathSelector(response)
        next_link = hxs.select('//a[@class="next"]')
        index_depth = int(next_link.select('preceding-sibling::a[1]/text()').extract()[0])

        # create a request for the first page
        url = urlparse("http://www.mangahere.com/seinen/")
        yield Request(url.geturl(), callback=self.parse_item)

        # create a request for each subsequent page in the form "./seinen/x.htm",
        # up to and including the last page
        for x in xrange(2, index_depth + 1):
            pageURL = "http://www.mangahere.com/seinen/%s.htm" % x
            url = urlparse(pageURL)
            yield Request(url.geturl(), callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            # ratings come back as a list of strings, so cast each one to float
            rating = site.select("p/span/text()").extract()
            for r in rating:
                if float(r) > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()").extract()
                    item["desc"] = site.select("p[2]/text()").extract()
                    item["link"] = site.select("div/a/@href").extract()
                    item["rate"] = site.select("p/span/text()").extract()
                    items.append(item)
        return items
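Since the question title asks specifically about CrawlSpider rules, here is a rough sketch of how the same scrape could be expressed with a Rule and a link extractor. It is untested: the spider name, the regex, and the assumption that every pagination link matches /seinen/<number>.htm are mine, not the OP's; the item-extraction body is taken from the code above, and the import path for MangaItem should match whichever project the item class lives in.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from manga.items import MangaItem   # adjust to your project

class MangaHereCrawler(CrawlSpider):
    name = "mangah_crawl"           # hypothetical spider name
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    rules = (
        # Follow every pagination link of the form /seinen/<number>.htm and
        # hand each page to parse_item; follow=True lets the crawler keep
        # discovering later pages from the pagination block on each page.
        Rule(SgmlLinkExtractor(allow=(r'/seinen/\d+\.htm',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not run rule callbacks on the start URL itself,
        # so scrape the first /seinen/ page here.
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for site in hxs.select('//ul/li/div'):
            for r in site.select("p/span/text()").extract():
                if float(r) > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()").extract()
                    item["desc"] = site.select("p[2]/text()").extract()
                    item["link"] = site.select("div/a/@href").extract()
                    item["rate"] = site.select("p/span/text()").extract()
                    items.append(item)
        return items

Note that CrawlSpider reserves parse() for its own machinery, so the callback must have a different name (parse_item here), and the rules only fire on pages the link extractor discovers, which is why the start page is handled through parse_start_url.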