How to create rules for CrawlSpider using Scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from manga.items import MangaItem

class MangaHere(BaseSpider):
    name = "mangah"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            rating = site.select("p/span/text()").extract()
            if rating > 4.5:
                item = MangaItem()
                item["title"] = site.select("div/a/text()").extract()
                item["desc"] = site.select("p[2]/text()").extract()
                item["link"] = site.select("div/a/@href").extract()
                item["rate"] = site.select("p/span/text()").extract()
                items.append(item)
        return items
My goal is to crawl www.mangahere.com/seinen or whatever else on this site. I want to go through each page and collect the books that have a rating above 4.5. I started out with a BaseSpider and tried copying and reading the Scrapy tutorial, but it pretty much went over my head. I'm here to ask what I need to do to create my rules, and how. I also can't get my condition to work: the code either returns only the first item and stops regardless of the condition, or it grabs everything, again regardless of the condition. I know it's probably pretty flawed code, but I'm still trying to learn. Feel free to touch up the code or offer other advice.
Strictly speaking, this doesn't answer the question, since my code uses BaseSpider instead of CrawlSpider, but it fulfills the OP's requirement, so...

Notes:

- Since not all pagination links are available (you get the first nine and then the last two), I took a somewhat hacktastic approach. Using the first response in the parse callback, I search for a link with the class "next" (there is only one, so take a look at which link that is) and then find its immediately preceding sibling. This gives me a handle on the total number of pages in the seinen category (currently 45).
- Next, we yield a Request object for the first page, to be handled by the parse_item callback.
- Then, given that we have determined there are 45 pages in total, we yield a whole series of Request objects for "./seinen/2.htm" all the way up to "./seinen/45.htm".
- Since rating is a list and its values are floats (which I should have realized from the 4.5 condition), the way to fix the error is to iterate over the list of ratings and cast each element to a float (see the short illustration after this list).
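To make the list-vs-float problem concrete, here is a tiny illustration. It assumes Python 2, which the xrange and urlparse usage in the code below already implies, and the rating value shown is made up:

# Python 2 session -- the literal rating value is just an example
rating = [u'4.83']              # what .extract() returns: a list of unicode strings
print rating > 4.5              # True for every page, because Python 2 orders
                                # mismatched types by type name, not by value
print float(rating[0]) > 4.5    # compares the actual number: True only above 4.5

That is why the original condition appears to "grab everything regardless of the condition": the comparison never looks at the rating value at all.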
Anyway, take a look at the following code and see if it makes sense. In theory, you should be able to easily extend this code to scrape multiple categories, although that remains as an exercise for the OP. :)
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from tutorial.items import MangaItem
from urlparse import urlparse

class MangaHere(BaseSpider):
    name = "mangah2"
    start_urls = ["http://www.mangahere.com/seinen/"]
    allowed_domains = ["mangahere.com"]

    def parse(self, response):
        # get the index depth, i.e. the total number of pages for the category
        hxs = HtmlXPathSelector(response)
        next_link = hxs.select('//a[@class="next"]')
        index_depth = int(next_link.select('preceding-sibling::a[1]/text()').extract()[0])

        # create a request for the first page
        url = urlparse("http://www.mangahere.com/seinen/")
        yield Request(url.geturl(), callback=self.parse_item)

        # create a request for each subsequent page in the form "./seinen/x.htm",
        # up to and including the last page
        for x in xrange(2, index_depth + 1):
            pageURL = "http://www.mangahere.com/seinen/%s.htm" % x
            url = urlparse(pageURL)
            yield Request(url.geturl(), callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            # ratings come back as a list of strings, so cast each one to float
            rating = site.select("p/span/text()").extract()
            for r in rating:
                if float(r) > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()").extract()
                    item["desc"] = site.select("p[2]/text()").extract()
                    item["link"] = site.select("div/a/@href").extract()
                    item["rate"] = site.select("p/span/text()").extract()
                    items.append(item)
        return items
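Since the question title asks specifically about CrawlSpider rules, here is a rough sketch of how the same scrape could be expressed with a Rule and a link extractor. It is untested: the spider name, the regex, and the assumption that every pagination link matches /seinen/<number>.htm are mine, not the OP's; the item-extraction body is taken from the code above, and the import path for MangaItem should match whichever project the item class lives in.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from manga.items import MangaItem   # adjust to your project

class MangaHereCrawler(CrawlSpider):
    name = "mangah_crawl"           # hypothetical spider name
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    rules = (
        # Follow every pagination link of the form /seinen/<number>.htm and
        # hand each page to parse_item; follow=True lets the crawler keep
        # discovering later pages from the pagination block on each page.
        Rule(SgmlLinkExtractor(allow=(r'/seinen/\d+\.htm',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not run rule callbacks on the start URL itself,
        # so scrape the first /seinen/ page here.
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for site in hxs.select('//ul/li/div'):
            for r in site.select("p/span/text()").extract():
                if float(r) > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()").extract()
                    item["desc"] = site.select("p[2]/text()").extract()
                    item["link"] = site.select("div/a/@href").extract()
                    item["rate"] = site.select("p/span/text()").extract()
                    items.append(item)
        return items

Note that CrawlSpider reserves parse() for its own machinery, so the callback must have a different name (parse_item here), and the rules only fire on pages the link extractor discovers, which is why the start page is handled through parse_start_url.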