Need help with this regex

Question

I am new to Scrapy and am trying to crawl a site with CrawlSpider, following the Next button recursively. It doesn't work: the spider only scrapes the landing page and never goes to the next page. I think the problem is in the regex, but I've checked it many times and can't find the error.

# -*- coding: utf-8 -*-

start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']

rules = (
    Rule(LinkExtractor(allow = "/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"), callback = 'parse_start_url', follow = True),
)

def parse_start_url(self, response):
    sel = Selector(response)
    contents = sel.xpath('//p')
    for content in contents:
        item = BedbugsItem()
        item['pageContent'] = content.xpath('text()').extract()
        self.items.append(item)
    return self.items

python regex scrapy

PoppinDouble 30 oct. '14 at 8:18

1 answer

elias · Accepted Answer · 2014-10-30T09:13:36+0000

Use XPath to select the Next link instead of matching its URL with a regex:

rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths=[
                "//div[@class='pagination']//a[contains(., 'Next')]"
            ]
        ),
        callback='parse_start_url',
        follow=True,
    ),
)
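
One possible reason the original regex never matches is that it hard-codes the `_ylt` value. The thread does not confirm this, so treat it as an assumption: if that value differs from one page load to the next, the pattern cannot match the Next link. The sketch below uses only the standard `re` module and two made-up hrefs to show the difference between the strict pattern and a looser one that accepts any `_ylt` value. `LOOSE`, `matches` and both sample URLs are hypothetical names introduced for illustration.

```python
import re

# Pattern from the question: the `_ylt` value is hard-coded.
STRICT = r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"

# Looser variant (an assumption, not from the thread):
# accept any `_ylt` value, or no `_ylt` segment at all.
LOOSE = r"/merchantrating/(;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+"

# Hypothetical hrefs, invented for illustration only.
same_token = "/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F?mid=13652&sort=1&start=10"
other_token = "/merchantrating/;_ylt=SomeOtherToken123?mid=13652&sort=1&start=10"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for name, pattern in (("STRICT", STRICT), ("LOOSE", LOOSE)):
        for url in (same_token, other_token):
            print(name, matches(pattern, url), url)
```

If the looser pattern fits the real links, it could be passed as `allow` in the original rule. The XPath approach above avoids the question altogether because it does not depend on the URL format.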
