Scrapy crawl spider ajax pagination

I am trying to crawl a link that uses an AJAX call for pagination. The link I am trying to crawl is http://www.demo.com, and in the spider's .py file I provided this code for the XPath restriction and encoding:

# -*- coding: utf-8 -*-
import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem

class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    # override parse_start_url so the spider also parses the first page
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        #//div[@class="showMoreCars hide"]/a
        #.//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract() 
        print '********************************************title**********************************************',title
        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************',urls

        finalurls = []       

        for url in urls:
            print '---------url-------',url
            finalurls.append(url)          

        item['urls'] = finalurls
        return item

      

My items.py file contains

from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()

      

However, I am not getting the expected output: the spider does not crawl all of the pages.



I hope the below code helps.

somespider.py

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.selector import Selector
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
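        # connect to the remote Selenium server on the default port (4444),
        # using the HtmlUnit browser with JavaScript enabled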
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            try:
                # find and click the "show more cars" link; once it is no longer
                # present, find_element_by_xpath raises and pagination stops
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')

                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))

                item['urls'] = finalurls

            except:
                break

        self.driver.close()
        return item

      

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

      



Note: You need to start the Selenium RC server, because HTMLUNITWITHJS only works through Selenium RC when using Python.

Start the Selenium RC server by issuing the command:

java -jar selenium-server-standalone-2.44.0.jar

      

Start your spider with the command:

scrapy crawl domainurls -o someoutput.json

      



You can check in your browser how the requests are being made.

Behind the scenes, just after you click the "show more cars" button, your browser requests JSON data to serve the next page. You can take advantage of this fact and access the JSON data directly, without having to work with a JavaScript engine such as Selenium or PhantomJS.

In your case, as a first step you should simulate a user scrolling through the page given by the start_url, while profiling your network requests to discover the endpoint the browser uses for the JSON request. To discover this endpoint, browsers generally provide an XHR (XMLHttpRequest) section in their developer/network panel, as in Safari, where you can navigate through all the resources/endpoints used to request data.

Once you find this endpoint, it is a simple task: give your spider the endpoint you just discovered as its start_url, process and navigate through the JSON it returns, and request the next page as needed.

PS: I checked it for you; the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2

In this case my browser requested the second page, as you can see from the pn parameter. It is important that you set some header parameters before sending the request. In your case I noticed these headers:



Accept: text/plain, */*; q=0.01

Referer: http://www.carwale.com/used/cars-in-trichy/

X-Requested-With: XMLHttpRequest

sourceid: 1

User-Agent: Mozilla/5.0 ...
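
For illustration, here is a minimal sketch of that approach. It assumes the endpoint returns JSON whose listing entries live under a hypothetical stocks key with url fields; those names are assumptions, not the real API, so adapt them to whatever the actual response contains.

# -*- coding: utf-8 -*-
import json
import scrapy


class CarwaleApiSpider(scrapy.Spider):
    # Sketch only: the JSON field names used below ('stocks', 'url') are
    # assumptions and must be adapted to the real response.
    name = 'carwaleapi'
    allowed_domains = ['carwale.com']
    base_url = ('http://www.carwale.com/webapi/classified/stockfilters/'
                '?city=194&kms=0-&year=0-&budget=0-&pn=%d')
    headers = {
        'Accept': 'text/plain, */*; q=0.01',
        'Referer': 'http://www.carwale.com/used/cars-in-trichy/',
        'X-Requested-With': 'XMLHttpRequest',
        'sourceid': '1',
    }

    def start_requests(self):
        # start with page 1 of the JSON endpoint
        yield scrapy.Request(self.base_url % 1, headers=self.headers,
                             callback=self.parse, meta={'pn': 1})

    def parse(self, response):
        data = json.loads(response.body.decode('utf-8'))
        stocks = data.get('stocks', [])
        for stock in stocks:
            yield {'url': stock.get('url')}

        # keep incrementing pn until the endpoint returns no more results
        if stocks:
            pn = response.meta['pn'] + 1
            yield scrapy.Request(self.base_url % pn, headers=self.headers,
                                 callback=self.parse, meta={'pn': pn})

You would run this like any other spider, e.g. scrapy crawl carwaleapi -o output.json.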









