Scrapy crawl spider ajax pagination
I am trying to crawl a link that uses an AJAX call for pagination. I am trying to crawl the link http://www.demo.com, and in the .py file I provided this code for the XPath constraints and encoding:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector

from sum.items import sumItem


class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'),
             callback='parse_start_url', follow=True),
    )

    # override parse_start_url so the spider also scrapes the first page
    def parse_start_url(self, response):
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        item['title'] = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        item['urls'] = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        return item
My items.py file contains:
from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()
However, I am not getting the expected output: the spider does not reach all of the pages when crawling.
I hope the below code helps.
somespider.py
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

from demo.items import DemoItem


def removeUnicodes(strData):
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData)
    return strData


class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                                       webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []

        while True:
            try:
                # click the "show more cars" link until it is no longer present
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                next.click()
            except WebDriverException:
                break

            # get the data and write it to scrapy items
            item['pageurl'] = response.url
            item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
            urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
            for url in urls:
                finalurls.append(removeUnicodes(url.get_attribute("href")))
            item['urls'] = finalurls

        self.driver.close()
        return item
items.py
from scrapy.item import Item, Field


class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
Note: you need to start the Selenium RC server, because HTMLUNITWITHJS works only through Selenium RC when using Python.
Start the selenium rc server by issuing the command :
java -jar selenium-server-standalone-2.44.0.jar
Start your spider with the command :
scrapy crawl domainurls -o someoutput.json
You can check in your browser how the requests are being made.
Behind the scenes, just after you click on this show more cars button, your browser will ask for JSON data to serve the next page. You can take advantage of this fact and directly access JSON data without having to work with a JavaScript engine like Selenium or PhantomJS.
In your case, as a first step, you should simulate a user scrolling through the page given by start_urls while profiling the network requests, to discover the endpoint the browser uses for the JSON request. Browser developer tools generally have an XHR (XMLHttpRequest) section, like the one in Safari, where you can browse the list of all the resources/endpoints used to request data.
Once you find this endpoint, it's a simple task: give your spider the endpoint you just discovered as its start_url, process the JSON it returns, and adjust the page parameter to request the next page.
PS: I checked for you; the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2
In this case, my browser requested the second page, as you can see from the pn parameter. It is important that you set some header parameters before sending the request. In your case I noticed these headers:
Accept: text/plain, */*; q=0.01
Referer: http://www.carwale.com/used/cars-in-trichy/
X-Requested-With: XMLHttpRequest
sourceid: 1
User-Agent: Mozilla/5.0 ...
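The pagination logic above can be sketched without any JavaScript engine at all: build the endpoint URL with the observed query parameters and headers, and increment pn for each page. The endpoint, parameter names, and header values below are the ones noted in this answer; treat the helper as an illustrative sketch, not a drop-in spider.

```python
import urllib.parse

# JSON endpoint and headers as observed in the browser's XHR profile
BASE = "http://www.carwale.com/webapi/classified/stockfilters/"

HEADERS = {
    "Accept": "text/plain, */*; q=0.01",
    "Referer": "http://www.carwale.com/used/cars-in-trichy/",
    "X-Requested-With": "XMLHttpRequest",
    "sourceid": "1",
}

def page_url(pn, city=194):
    """Build the endpoint URL for page number `pn` (filters left open)."""
    params = {"city": city, "kms": "0-", "year": "0-", "budget": "0-", "pn": pn}
    return BASE + "?" + urllib.parse.urlencode(params)

# The URL the browser requested for the second page
print(page_url(2))
```

A spider would then request page_url(1), page_url(2), ... with these headers until the JSON response comes back empty.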