Extracting HTML with XPath fails in Scrapy because the content is loaded dynamically

Unlike in the previous question, "Extracting p inside h1 with Python / Scrapy", I have run into a case where Scrapy (for Python) does not extract the span tag inside an h4 tag.

Example HTML:

<div class="event-specifics">
 <div class="event-location">
  <h3>   Gourmet Matinee </h3>
  <h4>
   <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
  </h4>
</div>
</div>

      

I am trying to extract the "Knight Grove" text inside the span tag. In scrapy shell on the command line,

response.xpath('.//div[@class="event-location"]//span//text()').extract()

      

returns:

['Knight Grove']

      

and

response.xpath('.//div[@class="event-location"]/node()')

      

returns all of the child nodes, namely:

['\n                    ', '<h3>\n                        Gourmet Matinee</h3>', '\n                    ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n                ']

      

BUT when the same XPath is run inside the spider, nothing is returned. Take, for example, the following spider code, written to scrape the page from which the above HTML sample was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/ (some of the code has been removed as it is not relevant to the question):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # needed by the Rule below
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy import Selector
from scrapy.http import XmlResponse


class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']

    start_urls = ['https://www.clevelandorchestra.com/']

    # Follow every link on the site and pass each page to parse_item.
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            # The loader is bound to the whole response, so this XPath is
            # evaluated against the page, not against `concert`.
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')

        return thisconcert.load_item()

      

This does not populate the 'location' field. I've also tried:

thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')

      

Unlike p tags inside heading tags, which the question linked above dealt with, span tags are valid inside heading tags in HTML, if I'm not mistaken.

For clarity, the location field is defined in the Concert() item, and I've disabled all pipelines for troubleshooting.

Could the whitespace inside the h4 be invalid HTML? If not, what could be causing this?

Interestingly, doing the same task using add_css() like this:

thisconcert.add_css('location','.event-location')

      

returns the node with the span tag present, but with no inner text:

['<div class="event-location">\r\n'
          '                    <h3>\r\n'
          '                        BLOSSOM MUSIC FESTIVAL </h3>\r\n'
          '                    <h4><span '
          'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
          '                </div>']

      

To confirm that this is not a duplicate: it is true that in this particular example there is a p tag inside the span tag that is inside the h4 tag; however, the same behavior occurs when no p tag is involved, for example: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195
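
One way to tell whether the selector or the markup is at fault is to run the same path against both versions of the snippet. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (so it runs without Scrapy), and it assumes the first snippet is the populated markup and the second is the markup the spider received. The names RENDERED, DOWNLOADED, SPAN_PATH and span_texts are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed: the markup once the span has been populated.
RENDERED = """
<div class="event-specifics">
  <div class="event-location">
    <h3>Gourmet Matinee</h3>
    <h4><span id="spanEventDetailPerformanceLocation">Knight Grove</span></h4>
  </div>
</div>
"""

# Assumed: the markup the spider received, same structure but an empty span.
DOWNLOADED = """
<div class="event-specifics">
  <div class="event-location">
    <h3>BLOSSOM MUSIC FESTIVAL</h3>
    <h4><span id="spanEventDetailPerformanceLocation"></span></h4>
  </div>
</div>
"""

# Same element path as in the question, minus the trailing //text() step,
# which ElementTree's XPath subset does not support.
SPAN_PATH = './/div[@class="event-location"]//span'


def span_texts(markup):
    """Return the non-empty text of every span matched by SPAN_PATH."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(SPAN_PATH) if span.text]


print(span_texts(RENDERED))    # ['Knight Grove']
print(span_texts(DOWNLOADED))  # []
```

The path matches the span in both cases; only the second has no text to return, which suggests the text is not in the downloaded HTML rather than that the XPath is wrong.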


1 answer


This content is loaded via an Ajax call, so it is not in the HTML that Scrapy downloads. To get the data, you need to make a similar POST request yourself, and don't forget to add a header with the content type: headers = {'content-type': "application/json"}. The response you receive will be JSON.



import requests

url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])

# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
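
The request body and the response parsing above do not depend on the HTTP client, so they can be factored into two small helpers. This is a minimal sketch using only the standard library: build_payload and parse_performances are hypothetical names, the sample JSON is hand-written to match the fields the loop above reads, and the timestamp format is copied from the payload above rather than from any documentation of the service.

```python
import json
from datetime import datetime, timezone

# Endpoint and header taken from the answer above.
URL = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
HEADERS = {"Content-Type": "application/json"}


def build_payload(start, end):
    """Serialise a date range using the timestamp format shown above."""
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return json.dumps({
        "startDate": start.astimezone(timezone.utc).strftime(fmt),
        "endDate": end.astimezone(timezone.utc).strftime(fmt),
    })


def parse_performances(body):
    """Return (name, date) pairs from the JSON text of a response."""
    data = json.loads(body)
    return [(p["performanceName"], p["dateString"]) for p in data.get("d", [])]


# Hand-written sample in the shape the loop above reads; not real output.
sample = json.dumps({"d": [
    {"performanceName": "Star-Spangled Spectacular",
     "dateString": "Friday, June 30, 2017"},
]})

print(build_payload(datetime(2017, 6, 30, 21, tzinfo=timezone.utc),
                    datetime(2017, 12, 31, 21, tzinfo=timezone.utc)))
print(parse_performances(sample))
```

Inside a Scrapy spider, the same pieces could plausibly be passed to scrapy.Request(URL, method="POST", body=build_payload(...), headers=HEADERS, callback=...), with the callback handing response.text to parse_performances. That wiring is untested here.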

      
