Extracting a p inside an h1 with Python / Scrapy

I am using Scrapy to fetch data about music concerts from websites. At least one website I'm working with puts a p element inside an h1 element (which is invalid according to the W3C; see "Is it valid to have paragraph elements inside a heading tag in HTML5 (P inside H1)?"). I need to extract the text inside the p element, however, and can't figure out how to do that.

I have read the documentation and looked at example usage, but I am relatively new to Scrapy. I understand that the solution has something to do with setting the Selector type to "xml" rather than "html" so that any XML tree is recognized, but for the life of me I can't figure out how or where to do that in this case.

For example, a website has the following HTML:

<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and&nbsp;Dvořák featuring pianist     Emanuel Ax
</p>
</h1>

      

I created an item called Concert() that has a field called "title". In my item loader, I am using:

def parse_item(self, response):       
    thisconcert = ItemLoader(item=Concert(), response=response)
    thisconcert.add_xpath('title','//h1[@class="performance-title"]/p/text()')

    return thisconcert.load_item()

      

This returns, in the ['title'] field, a list of unicode strings that does not include the text inside the p element, for example:

['\n                 ', '\n                 ', '\n                ']

      

I understand why, but I don't know how to get around this. I've also tried things like:

from scrapy import Selector

def parse_item(self, response):  

    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')

      

What am I doing wrong here and how can I parse the HTML containing this problem (p inside h1)?

I have looked at the existing discussion of Scrapy's selector behavior on h1-h6 tags, but it does not provide a complete solution that can be applied to a spider, just an in-session example using a given text string.



2 answers


This was quite puzzling. To be frank, I still don't understand why it happens. It turns out that the <p> tag, which should be contained in the <h1> tag, is not. Fetching the site with curl shows markup of the form <h1><p> </p></h1>, whereas the response Scrapy receives from the site shows it as:

<h1 class="performance-title">\n</h1>
<p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>

      

As I mentioned, I have my suspicions, but nothing concrete. In any case, the XPath to get the text inside the <p> tag is therefore:



response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()

      

The expression uses <h1 class="performance-title"> as a landmark and selects the <p> tag that follows it as a sibling.
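
The extracted text still carries a non-breaking space (\xa0) and embedded newlines, as the response above shows. Below is a minimal sketch of one way to tidy it using only the standard library; `clean_title` is a hypothetical helper, not part of Scrapy.

```python
def clean_title(fragments):
    """Join extracted text fragments and collapse runs of whitespace."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes '\n' and the non-breaking space '\xa0'.
    return " ".join(" ".join(fragments).split())


# Sample input copied from the response shown in this answer.
raw = [
    "Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\n"
    "pianist Emanuel Ax\n"
]
title = clean_title(raw)
print(title)
```

Note that this turns the non-breaking space into an ordinary space, which may or may not be what you want for display purposes.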



//*[@id="content"]/section/article/section[2]/h1/p/text()

      


