Behavior of the XPath selector with h1-h6 tags

Why do the following two pieces of code give different outputs? The only difference is that the h1 tag in the first case is replaced by an h tag in the second. Is it because h1 has a special "meaning" in HTML? I tried h1 through h6, and they all give [] as output, while a non-standard tag such as h7 gives [u'xxx'] as output.

from scrapy import Selector # scrapy version: 1.2.2

text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []

text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']

      



2 answers


Putting p tags inside h1-h6 elements is invalid according to the W3C. You can read more about it here.

Anyway, to get around this and work with any XML structure, you can simply change the selector type like this:



sel = Selector(text="anyxml", type="xml")

      

This will respect any XML structure.
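For example, here is a minimal sketch applying that to the snippet from the question (assuming Scrapy 1.2.2 on Python 2, as in the question; on Python 3 the output would be ['xxx'] instead):

from scrapy import Selector

text = '<h1><p>xxx</p></h1>'
# With type="xml", lxml's XML parser keeps the original nesting,
# so the p element stays inside h1 and the original XPath matches.
print Selector(text=text, type="xml").xpath('//h1/p/text()').extract()
# expected output: [u'xxx']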


The short answer is that h1 through h6 elements should not contain <p> in well-formed HTML documents, or at least lxml (which powers Scrapy Selectors) does not like it when parsing HTML. lxml handles badly formed markup, but in this case the result is slightly different from what you might expect.

You can check how lxml parses and serializes the HTML snippet:

>>> from scrapy import Selector
>>> text = '<h1><p>xxx</p></h1>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h1></h1><p>xxx</p></body></html>

      

So, when lxml comes across a p tag inside an h1, it moves it after the h1. The p element is not lost, but it is not where you would expect it when reading the HTML source.
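Knowing where lxml puts the element, you can also reach it without switching parser types; here is a small sketch based on the serialized tree shown above:

>>> from scrapy import Selector
>>> Selector(text='<h1><p>xxx</p></h1>').xpath('//h1/following-sibling::p/text()').extract()
[u'xxx']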



Compare this with the other snippet:

>>> text = '<h><p>xxx</p></h>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h><p>xxx</p></h></body></html>
>>> 

      

An h element doesn't mean anything special to lxml, so a p inside an h is left where it is.







