Behavior of the XPath selector on h1-h6 tags
Why do the following two pieces of code give different outputs? The only difference between them is that the tag h1 in the first case is replaced by the tag h in the second. Is it because the tag h1 has a special "meaning" in HTML? I tried h1 through h6, and they all give [] as output; starting with h7, the output becomes [u'xxx'].
from scrapy import Selector # scrapy version: 1.2.2
text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []
text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']
The short answer is that h1 through h6 should not contain <p> in well-formed HTML documents, or at least lxml (which powers Scrapy Selectors) does not accept it when parsing HTML. lxml tolerates malformed markup, but in this case it repairs the input by rearranging the elements rather than keeping the nesting.
You can check how lxml parses and serializes the HTML snippet:
>>> from scrapy import Selector
>>> text = '<h1><p>xxx</p></h1>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h1></h1><p>xxx</p></body></html>
So, when lxml comes across a p tag inside h1, it closes the h1 and places the p after it. The p element is not lost, but it is not where you would expect it from reading the HTML source.
Compare this with the other snippet:
>>> text = '<h><p>xxx</p></h>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h><p>xxx</p></h></body></html>
>>>
The h element doesn't mean anything special to lxml, so a p inside h is left as is.