XPath - extract text between two nodes
I'm having trouble with an XPath query. I need to parse a div that is divided into an unknown number of "sections". Each section starts with an h5 containing the section name. The list of possible section names is known, and each name can appear at most once. In addition, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected output (for SecondHeader)
['text2a', 'text2b']
Request # 1
//text()[following-sibling::h5/text()='ThirdHeader']
Result # 1
['text1', 'text2a', 'text2b']
This is clearly too much, so I decided to restrict the result to the content between the selected header and the header before it.
Request # 2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result # 2
['text2a', 'text2b']
The result is as expected. However, I can't use this query: I don't know whether both SecondHeader and ThirdHeader will exist on the parsed page. The query must rely on a single section header only.
Request # 3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result # 3
[]
Could you tell me what I am doing wrong? I tested it on Google Chrome.
If all the h5 elements and the text nodes are siblings and you need to group them by section, one option is to select the text nodes by the number of h5 elements that precede them.
Example using lxml (in Python):
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>
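Since the question asks for a query that depends on a single header name rather than a position, here is a minimal sketch of one way to do that, under the same assumption that the h5 elements and the text nodes are all siblings. It selects the text nodes whose nearest preceding h5 sibling has the given name. The helper name section_text and the $header variable are my own, not part of lxml or of the original question.

```python
import lxml.html

HTML = """
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>"""


def section_text(doc, header):
    """Hypothetical helper: return the stripped text lines under one h5 header.

    preceding-sibling is a reverse axis, so h5[1] is the *nearest*
    preceding h5 sibling. A text node is kept only if that nearest
    header has the requested name.
    """
    nodes = doc.xpath(
        "//text()[preceding-sibling::h5[1][normalize-space()=$header]]",
        header=header,
    )
    # Drop whitespace-only nodes (e.g. the newline after a trailing <br>).
    return [n.strip() for n in nodes if n.strip()]


doc = lxml.html.fromstring(HTML)
print(section_text(doc, "SecondHeader"))  # ['text2a', 'text2b']
```

If the header is missing from the page, the query simply returns an empty list, so no other header needs to be present. As for Request # 3 in the question: not is a function in XPath and is written not(...); with square brackets, not[...] is parsed as a child element named not, which is why nothing matches.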