XPath - extract text between two nodes

Question

I am having a problem with my XPath query. I need to parse a div that is divided into an unknown number of "sections". Each section starts with an h5 containing the section name. The list of possible section names is known, and each name can appear only once. In addition, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".

HTML

<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>

Expected output (for SecondHeader)

['text2a', 'text2b']

Request # 1

//text()[following-sibling::h5/text()='ThirdHeader']

Result # 1

['text1', 'text2a', 'text2b']

This is clearly too much, so I decided to restrict the result to the content between the selected header and the one before it.

Request # 2

//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']

Result # 2

['text2a', 'text2b']

The results are in line with expectations. However, this cannot be used: I don't know whether SecondHeader or ThirdHeader will exist on the parsed page. The request must rely on only one section header.

Request # 3

//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]

Result # 3

[]

Could you tell me what I am doing wrong? I tested it on Google Chrome.

xpath

mimol Feb 24 '16 at 21:43

2 answers

If all the h5 elements and the text nodes are siblings and you need to group them by section, one option is to select the text nodes by the number of h5 elements that precede them.

Example using lxml (in Python):

>>> import lxml.html
>>> s = '''
... <div class="some-class">
...  <h5>FirstHeader</h5>
...   text1
...  <h5>SecondHeader</h5>
...   text2a<br>
...   text2b
...  <h5>ThirdHeader</h5>
...   text3a<br>
...   text3b<br>
...   text3c<br>
...  <h5>FourthHeader</h5>
...   text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n  text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n  text2a', '\n  text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n  text3a', '\n  text3b', '\n  text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n  text4\n']
>>>
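
The session above hard-codes the section index. Since the question only knows the header name, here is a minimal sketch that derives the count from the name first. The helper name `section_by_count` is hypothetical, and the whitespace stripping is an added assumption about the desired output:

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""


def section_by_count(doc, header):
    """Return the stripped text nodes of the section named `header`.

    Hypothetical helper: looks up the h5 by name, then reuses the
    count(preceding-sibling::h5) query from the session above.
    """
    matches = doc.xpath("//h5[normalize-space()=$header]", header=header)
    if not matches:
        # Header is absent from this page: nothing to extract.
        return []
    # Text in the Nth section has exactly N h5 elements before it.
    count = len(matches[0].xpath("preceding-sibling::h5")) + 1
    nodes = doc.xpath(
        "//text()[count(preceding-sibling::h5)=$count]", count=count
    )
    # Drop whitespace-only nodes and trim the rest.
    return [t.strip() for t in nodes if t.strip()]


doc = lxml.html.fromstring(HTML)
second = section_by_count(doc, "SecondHeader")
```

The explicit check for a missing header matters: folding the lookup into a single XPath expression would make the count evaluate to 1 for an unknown name and silently return the first section.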

paul trmbrth Feb 24 16 at 22:54

Daniel Haley · Accepted Answer · 2016-02-24T23:04:20+0000

You should just check the first preceding sibling h5...

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
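
As a rough illustration, the expression can be run with lxml against the markup from the question. This is only a sketch: the helper name `section_text` is hypothetical, and stripping whitespace in Python is an assumption about the desired output, not part of the XPath itself.

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""


def section_text(doc, header):
    """Return the stripped text nodes whose nearest preceding h5 is `header`."""
    nodes = doc.xpath(
        # preceding-sibling is a reverse axis, so h5[1] is the closest
        # h5 before the text node, i.e. the header of its section.
        "//text()[preceding-sibling::h5[1][normalize-space()=$header]]",
        header=header,  # bound as an XPath variable, no string quoting needed
    )
    # Drop whitespace-only nodes and trim the rest.
    return [t.strip() for t in nodes if t.strip()]


doc = lxml.html.fromstring(HTML)
second = section_text(doc, "SecondHeader")
```

Only the one header name is used, so the query does not depend on whether the following section exists; a header that is missing from the page simply yields an empty list.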
