How do I get the XPath of text between <br> or <br/>?

Question

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange

Assuming the markup above, how can XPath be used to capture each fruit? The solution must use XPath of some kind.

Should I use substring-after(following-sibling...)?

EDIT: I am using the Nokogiri parser.

ruby xpath

asdfasdfa 28 Sep '09 at 3:55

3 answers

Marc Gravell · Answer 1 · 2009-09-28T03:59:27+0000

Well, you can use "//br/text()", but that returns the text nodes inside the <br> tags, and <br> elements contain no text. Also, since the markup above is not well-formed XML, I'm not sure how you are going to run XPath on it. Regular expressions are usually a poor fit for HTML, but there are HTML (not XHTML) parsers available. I hesitate to suggest one for Ruby simply because it is not "my area" and I would just be searching ...

Orion Edwards · Answer 2 · 2009-09-28T04:02:18+0000

There are several issues here:

XPath works with XML; what you have is HTML that is not XML (basically the tags don't match, so an XML parser throws an exception when you give it that text).
XPath also usually works by finding attributes within tags. Your <br> tags don't actually contain the text, they just sit in between the pieces of it, so this will be tricky too.

Because of this, you probably want to use XPath (or similar) to get the content of the div, then split the string on <br> occurrences.

As you tagged this question with ruby, I would suggest looking into hpricot, as it is a really good and fast HTML (and XML) parsing library that should be a lot more useful than wrestling with XPath.

andre-r · Answer 3 · 2009-09-28T13:48:58+0000

Try the following, which gets all text siblings of the <br> tags as an array of strings stripped of leading and trailing whitespace:

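require 'rubygems'
require 'nokogiri'

# Parse the markup after __END__ with Nokogiri's lenient HTML parser
doc = Nokogiri::HTML(DATA)

# Select every text node that is a sibling of a <br>, then trim whitespace
fruits =
  doc.xpath('//br/following-sibling::text()
           | //br/preceding-sibling::text()').map { |fruit| fruit.to_s.strip }

puts fruits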

__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange

Is this what you want?
