
How do I get the XPath of text between <br> or <br/>?

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange

      

Given the markup above, how can XPath be used to capture each fruit? The solution must use XPath of some kind.

Should I use substring-after(following-sibling...)?

EDIT: I am using the Nokogiri parser.

+2




3 answers


Well, you can use "//br/text()", but this will return all text nodes inside the <br> tags. But since the above is not well-formed XML, I'm not sure how you are going to use XPath on it. Regular expressions are usually a bad fit for HTML, but there are HTML (not XHTML) parsers available. I hesitate to suggest one for Ruby simply because it is not "my area" and I would just be searching ...



+4




There are a couple of issues here:

  • XPath works with XML - you have HTML that is not XML (basically the tags don't match, so the XML parser throws an exception when you give it that text)

  • XPath usually works by detecting attributes within tags as well. If your <br> tags don't actually contain the text and just sit in between it, that would be tricky too



Because of this, you probably want to use XPath (or similar) to get the content of the div, then split the string on <br> occurrences.
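
The splitting step can be sketched roughly as follows. This is only an illustration under stated assumptions: the variable inner_html is hypothetical and stands in for whatever string your parser returns for the div's content, and the regular expression is only meant to match the plain <br>, <br/> and <br /> forms, not <br> tags carrying attributes.

```ruby
# Hypothetical input: the inner HTML of the <div>, as a parser might return it.
inner_html = "\napple\n<br>\nbanana\n<br/>\nwatermelon\n<br>\norange\n"

# Split on <br>, <br/> or <br /> (case-insensitive), trim surrounding
# whitespace from each piece, and drop any pieces that end up empty.
fruits = inner_html.split(%r{<br\s*/?>}i).map(&:strip).reject(&:empty?)

puts fruits
```

This only uses String#split from the Ruby core library, so it does not depend on how the div's content was obtained.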

Since you tagged this question with ruby, I would suggest looking into hpricot, as it is a really good and fast HTML (and XML) parsing library that should be a lot more useful than wrestling with XPath

+1




Try the following, which gets all text siblings of the <br> tags as an array of strings stripped of leading and trailing whitespace:

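require 'rubygems'
require 'nokogiri'

# Parse the HTML that follows __END__ (Ruby exposes it as DATA)
doc = Nokogiri::HTML(DATA)

# Union of the text nodes on either side of every <br>
fruits =
  doc.xpath('//br/following-sibling::text()
           | //br/preceding-sibling::text()').map { |fruit| fruit.to_s.strip }

puts fruits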

__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange

      

Is this what you want?

+1

