Finding all elements before an h2 element in Hpricot / Nokogiri
I am trying to parse a Wiktionary entry to get all the English definitions. I can extract all the definitions, but the problem is that some of them belong to other languages. What I would like to do is get just the block of HTML that contains the English definitions. I found that when a page has entries for other languages, the heading that follows the English section can be obtained with:
header = (doc/"h2")[3]
So I would like to search only the elements that come before this heading. I thought this would be possible with header.preceding_siblings(), but it doesn't seem to work. Any suggestions?
You can use the visitor pattern with Nokogiri. This code removes everything from the h2 of the next language onward:
require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node # the node at which removal starts
  end

  def visit(node)
    # Once the start node has been seen, remove it and everything visited after it.
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

# The page is HTML, so use the HTML parser; URI.open is needed on current Ruby versions.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2] # In this case, the Italian h2 is at index 2. Your page may differ
doc.root.accept(Visitor.new(node)) # Removes all page contents starting from node
The following code uses Hpricot.
It gets the text from the English heading (h2) up to the next heading (h2), or up to the footer if there are no further languages:
require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))
  # Locate the "English" headline span; its parent is the h2 heading.
  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  # The section ends at the next h2, or at the print footer if English comes last.
  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  # Serialize every node between the two boundaries.
  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end
Example:
get_english_definition "http://en.wiktionary.org/wiki/gift"