Finding all elements before h2 element in hpricot / nokogiri

Question

I am trying to parse a Wiktionary entry to get all the English definitions. I can extract all the definitions, but some of them belong to other languages. What I would like is to get just the block of HTML that contains the English definitions. I found that when the page has entries for other languages, the heading that follows the English section can be obtained with:

header = (doc/"h2")[3]

So I would like to search only the elements that come before this heading. I thought this would be possible with header.preceding_siblings(), but it doesn't seem to work. Any suggestions?

ruby parsing nokogiri wiktionary hpricot

Dave 21 Sep '09 at 12:46

3 answers

Pesto · Answer 1 · 2009-09-22T14:53:23+0000

You can use the visitor pattern with Nokogiri. This code removes everything from the h2 of the next language onward:

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

# Parse the page as HTML. URI.open replaces the bare open(), which no longer accepts URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.

doc.root.accept(Visitor.new(node))  # Removes all page contents starting from node

andre-r · Answer 2 · 2009-09-23T01:56:42+0000

The following code uses Hpricot.
It gets the text from the English heading (h2) up to the next heading (h2), or up to the footer if there are no additional languages:

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"

JellicleCat · Answer 3 · 2011-08-18T18:25:42+0000

For Nokogiri:

doc = Nokogiri::HTML(code)
stop_node = doc.css('h2')[3]
doc.traverse do |node|
  break if node == stop_node
  # else, do whatever, e.g. `puts node.name`
end

This will traverse all nodes preceding any node you designate as stop_node on line 2.
