Target text without tags with Nokogiri

Question

I have some very simple HTML that I'm trying to parse with Nokogiri (in Ruby):

<span>Address</span><br />
123 Main Street<br />
Sometown<br />
<span>Telephone</span><br />
<a href="tel:212-555-555">212-555-555</a><br />

    <span>Hours</span><br />
    M-F: 8:00-21:00<br />
       Sat-Sun: 8:00-21:00<br />
<hr />

The only tag I have is the surrounding <div> for the page content. Each of the things I want is preceded by a label tag such as <span>Address</span>. It can be followed by another span, or by the hr at the end.

I would like to get the address ("123 Main Street\nSometown"), phone number ("212-555-555") and opening hours as separate fields.

Is there a way to get the information using Nokogiri, or would it be easier to do this with regular expressions?

ruby regex text-parsing nokogiri

nevan king 13 Feb At 16:34

2 answers

I was thinking about (or rather, learning) XPath:

d.xpath("span[2]/preceding-sibling::text()").each {|i| puts i}
# 123 Main Street
# Sometown

d.xpath("a/text()").text
# "212-555-555"

d.xpath("span[3]/following::text()").text.strip
# "M-F: 8:00-21:00       Sat-Sun: 8:00-21:00"

The first one starts at the second span and selects the text() nodes that come before it.
You can try a different approach here: start at the first span, select the text() nodes that follow it, and finish with a predicate that checks for a following span.

d.xpath("span[1]/following::text()[following-sibling::span]").each {|i| puts i}
# 123 Main Street
# Sometown

If the document has more spans, you can select the right one by its text instead of its position:
span[x] can be replaced with span[contains(.,'text-in-span')]

For example, span[3] == span[contains(.,'Hours')]

Correct me if something is really wrong.

AD 13 Feb 13 at 22:55

maerics · Accepted Answer · 2013-02-13T20:09:42+0000

Using Nokogiri and XPath, you can do something like this:

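require 'nokogiri'

# Map each <span> label to the text that follows it, up to the next <span>.
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    # Collect sibling text until the next label or the end of the container.
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end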

extract_span_data(html_string)
# {
#   "Address"   => "123 Main Street\nSometown",
#   "Telephone" => "212-555-555",
#   "Hours"     => "M-F: 8:00-21:00\n       Sat-Sun: 8:00-21:00"
# }

Using a proper parser is easier and more robust than using regular expressions (a well-documented Bad Idea™).
