Targeting text that has no tags with Nokogiri
I have some very simple HTML that I'm trying to parse with Nokogiri (in Ruby):
<span>Address</span><br />
123 Main Street<br />
Sometown<br />
<span>Telephone</span><br />
<a href="tel:212-555-555">212-555-555</a><br />
<span>Hours</span><br />
M-F: 8:00-21:00<br />
Sat-Sun: 8:00-21:00<br />
<hr />
The only tag I have is the surrounding <div> for the page content. Each of the things I want is preceded by a <span> label such as <span>Address</span>. It can be followed by another span, or by the hr at the end.
I would like to get the address ("123 Main Street\nSometown"), phone number ("212-555-555") and opening hours as separate fields.
Is there a way to get the information using Nokogiri, or would it be easier to do this with regular expressions?
Using Nokogiri and XPath, you can do something like this:
require 'nokogiri'

def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  # Each <span> is a label; its value is the text of every sibling up to the next <span>.
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end

extract_span_data(html_string)
# {
#   "Address" => "123 Main Street\nSometown",
#   "Telephone" => "212-555-555",
#   "Hours" => "M-F: 8:00-21:00\nSat-Sun: 8:00-21:00"
# }
Using a proper parser is easier and more reliable than using regular expressions (a well-documented Bad Idea™).
I was thinking about (or rather, learning) XPath:
d.xpath("span[2]/preceding-sibling::text()").each {|i| puts i}
# 123 Main Street
# Sometown
d.xpath("a/text()").text
# "212-555-555"
d.xpath("span[3]/following::text()").text.strip
# "M-F: 8:00-21:00 Sat-Sun: 8:00-21:00"
The first one starts at the second span and selects the text() nodes that come before it.
You can also try a different approach here: start at the first span, select the text() nodes that follow it, and finish with a predicate that checks for a following span.
d.xpath("span[1]/following::text()[following-sibling::span]").each {|i| puts i}
# 123 Main Street
# Sometown
If the document has more spans, you can start at the right one by matching on its text instead of its position: span[x] can be replaced with span[contains(.,'text-in-span')]. For example, span[3] is equivalent to span[contains(.,'Hours')].
Correct me if something is really wrong.