Ruby: parsing <a> link information from a Nokogiri::XML::NodeSet

Question

I pulled a Nokogiri::XML::NodeSet from the page, and here is the result:

<a href="http://www.goldsteinpatentlaw.com" target="_blank" title="Goldstein Patent Law ( U.S.A. )">
    <img src="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg" height="62" width="100" alt="Goldstein Patent Law (U.S.A.)">
</a>

I can't figure out how to turn the <a> tag into a Mechanize/Nokogiri-parsed object so that I can easily extract bits of information from the link.

The Nokogiri and Mechanize docs are really confusing because I never know which one to look at. I'm not sure which one is being used first, and so on. It all seems very complicated for the simple scraping and parsing I am trying to do.

+3

ruby web-scraping mechanize

Kyle carlson 17 Sep 14 at 16:17

3 answers

A NodeSet is like an array. If you call puts on a NodeSet then, as with puts on an array, Ruby prints the string representation of each element on a separate line. A NodeSet can contain various kinds of nodes, but usually it contains Nokogiri::XML::Element objects, which represent the tags in your HTML.

You can see from your output that your NodeSet has only one element, and what you see is the string representation of that element. Here's an example:

require 'nokogiri'

str = "<div>hello</div><div>world</div>"
html_doc = Nokogiri::HTML(str)

divs = html_doc.xpath("//div")

divs.each do |div|
  p div
end

puts '*' * 10
puts divs


    --output:--
#<Nokogiri::XML::Element:0x80836ec4 name="div" children=[#<Nokogiri::XML::Text:0x80836a00 "hello">]>
#<Nokogiri::XML::Element:0x80836668 name="div" children=[#<Nokogiri::XML::Text:0x80836064 "world">]>
**********
<div>hello</div>
<div>world</div>

So, you just need to get the first element of your NodeSet, just like you would retrieve the first element in an array:

p divs[0]

Or, if you know that there will only be one element in your NodeSet, you can use:

div = html_doc.at_xpath("//div")

which, instead of returning a NodeSet, simply returns the first element matching the xpath.

When you really want to know what you have, use p instead of puts.

+2

7stud 17 Sep 14 at 17:06

Maybe a little late here, but for more details on NodeSets, see here: http://www.rubydoc.info/gems/nokogiri/Nokogiri/XML/NodeSet#attr-instance_method

According to their docs, this is the code I used to do what you were trying to do and it works!

result.search("h2 > a").attr("href")

0

John H. Jan 16 17 at 18:11

engineersmnky · Accepted Answer · 2014-09-17T16:49:51+0000

Is this what you are looking for?

require 'nokogiri'
str = '<a href="http://www.goldsteinpatentlaw.com" target="_blank" title="Goldstein Patent Law ( U.S.A. )">
          <img src="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg" height="62" width="100" alt="Goldstein Patent Law (U.S.A.)">
       </a>'
doc = Nokogiri::HTML(str)
link = doc.at('a')
#=> #<Nokogiri::XML::Element:0x1744488 name="a" attributes=[
     #<Nokogiri::XML::Attr:0x174444c name="href" value="http://www.goldsteinpatentlaw.com">, 
     #<Nokogiri::XML::Attr:0x1744440 name="target" value="_blank">,
     #<Nokogiri::XML::Attr:0x1744434 name="title" value="Goldstein Patent Law ( U.S.A. )">] children=[#<Nokogiri::XML::Text:0x1743d20 "\n    ">, 
     #<Nokogiri::XML::Element:0x1743c9c name="img" attributes=[#<Nokogiri::XML::Attr:0x1743c60 name="src" value="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg">, 
     #<Nokogiri::XML::Attr:0x1743c54 name="height" value="62">, #<Nokogiri::XML::Attr:0x1743c48 name="width" value="100">, 
     #<Nokogiri::XML::Attr:0x1743c3c name="alt" value="Goldstein Patent Law (U.S.A.)">]>,
     #<Nokogiri::XML::Text:0x17433d8 "\n">]>

You can use at, at_css, or at_xpath with a selector to get the element you want, and then do something like:

link.attributes["href"].value
#=> "http://www.goldsteinpatentlaw.com"
link.attributes["title"].value
#=> "Goldstein Patent Law ( U.S.A. )"
