Parsing url images nokogiri

Question

Parsing url images nokogiri

I need to parse an image url from HTML like this:

<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>

So far I have been using Nokogiri to parse tags <h2>

with:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("http://blog.website.com/"))
headers = page.css('h2')

puts headers.text

I have two questions:

How can I parse the image url?
Ideally I will be printing to the console in this format:

 1. 
Header 1
image_url 1
image_url 2 (if any)
 2. 
Header 2
2image_url 1
2image_url 2 (if any)

And so far I have not been able to print my titles in this good format. How can i do this?

<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
          <p class="post_author"><em>by</em> author</p>
          <div class="format_text">
    <p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3"  target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2"  target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3"  target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8"  target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
        </div>  
          <p class="to_comments"><span class="date">February 15, 2013</span> &nbsp; <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>

+3

ruby parsing nokogiri

Steven harlow Feb 20 13 at 0:09

source to share

4 answers

To get images just look for tags img

with the attribute src

.

If you want to h2

be associated with each image you can do this:

doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts "  Image: #{img['src']}"
end

Note that the switch in XPath was for axis preceding::

.

EDIT

To group by title, you can put them in a hash:

headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end

To get your output:

headers.each do |h,urls|
  puts "#{h} #{urls.join(' ')}"
end

+5

Mark thomas Feb 20 At 1:10

source to share

The code I used. Feel free to criticize (I'll probably learn from him):

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://blog.website.com/"))

doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i+1
  puts " Title: #{header.text}"
  puts "  Image 1: #{header.xpath('following::img[1]')[0]["src"]}"
  puts "  Image 2: #{header.xpath('following::img[2]')[0]["src"]}"
end

0

Steven harlow 22 Feb At 1:26

source to share

I did something similar once (I wanted to get the same result in reality). This solution is pretty easy to follow:

Depending on your DOM structure, you can do something like:

body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')
post_counter = 1

body.each_with_index do |body,index| 
   header = headers[index]
   puts "#{post_counter}. " + header
   body.css('p a img, div > img').each{|img| puts img['src'] if img['src'].match(/\Ahttp/) }
   post_counter += 1
end

So basically, you check every header with 1 or more images. The page I was processing had headers outside of the div div, so I used two different variables to find them (body / headers). Also, I targeted two classes when searching for images, since this way was structured.

This should give you the good clean result you wanted.

Hope this helps!

0

Laura M 23 oct. 13 at 0:02

source to share

pguardiario · Accepted Answer · 2013-02-20T07:41:14+0000

I think it makes sense to group the h2 first:

doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end

Parsing url images nokogiri

More articles: