Parsing image URLs with Nokogiri
I need to parse an image URL from HTML like this:
<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>
So far I have been using Nokogiri to parse <h2> tags with:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# URI.open is the open-uri entry point; bare open(url) no longer works on Ruby 3.0+.
page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')
puts headers.text
I have two questions:
- How can I parse the image URL?
- Ideally I would print to the console in this format:
1. Header 1
   image_url 1
   image_url 2 (if any)
2. Header 2
   2image_url 1
   2image_url 2 (if any)
So far I have not been able to print my titles in this format. How can I do this?
For reference, the HTML of one post looks like this:
<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
<p class="post_author"><em>by</em> author</p>
<div class="format_text">
<p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3" target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2" target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3" target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8" target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
</div>
<p class="to_comments"><span class="date">February 15, 2013</span> <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>
To get the images, just look for img tags and read their src attribute.
If you want the h2 associated with each image, you can do this:
doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts " Image: #{img['src']}"
end
Note that I switched from CSS selectors to XPath in order to use the preceding:: axis, which CSS selectors cannot express.
EDIT
To group by title, you can put them in a hash:
# Each missing key starts out as an empty array of image URLs.
headers = Hash.new { |h, k| h[k] = [] }
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end
To get your output:
headers.each do |h, urls|
  puts "#{h} #{urls.join(' ')}"
end
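For the numbered layout asked for in the question, one possible variation on that loop uses each_with_index. This sketch uses a hand-built hash with made-up titles and URLs in place of the scraped data, and format_posts is a hypothetical helper name:

```ruby
# Hand-built stand-in for the scraped hash: header text => image URLs.
headers = {
  'Header 1' => ['http://blog.website.com/a1.jpg', 'http://blog.website.com/a2.jpg'],
  'Header 2' => ['http://blog.website.com/b1.jpg']
}

# Build one numbered line per header, followed by its indented image URLs.
def format_posts(headers)
  lines = []
  headers.each_with_index do |(header, urls), i|
    lines << "#{i + 1}. #{header}"
    urls.each { |url| lines << "   #{url}" }
  end
  lines.join("\n")
end

puts format_posts(headers)
```

Ruby hashes keep insertion order, so the numbering follows the order in which the headers were first seen in the document.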
This is the code I ended up using. Feel free to criticize it (I'll probably learn from it):
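require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; bare open(url) no longer works on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))
# Each post title is the bookmark link inside an <h2>.
doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i + 1
  puts " Title: #{header.text}"
  # at_xpath returns nil when there is no such image, so guard with &.
  # Note that following:: can reach into the next post if this one has fewer images.
  puts " Image 1: #{header.at_xpath('following::img[1]')&.[]('src')}"
  puts " Image 2: #{header.at_xpath('following::img[2]')&.[]('src')}"
end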
I did something similar once (I actually wanted pretty much the same result). This solution is pretty easy to follow.
Depending on your DOM structure, you can do something like:
body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')
# Assumes the nth header belongs to the nth div.format_text.
body.each_with_index do |post, index|
  header = headers[index]
  puts "#{index + 1}. #{header.text}"
  # Print only absolute image URLs; to_s guards against a missing src.
  post.css('p a img, div > img').each { |img| puts img['src'] if img['src'].to_s.start_with?('http') }
end
So basically, you walk through each post and print its header followed by its images (one or more). The page I was processing had the headers outside of the content div, so I used two separate variables to find them (body / headers). Also, I targeted two selectors when searching for images, since that is how the page was structured.
This should give you the nice, clean result you wanted.
Hope this helps!