Open-uri + hpricot & nokogiri do not parse html correctly
I am trying to parse a webpage using open-uri + Hpricot, but something seems to go wrong in the parsing process: the gems do not give me what I want.
Specifically, I want to get this div (whose id is 'pasajes') from this URL (http://www.despegar.com.ar/, the one opened in the code below):
I am writing this code:
require 'nokogiri'
require 'hpricot'
require 'open-uri'
document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI
pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")
But it returns NOTHING! I've tried many things in both Hpricot and Nokogiri:
- Giving an absolute XPath to the div
- The CSS way, with selectors
- The Hpricot search shortcut (doc/"div#pasajes")
- Almost every possible relative path that reaches the 'pasajes' div
Finally, I found a terrible solution: I used the Watir library, and after it opened the web browser, I passed the resulting HTML to Hpricot. That way Hpricot DID find the 'pasajes' div. But I don't want to open a web browser just for parsing purposes ...
What am I doing wrong? Is open-uri not doing its job? Is Hpricot?
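One quick way to diagnose this, sketched below with plain string checks: look for the id in the *raw* HTML that open-uri downloads. If it is not there, the div is injected later by JavaScript, which open-uri (unlike Watir, which drives a real browser) never executes. The `static_html` string and the `in_static_html?` helper are hypothetical stand-ins for the real download.

```ruby
# In real use you would fetch the raw markup with:
#   require 'open-uri'
#   static_html = open('http://www.despegar.com.ar/').read

# Hypothetical helper: does the raw markup contain an element with this id?
def in_static_html?(html, id)
  html.include?(%(id="#{id}")) || html.include?(%(id='#{id}'))
end

# Hypothetical static snapshot standing in for the downloaded page.
static_html = '<html><body><div id="header"></div></body></html>'
puts in_static_html?(static_html, 'pasajes')  # false => likely rendered by JS
```

A `false` here while the browser shows the div is strong evidence that no pure HTML parser (Hpricot or Nokogiri) can ever find it in the static source.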
This is better suited as an additional comment on Jonas' answer above than as an answer itself ... but I'm new to SO and don't have the "comment" privilege yet :)
You can use Selenium RC to download the complete rendered HTML and then run Nokogiri on the downloaded file. Note that this is only necessary when the content is generated / modified by JavaScript. If the web page also relies on cookies to customize the content, your options are Selenium (in the browser) or Watir, as you noted.
I would love to hear a better solution to this question (I want to parse a web page using Nokogiri, but the page is modified by JS).
I faced a similar problem with Nokogiri, but on OS X 10.5. First, though, I tried open-uri on its own to open the pages in question, which contain a lot of HTML (divs, p tags, etc.). Using:
urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line { |line| puts line }
I would see plenty of good HTML. I also found that by reading the page into a String first and passing that to Nokogiri, I could get everything to work fine. I even had to change the demo they use on Rubyforge to teach you Nokogiri accordingly.
Using my own example, I get this:
>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>
>> doc.children
=>
YUCK!
If I switch to reading the URL into a String first, I get the good stuff:
>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>
Note that I see this lovely warning when I play around in irb:
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has lots of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requiring nokogiri.
But I am not in the mood to deal with the horror stories: one expert advises building a fixed libxml2 under /usr/local, and his post has a lot of explanation, but then another *nix guru attacks the very idea with dire warnings about the problems it causes. So I say no.
Why am I writing this? Because IMO there might be a link between my Nokogiri blues and the libxml2 warning: OS X 10.5 ships older versions of this stuff, and they might have problems with it.
Question
Do other OS X 10.5 users have this problem with Nokogiri?