Open-uri + hpricot & nokogiri do not parse HTML correctly

I am trying to parse a webpage using open-uri + hpricot, but there seems to be a problem in the parsing process as the gems are not getting me what I want.

Specifically, I want to get the div whose id is 'pasajes' at this URL:

http://www.despegar.com.ar

I am writing this code:

require 'nokogiri'
require 'hpricot'
require 'open-uri'

document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI

pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")

      

But it returns NOTHING! I've tried many things in both hpricot and nokogiri:

  • Giving an absolute path to this div
  • The CSS way, with selectors
  • The hpricot search shortcut (doc/"div#pasajes")
  • Almost every possible relative path to reach the 'pasajes' div

Finally, I found a terrible solution. I used the watir library and, after opening the web browser, passed the HTML to hpricot. That way hpricot DID find the 'pasajes' div. But I don't want to open a web browser just for parsing purposes ...

What am I doing wrong? Is open-uri not performing well? Is hpricot?

+2




4 answers


There is no div with id pasajes in the static HTML of the page. If you are using *nix you can check that with:

curl http://www.despegar.com.ar/ | grep pasajes

      



I am assuming it is generated by JavaScript.
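The same check can be scripted in Ruby before reaching for a parser. This is a minimal sketch, assuming the page is fetched with open-uri as in the question. The helper name `raw_html_has_id?` is made up for illustration, and the regular expression is only a rough textual test, not an HTML parser:

```ruby
# Hypothetical helper (not part of open-uri, hpricot, or nokogiri):
# reports whether the raw, un-rendered HTML contains an id attribute
# with exactly the given value.
def raw_html_has_id?(html, id)
  pattern = /(?<![\w-])(?i:id)\s*=\s*["']?#{Regexp.escape(id)}["']?(?=[\s>\/]|\z)/
  !!(html =~ pattern)
end

# Usage with open-uri (network access assumed; on Ruby 3.x use URI.open):
#   require 'open-uri'
#   html = open('http://www.despegar.com.ar/').read
#   puts raw_html_has_id?(html, 'pasajes')
```

If this returns false while the browser shows the element, the element is most likely being added after page load, and no static parser will find it.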

If you are using MacRuby, you can try Lyndon.

+4




There is no div with id 'pasajes' on this page. That is the problem.



+3




This is better suited as a comment on Jonas' answer above than as an answer in itself ... But I'm new to SO and can't post comments yet :)

You can use Selenium RC to download the complete HTML and then run nokogiri on the downloaded file. Please note that this will only work if the content is generated/modified by JavaScript. If a web page relies on cookies to customize content, your options are Selenium (in the browser) or watir, as you noted.

I would love to hear a better solution to this question (want to parse a web page using nokogiri, but the page was modified by JS).

+1




I faced a similar problem with Nokogiri, but on OS X 10.5. I first tried open-uri on its own to open the pages in question, which have a lot of HTML divs and p elements. I found that using:

urldoc = open('http://hivelogic.com/articles/using_usr_local')
# each_line yields every line; readlines would ignore the block
urldoc.each_line { |line| puts line }

      

I would see a lot of perfectly good HTML. I also found that by reading the "file" into a string and passing that to Nokogiri, I could get it to work fine. I even had to change the demo they use on RubyForge to teach you Nokogiri.

Using my own example, I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=> 

      

YUCK!

If I instead read the URL's contents into a string first, I get good results:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>
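The workaround above can be wrapped in a small helper so that the parser always receives a String. This is a minimal sketch: `html_string` is a made-up name, and a StringIO stands in for the object open-uri returns so the example runs offline.

```ruby
require 'stringio'

# Hypothetical helper: returns the whole body as a String, whether it is
# handed an IO-like object (what open-uri returns) or a String already.
def html_string(source)
  source.respond_to?(:read) ? source.read : source.to_s
end

# Stand-in for an open-uri response, used here instead of a real request.
fake_response = StringIO.new('<html><body><p>hi</p></body></html>')
puts html_string(fake_response)

# With the real libraries (network access and the nokogiri gem assumed):
#   doc = Nokogiri::HTML(html_string(open('http://www.google.com/search?q=tenderlove')))
```

Whether this is needed at all probably depends on the Nokogiri and libxml2 versions in use, since the behaviour described here was seen on one particular OS X 10.5 setup.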

      

Note I see this beautiful warning when I use irb to play:

HI. You are using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for the ultimate HTML/XML parsing experience, you upgrade libxml2 and reinstall nokogiri. If you like using libxml2 version 2.6.16 but dislike this warning, define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requiring nokogiri.

But I am not in the mood to deal with that horror, and the various experts give contradictory advice about installing a fixed libxml in /usr/local, blah blah. The linked post has a lot of explanation, but then another *nix guru attacks the very concept with some sound warnings and issues. So I say no.

Why am I writing this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 ships with older libraries, and Nokogiri might have problems with them.

Question

Do other OS X 10.5 users have this problem with Nokogiri?

+1








