Nokogiri: capturing only the visible inner_text

Is there a better way to extract the visible text of a web page using Nokogiri? I am currently using the inner_text method, but it treats a lot of JavaScript as visible text. The only text I want to capture is the text that is visible on the screen.

For example, in IRB, if I do the following in Ruby 1.9.2-p290:

require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)

      

If I search for the word "function", I see that it appears 20 times in the list. However, if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX, the word "function" does not appear anywhere in the visible text.

Can JavaScript be ignored or is there a better way to do it?



2 answers


You may try:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))

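# Walk every node and print the text nodes that are not whitespace-only.
# \A and \z anchor the whole string, so text containing a blank line is not skipped.
doc.traverse do |node|
  puts node.text if node.text? && node.text !~ /\A\s*\z/
end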

      



I haven't done anything with Nokogiri, but I believe this should find and print every text node in the document that is not whitespace-only. It at least seems to ignore JavaScript, and all the text I checked was visible on the page (although some of it sits inside dropdown menus).



Yes, you can ignore JavaScript, and there is a better way: you are not taking advantage of the power of Nokogiri.

Rather than hand you a direct answer, I think learning to "fish" with Nokogiri will help you more.

In a document like:

<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>

      

I recommend starting with CSS accessors because they are usually more familiar to people:



  • doc = Nokogiri::HTML(var_containing_html) parses the HTML and returns the DOM in doc.
  • doc.at('p') returns a Node that points to the first <p> node.
  • doc.search('p') returns a NodeSet of all matching nodes, which acts like an array; in this case, all <p> nodes.
  • doc.at('p').text returns the text inside that node.
  • doc.search('p').map{ |n| n.text } returns the text of all <p> nodes as an array of strings.

As your document gets more complex, you need to drill down further. Sometimes you can do it with CSS accessors like 'body p' or something similar, and sometimes you need to use XPath. I won't go into them here, but there are great tutorials and references out there.

The Nokogiri tutorials are very good. Walk through them and they will reveal everything you need to know.

Additionally, there are many answers on Stack Overflow discussing this issue. Check out the "Related" links to the right of the page.
