Retrieving all nodes from an HTML document in Ruby using Nokogiri
I am trying to get all nodes from an HTML document using Nokogiri. I'm open to using another library if that would be easier.
I have this HTML:
<html>
<body>
<h1>Header1</h1>
<h2>Header22</h2>
<ul>
<li>Li1</li>
<ul>
<li>Li1</li>
<li>Li2</li>
</ul>
</ul>
</body>
</html>
String version:
string_page = "<html><body><h1>Header1</h1><h2>Header22</h2><ul><li>Li1</li><ul><li>Li1</li><li>Li2</li></ul></ul></body></html>"
I created an object:
page = Nokogiri.HTML(string_page)
And I tried to traverse it:
result = []
page.traverse { |node| result << node.name unless node.name == "text" }
=> ["html", "h1", "h2", "li", "li", "li", "ul", "ul", "body", "html", "document"]
But I don't like the order of the items. I need to have an array with the same order as they appear:
["html", "body", "h1", "h2", "ul", "li", "ul", "li", "li" ]
I don't need any closing tags.
Does anyone have a better solution for this?
If you want to see the nodes in document order, use a selector such as '*', which means "everything" starting at the root node (search treats it as a CSS selector, equivalent to the XPath '//*'):
require 'nokogiri'
string_page = "<html><body><h1>Header1</h1></body></html>"
doc = Nokogiri::HTML(string_page)
doc.search('*').map(&:name)
# => ["html", "body", "h1"]
But we usually don't recommend iterating over all nodes, and it is rarely what you actually want. Usually you want all nodes of a specific type, or individual nodes, so look for landmarks in the markup and go from there:
doc.at('h1').text # => "Header1"
or
html = "<html><body><table><tr><td>cell1</td></tr><tr><td>cell2</td></tr></table></body></html>"
doc = Nokogiri::HTML(html)
doc.search('table tr td').map(&:text) # => ["cell1", "cell2"]
or
doc.search('tr td').map(&:text) # => ["cell1", "cell2"]
or
doc.search('td').map(&:text) # => ["cell1", "cell2"]
Note: there is no reason to use a longer example HTML string; it just clutters up the question, so use a minimal example.
See "How to avoid joining all text from Nodes when scraping."