Ruby Nokogiri text search doesn't work with <br> tags and others
I am using the Nokogiri gem in Ruby and am running into some problems.
I want to scrape addresses from web pages, and there is no set format for how the addresses are displayed.
I have a list of postal codes, and I want my Ruby script to return the node that contains a postal code so that I can find the rest of the address.
This is what I have in Ruby, with some example HTML content:
require 'nokogiri'
require 'open-uri'
content1 = '
<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect
This returns []
I understand that the example above is an odd way to display an address in HTML, but it's the simplest way to show the problem I ran into. Here's another value of content1 that doesn't return anything:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street<br>
North Town<br>
North County<br>
N21 4DD
</div>
</div>'
I know Nokogiri might run into problems with the example above because the <br> tags are supposed to be </br>, but this is quite common on websites.
This example does work:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
Can someone explain why the node is not found in the first two content1 examples above, and how I can fix it?
I am not looking for a specialized solution that finds the postal code in the example content above; it is for demonstration purposes only. The postal code (and address) could be anywhere in the HTML: body, p, div, td, span, li, etc.
Thanks.
Let's take a look at the first example and how Nokogiri translates your "CSS" (which is not valid CSS, by the way):
Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]
OK, so the problem is that contains() only looks at the first node that child::text() returns, and here that is the whitespace-only text node before the "Our Address:" div.
doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0
No match = not good.
Now try the jQuery-style :contains pseudo-class:
Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4
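This is indeed correct, but perhaps not what you expected: the string value of every ancestor (html, body, and both outer div elements) includes the postal code too.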
Try another way:
doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1
It looks like this is what you are looking for: just the div that has the string in one of its child text nodes.