Ruby Nokogiri text search doesn't work with Br tags and others

Question

I am using the Nokogiri gem in Ruby and am running into some problems.

I want to scrape addresses from web pages, and there is no set format for how the addresses are displayed.

I have a list of postal codes, and I want my Ruby script to return the node that contains a given postal code so that I can find the rest of the address.

This is what I have in Ruby, with some example HTML content:

require 'nokogiri'
require 'open-uri'

content1 = '
<div>
    <div>
        <div>Our Address:</div>
        1 North Street
        North Town
        North County
        N21 4DD
    </div>
</div>'

doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect

This returns []

I understand that the above example is a weird way of displaying an address in HTML, but it's the simplest way to show the problems I had. Here's another content1 value that doesn't return anything:

content1 = '
<div>
    <div>Our Address:</div>
    <div>
        1 North Street<br>
        North Town<br>
        North County<br>
        N21 4DD
    </div>
</div>'

I know Nokogiri might have a problem with the above because the <br> tags are supposed to be <br />, but this is quite common on websites.

This is an example that does work:

content1 = '
<div>
    <div>Our Address:</div>
    <div>
        1 North Street
        North Town
        North County
        N21 4DD
    </div>
</div>'

Can someone explain why the node is not found in the first two content1 examples above, and how I can fix it?

I am not looking for a specialized solution that finds the postal code in the example content1 above; it is for demonstration purposes only. The postal code (address) can be anywhere in the HTML: body, p, div, td, span, li, etc.

Thanks.

+3

ruby nokogiri

Emb 10 jul. 17 at 18:19

2 answers

whodini9 · Answer 1 · 2017-07-10T18:48:00+0000

Using Xpath:

doc.xpath('.//div[contains(.,"N21 4DD")]')

This still returns two nodes because there is a nested div: both the outer div and the middle div contain the postal code. I'm not sure whether there is a way to get the middle div without the "Our Address" div, because it sits inside the same node.

pguardiario · Answer 2 · 2017-08-06T22:35:12+0000

Let's take a look at the first example and how Nokogiri translates your "CSS" (which is not valid CSS, by the way):

Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]

OK, so the problem is that contains() only looks at the first node returned by child::text(), which here is the whitespace-only text node before the "Our Address:" div.

doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0

No match = not good.

Now try the jQuery-style :contains pseudo-class:

Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4

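This is indeed correct, but perhaps not what you expected: every ancestor of the text (html, body, and both enclosing divs) contains the string, so all four match.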

Try another way:

doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1

It looks like this is what you are looking for: just the div that has the string in one of its child text nodes.
