Ruby Nokogiri text search doesn't work with <br> tags and others

I am using the Nokogiri gem in Ruby and am facing some problems.

I want to scrape addresses from web pages, and there is no set format for how the addresses are displayed.

I have a list of postal codes, and I want my Ruby script to return the node containing a postal code so that I can find the rest of the address.

This is what I have in Ruby, with some example HTML content:

require 'nokogiri'
require 'open-uri'

content1 = '
<div>
    <div>
        <div>Our Address:</div>
        1 North Street
        North Town
        North County
        N21 4DD
    </div>
</div>'

doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect

      

This returns []

I understand that the above example is a weird way of displaying an address in HTML, but it's the simplest way to show the problem I'm having. Here's another value for content1 that doesn't return anything:

content1 = '
<div>
    <div>Our Address:</div>
    <div>
        1 North Street<br>
        North Town<br>
        North County<br>
        N21 4DD
    </div>
</div>'

      

I know Nokogiri might have trouble with the above because the <br> tags are supposed to be </br>, but this is quite common on websites.

This is an example that works:

content1 = '
<div>
    <div>Our Address:</div>
    <div>
        1 North Street
        North Town
        North County
        N21 4DD
    </div>
</div>'

      

Can someone explain why the node is not found in the first two content1 examples above, and how I can fix it?

I am not looking for a specialized solution that finds the postal code in the example content above; this is for demonstration purposes only. The postal code (address) can be anywhere in the HTML: body, p, div, td, span, li, etc.

Thanks.



2 answers


Using XPath:

doc.xpath('.//div[contains(.,"N21 4DD")]')



This still returns two nodes because there is a nested div. I'm not sure if there is a way to get the middle div without the "Our Address" div because it is in the same node.



Let's take a look at the first example and how Nokogiri translates your "CSS" (which is not valid CSS, by the way):

Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]

      

OK, so the problem is that contains() converts the child::text() node-set to a string using only its first node, so it effectively tests just the first text node, which is the whitespace before the "Our Address" div.

doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0

      

No match = not good.

Now try the jQuery-style :contains pseudo-selector:



Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4

      

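This is indeed correct, but perhaps not what you expected: the string value of an element includes the text of all its descendants, so every ancestor matches too.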

Try another way:

doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1

      

It looks like this is what you are looking for: just the div that has the string in one of its child text nodes.
