Parsing terrible HTML: How to recognize boundaries using xpath?

This will sound like a joke, but I promise this is real life. There is a site on the internet that you all have used that doesn't believe in css classes. Everything is defined directly in the style attribute of the element. It's horrible.

My problem is that it also makes it difficult to parse the html. The structure I have to work with looks something like this:

<td>
    <a name="<random_string>"></a>
    <div style="generic-style, used by other elements">
        <div style="similarly generic style">{some_stuff}</div>
    </div>
    <a name="<random_string>"></a>
    ...
</td>

      

Basically, I have a tags that form the boundaries of the divs, and the only identifying information on them is a random string as their name. I don't really care about the anchor tags themselves, but I would like to grab the divs in between them using xpath.

I've looked at sibling queries, but they don't seem to work for interleaved boundaries. I also looked at the Kayessian method for xpath queries, which (besides having an awesome name) seems to be perfectly fine for grabbing a specific div, but not for grabbing all divs between anchor tags.

Any thoughts on how I can grab the divs here?

+3




2 answers


If //td/div[../a[@name]] works for you, then the following should also work:

//td[a/@name]/div

      

That way, you don't have to go back and forth, or rather down and up. For a more specific selector, you can try this:



//td/div[preceding-sibling::*[1][self::a/@name]][following-sibling::*[1][self::a/@name]]

      

This XPath selects a div element that has all of the following properties:

  • td/div: it is a child of a <td> element

  • [preceding-sibling::*[1][self::a/@name]]: it is directly preceded by an <a> element that has a name attribute

  • [following-sibling::*[1][self::a/@name]]: it is directly followed by an <a> element that has a name attribute
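For anyone who wants to see those sibling conditions spelled out, here is a minimal sketch that checks them by hand in Python. It uses only the standard library, whose ElementTree XPath subset does not support the preceding-sibling and following-sibling axes, so the check is written as a loop instead of being run as XPath. The sample markup, the anchor names and the helper names (is_named_anchor, bounded_divs) are made up for illustration, and the sample is well-formed XML, which a real page probably would not be; an HTML-tolerant parser would likely be needed there.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure in the question.
SAMPLE = """
<table><tr><td>
    <a name="r1"></a>
    <div style="x"><div style="y">first</div></div>
    <a name="r2"></a>
    <div style="x"><div style="y">second</div></div>
    <a name="r3"></a>
    <div style="x"><div style="y">trailing, no closing anchor</div></div>
</td></tr></table>
"""


def is_named_anchor(elem):
    """True if elem is an <a> element carrying a name attribute."""
    return elem is not None and elem.tag == "a" and elem.get("name") is not None


def bounded_divs(root):
    """Return td/div elements whose nearest element siblings on both
    sides are named anchors (the same idea as the predicates above)."""
    found = []
    for td in root.iter("td"):
        children = list(td)  # direct element children, in document order
        for i, child in enumerate(children):
            if child.tag != "div":
                continue
            prev_sib = children[i - 1] if i > 0 else None
            next_sib = children[i + 1] if i + 1 < len(children) else None
            if is_named_anchor(prev_sib) and is_named_anchor(next_sib):
                found.append(child)
    return found


root = ET.fromstring(SAMPLE)
# Each outer div wraps one inner div; read the inner div's text.
texts = [div[0].text for div in bounded_divs(root)]
print(texts)  # ['first', 'second']
```

The last div is skipped because nothing follows it, which is the difference between this stricter selector and the broader //td[a/@name]/div.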

+1




I figured it out! It turns out that xpath lets you assert on a relative path inside a predicate. I'm not sure if this behavior is desirable, but it works in this case! Here is the xpath:

//td/div[../a[@name]]

      



Nice and clean. ../a[@name] basically just says:

Go up a level and make sure there is an <a> element with a name attribute at that level of the hierarchy.
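A rough sketch of what that predicate amounts to, written as plain Python over the standard library's ElementTree (its XPath subset cannot run the predicate directly, so this only mimics it). The sample markup and the function name divs_with_named_anchor_sibling are assumptions for illustration. Note that the check only asks whether the parent has some named anchor, so a div that sits outside the anchors in the same td is picked up as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one cell with anchors, one cell without.
SAMPLE = """
<table><tr>
  <td>
    <div>before any anchor</div>
    <a name="r1"></a>
    <div>between anchors</div>
    <a name="r2"></a>
  </td>
  <td>
    <div>cell without anchors</div>
  </td>
</tr></table>
"""


def divs_with_named_anchor_sibling(root):
    """Mimic //td/div[../a[@name]]: keep every td/div whose parent td
    has at least one <a> child with a name attribute."""
    matches = []
    for td in root.iter("td"):
        has_named_anchor = any(
            a.get("name") is not None for a in td.findall("a")
        )
        if has_named_anchor:
            matches.extend(td.findall("div"))
    return matches


root = ET.fromstring(SAMPLE)
loose = [d.text for d in divs_with_named_anchor_sibling(root)]
print(loose)  # ['before any anchor', 'between anchors']
```

If every div in the cell really does sit between two anchors, as in the question's markup, this looser check is enough.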

+1








