Html Agility Pack cannot find element using xpath, but it works fine with WebDriver

I've already seen these questions (1 and 2), but they didn't work for me.

I am creating XPath expressions that work fine with WebDriver, but when I try to select a node using HtmlAgilityPack, it doesn't work in some cases.

I am using the latest HtmlAgilityPack 1.4.9

For example, Here is a page.

(screenshot of the page, with the target element highlighted in red)

The XPath for the object highlighted in red is:

//section[@id='main-content']/div[2]/div/div/div/div/div/p[1]/a

Similarly, for another object shown in the picture:

(screenshot of the second object)

This is the XPath:

//section[@id='main-content']/div[2]/div/div/div/div/div/ul/li[2]/a

Both of these XPaths work fine with WebDriver, but Html Agility Pack cannot find any node with them.

For the first one I tried

HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("p")

That made it start working, but why is it required? And there was no luck with the second one.

Is there a list of specific tags that need to be removed from ElementsFlags? And does removing them have any consequences?
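For reference, here is a minimal, self-contained sketch of what I'm doing (assuming HAP 1.4.x, where "p" and "option" are registered with HtmlElementFlag.Empty; the sample HTML is made up):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // In HAP 1.4.x, "p" is registered in HtmlNode.ElementsFlags with
        // HtmlElementFlag.Empty, so <p>...</p> is parsed as a self-closing
        // tag and its content is NOT nested under it. Removing the flag
        // BEFORE parsing restores the nesting that an XPath like .../p[1]/a expects.
        HtmlNode.ElementsFlags.Remove("p");
        HtmlNode.ElementsFlags.Remove("option"); // "option" carries the same flag

        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p><a href='#'>link</a></p></div>");

        // With the flag removed, the <a> is found as a child of <p>.
        var node = doc.DocumentNode.SelectSingleNode("//div/p/a");
        Console.WriteLine(node != null);
    }
}
```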

My requirement is to fetch nodes by XPath with the Html Agility Pack, the same way WebDriver does.

Any help would be greatly appreciated.

EDIT 1:

The XPath that we get from HAP is also long, like div/div/div/div/div/a. Here's the VB.Net code for the example given by Simon:

Dim selectedNode As HtmlAgilityPack.HtmlNode = htmlAgilityDoc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a")

Dim xpathValue As String = selectedNode.XPath


The xpathValue we get from HAP is:

/html[1]/body[1]/section[1]/div[2]/div[1]/div[1]/div[1]/div[1]/div[1]/a[1]


1 answer


WebDriver always relies on the target browser to run XPath. Technically, it is just a fancy bridge to the browser (whether that's Firefox or Chrome; note that IE prior to 11 doesn't support XPath).

Unfortunately, the DOM structure (elements and attributes) in the browser's memory is not the same as the DOM you probably fed to the Html Agility Pack. It could be the same if you loaded HAP with the DOM content taken from the browser's memory (the equivalent of document.OuterHtml). In general that is not the case, because developers use HAP to scrape sites without a browser: they feed it from a network stream (an HTTP GET request) or a raw file.

This problem is easy to demonstrate. For example, if you create a file that contains only this:

<table><tr><td>hello world</td></tr></table>


(no HTML tag, no BODY tag; this is really an invalid HTML file)

With HAP, you can load it like this:

HtmlDocument doc = new HtmlDocument();
doc.Load(myFile);


And the HAP structure will look like this:

+table
 +tr
  +td
   'hello world'


HAP is not a browser, it is a parser. It does not know the HTML specification; it just knows how to parse a bunch of tags and build a DOM from them. It does not know, for example, that a document must start with HTML and contain a BODY, or that a TABLE element always has a TBODY child when rendered by a browser.

However, if you open this file in Chrome, inspect it, and ask for the XPath of the TD element, it reports this:

/html/body/table/tbody/tr/td


because Chrome just added those elements on its own ... As you can see, the two systems are not the same.
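To see HAP's side of this, here is a minimal sketch: the printed XPath is what HAP computes for its own DOM, with no browser fix-ups involved (assuming HAP's default ElementsFlags):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Parse the same minimal fragment Chrome was given.
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>hello world</td></tr></table>");

        // Ask HAP for the computed XPath of the TD node.
        var td = doc.DocumentNode.SelectSingleNode("//td");
        Console.WriteLine(td.XPath);
        // HAP reports something like /table[1]/tr[1]/td[1]: no html, body or
        // tbody, while Chrome reported /html/body/table/tbody/tr/td for the
        // very same input.
    }
}
```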

Note that if you have id attributes available in the original HTML, things get better. For example, with the following HTML:

<table><tr><td id='hw'>hello world</td></tr></table>


Chrome will report the following XPath (it will try to make the most of id attributes):



//*[@id="hw"]


which can also be used in HAP. But it doesn't work every time. For example, with the following HTML:

<table id='mytable'><tr><td>hello world</td></tr></table>


Chrome will now create this XPath for the TD:

//*[@id="mytable"]/tbody/tr/td


As you can see, this again cannot be used in HAP, because of that supposed TBODY.

So in the end, you can't just blindly use browser-generated XPath in contexts other than those browsers. In other contexts, you will need to find other discriminators.

In fact, I personally think this is kind of a good thing, because it will make your XPath more resilient to change. But you'll have to think :-)

Now back to your case :)

The following C# console example should work fine:

static void Main(string[] args)
{
    var web = new HtmlWeb();
    var doc = web.Load("http://www2.epa.gov/languages/traditional-chinese");
    var node = doc.DocumentNode.SelectSingleNode("//section[@id='main-content']//div[@class='pane-content']//a");
    Console.WriteLine(node.OuterHtml); // displays <a href="http://www.oehha.ca.gov/fish/pdf/59329_CHINESE.pdf">...etc...</a>
}


If you look at the structure of the stream or file (or even at what the browser displays, but be careful to avoid TBODY ...), the easiest approach is to:

  • find an id attribute (like the browser does), and/or
  • find unique child elements or attributes below it, recursively or not, and
  • avoid overly precise XPath. Things like p/p/p/div/a/div/whatever are bad.

So here, after the element with the main-content id attribute, we just look (recursively, with //) for a DIV that has a specific class, and then we look (recursively again) for the first A child available.

This XPath should work in both WebDriver and HAP.

Note that this XPath also works: //div[@class='pane-content']//a, but it's a little too vague for my taste. Anchoring on id attributes is often a good idea.
