Get Specific Tables with Html Flexibility Package

I am having trouble getting a specific table using the Agility Pack. I can't change the actual HTML either, so I can't use other IDs or classes or whatever.

Can someone show me how I will access each individual table from the following?

<table class="newTable">
      //table 1 contents
    <table border="0" cellpadding="3" cellspacing="2" width="100%">
         //table 1 - A contents
    </table>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="newTable">
     //table 2 contents
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
        //table 2 - A contents
    </table>
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
       //table 2 - B contents
    </table>
    <table width="100%" cellspacing="2" cellpadding="0">
       //table 2 - C contents
    </table>
</table>
<table>
     //table 3 contents
</table>

      

Right now if I had to call the next one

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
foreach (var cell in table.SelectNodes("//tr/td"))
{
     string someVariable = cell.InnerText
}

      

I would go through everything. I want to be able to access tables in different ways to map to where I store data.

I tried looking at something like

doc.DocumentNode.SelectNodes("//table[1]");

but using the index doesn't seem to work, when I try to point a table with it, it is still readable in all tables or missing.

The same goes for this, it either doesn't work at all or it gets everything.

foreach (var cell in table.SelectNodes("//table").Skip(some_number))
{
     string someVariable = cell.InnerText
}

      

I am using Agility Pack 1.4.9 NuGet package

EDIT:

My attempt to get ONLY Table 1 - Contents. They both give null or endcodingfound exceptions.

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table/tr/td/table[1]");

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]/tr/td/table[1]");

+3


source to share


1 answer


Error with the second call, "// tr / td" will fall back to the root element. Your indexer is the correct solution for the first part of your problem, the second can be fixed by specifying that you want to move from where you are:

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td")) // **notice the .**
{
     string someVariable = cell.InnerText
}

      

Not sure what else is going on, but expanding the test table to this code , the following only works on my test. This may mean that you need to share a little more context.

This is the document I used for the tests:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <table class="newTable">
        <tr>
            <td>
                <table border="0" cellpadding="3" cellspacing="2" width="100%">
                    <tr><td>
                        //table 1 - A contents
                    </td></tr>
                </table>
            </td>
        </tr>

    </table>
    <table border="0" cellpadding="0" cellspacing="0" class="newTable">
        <tr>
            <td>
                //table 2 contents
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - A contents
                        </td>
                    </tr>
                </table>
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - B contents
                        </td>
                    </tr>
                </table>
                <table width="100%" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - C contents
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
    <table>
        <tr>
            <td>
                //table 3 contents
            </td>
        </tr>
    </table>
</body>
</html>

      

And this code to extract the values ​​you are using:



HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

var node1A = doc.DocumentNode.SelectSingleNode("//table[1]//table[1]");
string content1A = node1A.InnerText;
Console.WriteLine(content1A);

var node2C = doc.DocumentNode.SelectSingleNode("//table[2]//table[3]");
string content2C = node2C.InnerText;
Console.WriteLine(content2C);

      

Shows:

enter image description here

Update

Ok I took your actual HTML and I also got NullReference. There must be something very confusing about the Agility Pack, not sure why. Some experimentation with the Linq API seems to work, but I hope this might be an alternative for you:

var table = doc.DocumentNode.DescendantsAndSelf("table").Skip(1).First().Descendants("table").First();
var tds   = table.Descendants("td");

      

+5


source







All Articles