Node inner text ignoring children inner text

Sorry if this is too basic a question, but this is my first day with html-agility-pack and I can't figure out how to select the inner text that belongs directly to a node while ignoring the inner text of its child nodes.

For example:

<div id="div1">
   <div class="h1"> this needs to be selected
   <small> and not this</small>
   </div>
</div>

      

I am currently trying this

HtmlDocument page = new HtmlWeb().Load(url);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']");
string selText = s.InnerText;

      

which returns all of the text (that is, both "this needs to be selected" and "and not this"). Any suggestions?



2 answers


You can append /text() to the XPath to get all text nodes immediately below a specific tag. If you only want the first one, add [1] to it:



HtmlDocument page = new HtmlDocument();
page.LoadHtml(text); // text is a string holding the HTML markup
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']/text()[1]");
string selText = s.InnerText;

      



A div can have multiple text nodes if there is text both before and after its children. As I also pointed out here, I think the best way to get all of the text that belongs directly to a node is to do something like:



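HtmlDocument page = new HtmlWeb().Load(url);
// Select every text node that is a direct child of the target div.
var nodes = page.DocumentNode.SelectNodes("//div[@id='div1']//div[@class='h1']/text()");

// Concatenate the text nodes (StringBuilder needs "using System.Text;").
StringBuilder sb = new StringBuilder();
foreach (var node in nodes)
{
   sb.Append(node.InnerText);
}

string content = sb.ToString();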
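
One thing to watch out for: as far as I know, SelectSingleNode and SelectNodes return null rather than an empty result when nothing matches, so the snippets above will throw if the XPath finds nothing. Here is a minimal sketch of a null-safe variant. Instead of /text() it walks ChildNodes and keeps only the text nodes, which should give the same direct-children-only result. DirectText.Get is a hypothetical helper name of my own, not part of Html Agility Pack, and the trimming and entity decoding are extras I am assuming you want.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DirectText
{
    // Hypothetical helper (not part of Html Agility Pack): returns the text that
    // belongs directly to the first node matching xpath, skipping child elements.
    public static string Get(HtmlNode root, string xpath)
    {
        HtmlNode node = root.SelectSingleNode(xpath);
        if (node == null)
        {
            return string.Empty; // nothing matched the XPath
        }

        // ChildNodes holds only direct children; keep the text nodes among them,
        // decode entities such as &amp; and drop whitespace-only fragments.
        var parts = node.ChildNodes
            .Where(child => child.NodeType == HtmlNodeType.Text)
            .Select(child => HtmlEntity.DeEntitize(child.InnerText).Trim())
            .Where(text => text.Length > 0);

        return string.Join(" ", parts);
    }
}

public static class Program
{
    public static void Main()
    {
        var page = new HtmlDocument();
        page.LoadHtml(
            "<div id=\"div1\"><div class=\"h1\"> this needs to be selected" +
            "<small> and not this</small></div></div>");

        string text = DirectText.Get(
            page.DocumentNode, "//div[@id='div1']//div[@class='h1']");
        Console.WriteLine(text); // prints "this needs to be selected"
    }
}
```

Joining the fragments with a single space is a judgment call; use string.Concat instead if you need the text exactly as it appears in the markup.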

      
