How to remove href tag from CDATA
I have the following CDATA section inside an XML document:
<![CDATA[ <p xmlns="">Refer to the below: <br/>
</p>
<table xmlns:abc="http://google.com pic.xsd" cellspacing="1" class="c" type="custom" width="100%">
<tbody>
<tr xmlns="">
<th style="text-align: left">Basic offers...</th>
</tr>
<tr xmlns="">
<td style="text-align: left">Faster network</td>
<td style="text-align: left">
<ul>
<li>Session</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">capabilities</td>
<td style="text-align: left">
<ul>
<li>Navigation,</li>
<li>message, and</li>
<li>contacts</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">Data</td>
<td style="text-align: left">
<p>Here visit google for more info <a href="http://www.google.com" target="_blank"><font color="#0033cc">www.google.com</font></a>.</p>
<p>Remove this href tag <a href="/abc/def/{T}/t/1" target="_blank">Information</a> remove the tag.</p>
</td>
</tr>
</tbody>
</table>
<p xmlns=""><br/>
</p>
]]>
I want to scan for href="/abc/def and remove every anchor tag whose href starts with /abc/def. In the example above, that means removing the tag and leaving only the text "Information" that was inside it. There can be more than one such tag with "/abc/def... in its href. I am using C# for this application. Can someone please tell me how this can be done? Should I use a regex, or is there a way to do this using the XML itself?
This is the regex I'm trying:
"<a href=\"/abc/def/.*></a>"
I want to keep the inner text of the anchor and remove only the tags themselves, but the regex above doesn't work.
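Using HtmlAgilityPack:
// Requires: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Collect every <a> whose href starts with /abc/def. ToArray() takes a snapshot
// so the tree can be modified safely while looping.
var nodes = doc.DocumentNode
    .Descendants("a")
    .Where(n => n.Attributes.Any(a => a.Name == "href" && a.Value.StartsWith("/abc/def")))
    .ToArray();
foreach (var node in nodes)
{
    // keepGrandChildren = true: drop the <a> tag itself but keep its inner content.
    node.ParentNode.RemoveChild(node, true);
}
var newHtml = doc.DocumentNode.InnerHtml;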
I used HtmlAgilityPack for this task. From its description:
It is a .NET code library that allows you to parse "offline" HTML files. The parser is very tolerant of malformed "real world" HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
The task itself is quite simple: select the nodes using XPath and then remove them. All that remains is to get the resulting HTML:
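// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(xml);
// SelectNodes returns null when nothing matches, so guard before iterating.
var anchors = doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/abc/def')]");
if (anchors != null)
{
    foreach (var anchor in anchors.ToList())
        anchor.Remove(); // removes the whole <a> element, including its text
}
var result = doc.DocumentNode.OuterHtml;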
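Note that anchor.Remove() deletes each matching anchor together with its inner text. To keep the text and drop only the tag, call anchor.ParentNode.RemoveChild(anchor, true) instead.
EDIT:
If you only want to remove the href attribute, change the line anchor.Remove() to anchor.Attributes["href"].Remove();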
If the HTML is well-formed XML (which it appears to be at first glance), you can load the CDATA node's text into a new XML document, modify that XML as needed, and then replace the original CDATA node's text with the XML text of your modified document.
Because CDATA is by definition not parsed as markup in the original XML document, you need a secondary document for this.
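A minimal sketch of that secondary-document approach, using only System.Xml.Linq. It assumes the CDATA content really is well-formed XML apart from having several top-level elements, which is why it is wrapped in a temporary root. The names CDataAnchorStripper, StripAnchors, and the wrapper element are hypothetical and chosen for illustration only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class CDataAnchorStripper
{
    // For every CDATA section in outerXml, replaces each <a> whose href starts
    // with hrefPrefix by that anchor's own child nodes, then writes the text back.
    public static string StripAnchors(string outerXml, string hrefPrefix)
    {
        var outer = XDocument.Parse(outerXml, LoadOptions.PreserveWhitespace);

        foreach (var cdata in outer.DescendantNodes().OfType<XCData>().ToList())
        {
            // The fragment may have several top-level elements, so wrap it in a
            // temporary root (hypothetical name) to make it parseable.
            var inner = XElement.Parse(
                "<wrapper>" + cdata.Value + "</wrapper>",
                LoadOptions.PreserveWhitespace);

            // Snapshot the matches before modifying the tree.
            var anchors = inner.Descendants()
                .Where(e => e.Name.LocalName == "a" &&
                            ((string)e.Attribute("href") ?? "")
                                .StartsWith(hrefPrefix, StringComparison.Ordinal))
                .ToList();

            foreach (var a in anchors)
                a.ReplaceWith(a.Nodes()); // keep the inner content, drop the tag

            // Serialize the wrapper's children only, so the temporary root is not written.
            cdata.Value = string.Concat(
                inner.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
        }

        return outer.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling CDataAnchorStripper.StripAnchors(xmlString, "/abc/def") would return the document with the matching anchors unwrapped inside each CDATA section. If the CDATA content is not well-formed XML, XElement.Parse throws, and an HTML parser such as HtmlAgilityPack is the safer choice.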
Note: I do not recommend running this regex over an entire XML string; parsing markup with regular expressions is widely regarded as bad practice. The following regex can and should instead be run on specific nodes in the document once it has been traversed properly. The solution was posted as a single regex replacement across the whole xmlString because that is what the user asked for, and they were having trouble adapting the regex to their particular situation. I wrote it character by character to match their intended use as closely as possible.
To remove all anchor tags whose href URL starts with /abc/def/, you can use this regex:
result = Regex.Replace(xmlString, @"<a href=""/abc/def/[^>]*>(.*?)</a>", "$1");
Following up on the comments below:
According to MSDN:
Within the specified input string, replaces all strings that match the specified regular expression with the specified replacement.
The replacement is applied to every match, not just the first. If other occurrences are not replaced, something in them differs so that the regex does not match.
For example, if in some cases there are extra spaces between a and href, or the target attribute comes before the href attribute, you will need a less specific pattern:
result = Regex.Replace(str, @"<a\s[^>]*href=""/OST/OSTdisplay/[^>]*>(.*?)</a>", "$1");
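As a rough, self-contained illustration of the points above, the sketch below applies a more tolerant pattern to a single fragment of text (for example, one node's content) rather than to the whole document. The class name AnchorRegex and the method Unwrap are hypothetical, and the /abc/def/ prefix is taken from the question. The pattern allows other attributes before href and uses a lazy capture so that several anchors on one line are handled separately.

```csharp
using System.Text.RegularExpressions;

static class AnchorRegex
{
    // <a ... href="/abc/def/..." ...>inner</a>
    // [^>]*? keeps the match inside a single opening tag; (.*?) is lazy so the
    // capture stops at the first closing </a>. Singleline lets the inner content
    // span line breaks; IgnoreCase tolerates <A HREF=...>.
    private static readonly Regex Pattern = new Regex(
        @"<a\s[^>]*?href\s*=\s*""/abc/def/[^""]*""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every matching anchor with its inner content ($1).
    public static string Unwrap(string fragment)
    {
        return Pattern.Replace(fragment, "$1");
    }
}
```

This is still a regex over markup, so it shares the limitations mentioned in the note above: nested anchors, single-quoted attribute values, or unusual formatting would not be handled.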