Easiest way to remove invalid characters from an XML file?

I have an XML file with invalid characters. I searched the internet and found no other way than to read the file as a text file and replace the invalid characters one by one.

Can someone please tell me the easiest way to remove invalid characters from an XML file?

Example XML stream:

<Year>where 12 > 13 occures </Year>

      



2 answers


I would try HtmlAgilityPack. At least it is better than trying to parse the file by hand.

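// Requires the HtmlAgilityPack NuGet package.
using System;
using System.IO;
using System.Xml;

// Parse the input with HtmlAgilityPack's lenient HTML parser.
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml("<Year>where 12 > 13 occures </Year>");

using (StringWriter wr = new StringWriter())
{
   using (XmlWriter xmlWriter = XmlWriter.Create(wr,
           new XmlWriterSettings() { OmitXmlDeclaration = true }))
   {
       // Re-serialize through an XmlWriter so special characters are escaped.
       hdoc.Save(xmlWriter);
   }
   // Print after the XmlWriter is disposed, so its output is fully flushed.
   Console.WriteLine(wr.ToString());
}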

      



This outputs:

<year>where 12 &gt; 13 occures </year>

      



Start by thinking about the question in different ways. Your problem is that the input is not valid XML. So you really want to remove the invalid characters from the non-XML file. This may sound pedantic, but it immediately indicates that XML processing tools won't help you, because your input is not XML.

Fixing the problem at the source is always better than trying to repair the damage later. But if you are going to attempt a repair, start by determining exactly which data errors you want to recover from and how you are going to repair them. It's also a good idea to be clear about what constraints apply to the solution: for example, does it matter if your repair accidentally changes the content of any comments or CDATA sections?



Once you have defined your repair strategy, for example "replace any & with &amp; unless it is immediately followed by either #nn or #xnn or a name, followed by a ';'", coding it becomes quite simple.
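As a rough illustration, the example rule above could be sketched with a regular expression. This is only a minimal sketch under assumptions: the class name XmlRepair and the method EscapeBareAmpersands are hypothetical, entity names are approximated as ASCII names, and the text is treated as one flat string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class XmlRepair
{
    // Matches an '&' that does NOT start "#nn;", "#xnn;" or "name;".
    // Assumption: names are ASCII only, which is narrower than the XML spec allows.
    static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][A-Za-z0-9_.-]*);)");

    // Replaces every bare '&' with "&amp;" and leaves existing references alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }

    static void Main()
    {
        string broken = "<Year>R&D, 12 &gt; 13, &#38; &bogus</Year>";
        Console.WriteLine(EscapeBareAmpersands(broken));
        // prints: <Year>R&amp;D, 12 &gt; 13, &#38; &amp;bogus</Year>
    }
}
```

Note that this sketch also rewrites ampersands inside comments and CDATA sections, and it leaves alone any "name;" reference even if that entity is never declared, so whether it is acceptable depends on the constraints you settled on earlier.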
