Using C#, how do I close malformed XML tags?

Background

I inherited a load of XML files that contain a tag with two opening tags in sequence, instead of an opening and a closing tag. I need to loop through all these files and fix the malformed XML.

Here's a simplified example of the bad XML; it is the same tag in every file:

<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>

      

Note that the <lunch> tag has no closing tag. This is consistent across all files.

Question

Would it be best to use a C# regex to fix this, and if so, how exactly would I do it?

I already know how to iterate over the filesystem and read documents to XML or string object, so you don't need to answer this part.

Thanks!

+3




4 answers


I think regexes would be a bit of overkill if the situation is as simple as you describe (i.e. it is always the same tag, and there is always exactly one of them). If your XML files are relatively small (kilobytes, not megabytes), you can just load the whole thing into memory, use string operations to insert the missing forward slash, and call it a day. This will be significantly more efficient (faster) than using regular expressions. If your files are very large, you can change it to read the file one line at a time until it finds the first <lunch> tag, then find the next one and modify it accordingly. Here's some code to get you started:



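using System.IO;

var xml = File.ReadAllText( @"C:\Path\To\NaughtyXml.xml" );

// Locate the opening <lunch> and the second <lunch> that should be </lunch>.
var firstLunchIdx = xml.IndexOf( "<lunch>" );
var secondLunchIdx = firstLunchIdx < 0 ? -1 : xml.IndexOf( "<lunch>", firstLunchIdx + 1 );

// Insert the missing "/" right after the "<" of the second tag.
// If the file does not contain two <lunch> tags, leave it unchanged.
var correctedXml = secondLunchIdx < 0
    ? xml
    : xml.Substring( 0, secondLunchIdx + 1 ) + "/" + xml.Substring( secondLunchIdx + 1 );

File.WriteAllText( @"C:\Path\To\CorrectedXml.xml", correctedXml );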
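
For the very-large-file case described above, a line-at-a-time variant might look like the following sketch. The class and method names (LunchFixer, FixStream) are made up for illustration, and it assumes, as in the question, that the second <lunch> in the file is the one that should have been a closing tag. Note that WriteLine normalizes line endings to the writer's NewLine setting.

```csharp
using System;
using System.IO;

public static class LunchFixer
{
    // Hypothetical helper: copies input to output one line at a time and
    // turns the second "<lunch>" it encounters into "</lunch>".
    // Returns true if a fix was applied.
    public static bool FixStream(TextReader input, TextWriter output)
    {
        const string Tag = "<lunch>";
        int seen = 0;          // how many <lunch> tags have been seen so far
        bool patched = false;  // only one tag per file is ever changed
        string line;
        while ((line = input.ReadLine()) != null)
        {
            int idx = line.IndexOf(Tag, StringComparison.Ordinal);
            while (idx >= 0 && !patched)
            {
                seen++;
                if (seen == 2)
                {
                    // Insert the missing slash right after the "<".
                    line = line.Insert(idx + 1, "/");
                    patched = true;
                }
                else
                {
                    // Look for another <lunch> later on the same line.
                    idx = line.IndexOf(Tag, idx + 1, StringComparison.Ordinal);
                }
            }
            output.WriteLine(line);
        }
        return patched;
    }
}
```

To use it on files, wrap a StreamReader for the broken file and a StreamWriter for the corrected file in using blocks and pass them to FixStream; only one line is held in memory at a time.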

      

+2




If your broken XML is relatively simple, as you showed in the question, then you can get away with some simplified logic and a basic regex.

    // Requires: using System; using System.Text.RegularExpressions;
    public static void Main(string[] args)
    {
        string broken = @"
<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>";

        var pattern1 = "(?<open><(?<tag>[a-z]+)>)([^<]+?)(\\k<open>)";
        var re1 = new Regex(pattern1, RegexOptions.Singleline);

        String work = broken;
        Match match = null;
        do
        {
            match = re1.Match(work);
            if (match.Success)
            {
                Console.WriteLine("Match at position {0}.", match.Index);
                var tag = match.Groups["tag"].Value;

                Console.WriteLine("tag: {0}", tag);

                work = work.Substring(0, match.Index) +
                    match.Value.Substring(0, match.Value.Length - tag.Length -1) +
                    "/" +
                    work.Substring(match.Index + match.Value.Length - tag.Length -1);

                Console.WriteLine("fixed: {0}", work);
            }
        } while (match.Success);
    }

      

This regular expression uses the "named capture group" feature of .NET regular expressions. The ?<open> indicates that the group captured by the enclosing parentheses will be accessible under the name "open". This group captures the opening tag, including the angle brackets. It assumes that there is no XML attribute on the opening tag. Within that group there is another named group - this one uses the name "tag" and captures the tag name itself, without the angle brackets.

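The regex then lazily grabs a run of intervening text ( ([^<]+?) , i.e. anything that is not an opening angle bracket) and then another "open" tag, which is specified with a backreference. The lazy capture is there so that it doesn't swallow any possible intervening open tag in the text.

Since XML can span multiple lines, you will need RegexOptions.Singleline if you use "." to match the intervening text, so that "." also matches newlines. The negated character class [^<] matches newlines either way.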



The logic then applies this regex in a loop, replacing any matched text with a fixed version - valid xml with an end tag. Fixed XML is generated using simple string slicing.

This regex won't work if:

  • the opening tag has XML attributes
  • there is unusual spacing - whitespace between the angle brackets and the tag name
  • tag names use dashes or numbers or anything that is not an ASCII lowercase character
  • the intervening text includes angle brackets (in CDATA)

... but the approach will still work. You will just need to tweak the regex a bit.
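
As a rough sketch of such a tweak, the pattern below tolerates attributes on the opening tag, whitespace around the tag name, and tag names containing digits, dashes, dots, or uppercase letters. It is only an illustration under stated assumptions: the names TagRepair and CloseDoubledTags are invented here, it uses Regex.Replace with substitutions instead of the string slicing shown earlier, and it still does not handle angle brackets inside the intervening text (CDATA) or legitimately nested elements of the same name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRepair
{
    // open : the whole opening tag, optionally with attributes, not self-closing
    // tag  : the tag name (letters, digits, '_', '.', '-')
    // body : intervening text with no '<' in it
    // The pattern ends with a second opening tag of the same name.
    private static readonly Regex DoubledOpenTag = new Regex(
        @"(?<open><\s*(?<tag>[A-Za-z_][\w.\-]*)(?:\s[^<>]*)?(?<!/)>)" +
        @"(?<body>[^<]+)" +
        @"<\s*\k<tag>\s*>");

    // Rewrites every "<x ...>text<x>" as "<x ...>text</x>".
    public static string CloseDoubledTags(string xml)
    {
        return DoubledOpenTag.Replace(xml, "${open}${body}</${tag}>");
    }
}
```

Because [^<] already matches newlines, no RegexOptions are needed here, and Regex.Replace handles every occurrence in one pass, so the explicit loop is not required.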

+3


source


If the only problem with your XML files is what you showed, then Chesso's answer should suffice. In fact, I would go that route even if it met only 80-90% of my needs; the remaining cases I could fix by hand or with specific handling code.

That said, if the structure of the file is complex and not as simple as you describe, then you should probably look at a text lexer that will let you break the contents of your file into tokens. The semantic analysis of the tokens for validation and error correction would have to be done by you, but at least parsing the text would be much easier. See the resources below for links on lexing in C#:
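
To make the token-based idea concrete, here is a minimal hand-rolled sketch. It is not taken from any lexing library, and the names (TokenRepair, Repair) are hypothetical. It splits the input into tag and text tokens, tracks the open elements on a stack, and applies one deliberately simple repair rule as an assumption: an opening tag with the same name as the innermost open element is treated as a mistyped closing tag. That rule would wrongly "fix" documents that legitimately nest an element inside one of the same name, and the tokenizer does not understand CDATA or comments containing ">".

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TokenRepair
{
    // A token is either a single tag ("<...>") or a run of text between tags.
    private static readonly Regex Token = new Regex(@"<[^>]*>|[^<]+");
    // Recognizes an opening tag and captures its name in group 1.
    private static readonly Regex OpenTag =
        new Regex(@"^<([A-Za-z_][\w.\-]*)(\s[^>]*)?>$");

    public static string Repair(string input)
    {
        var open = new Stack<string>();   // names of elements currently open
        var result = new StringBuilder();
        foreach (Match m in Token.Matches(input))
        {
            string token = m.Value;
            Match tag = OpenTag.Match(token);
            if (token.StartsWith("</", StringComparison.Ordinal))
            {
                // Closing tag: the innermost element is done.
                if (open.Count > 0) open.Pop();
            }
            else if (tag.Success && !token.EndsWith("/>", StringComparison.Ordinal))
            {
                string name = tag.Groups[1].Value;
                if (open.Count > 0 && open.Peek() == name)
                {
                    // Same name as the innermost open element:
                    // assume it was meant to be the closing tag.
                    token = "</" + name + ">";
                    open.Pop();
                }
                else
                {
                    open.Push(name);
                }
            }
            // Text, comments, declarations and self-closing tags pass through.
            result.Append(token);
        }
        return result.ToString();
    }
}
```

The advantage over a single regex is that the repair rule lives in ordinary code, so extra checks (for example, only repairing specific tag names) are easy to add.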

0




It is better not to think of these as XML files: they are non-XML files. That immediately tells you that tools designed to process XML will be useless, because the input is not XML. You need to use text tools. On UNIX, these would be things like sed, awk, or perl; I have no idea what the equivalent would be on Windows.

-1

