Python xml - handle unclosed token

I am reading hundreds of XML files and parsing them with xml.etree.ElementTree.

Quick background, fwiw: these XML files were completely valid at some point, but the process I used historically to copy them somehow damaged them. (It turned out to be a flushing issue, a with statement not closing, if you're curious; see the help I got on that investigation in "Python shutil copyfile - skip the last few lines".)

In any case, back to the question at hand.
I would still like to read the first 100,000 or so lines of these documents, which are valid XML. Each file is missing only the last 4 or 5 KB out of about 6 MB. As mentioned, the file is simply cut off. It looks like this:

</Maintag>




<Maintag>
    <Change_type>NQ</Change_type>
    <Name>Atlas</Name>
    <Test>ATLS</Test>
    <Other>NYSE</Other>
    <Scheduled_E

      

where (perhaps obviously) Scheduled_E is the start of what should be another element, <Scheduled_Event>, say. But the file cuts off mid-tag. Again, up to this point in the file there are several thousand "good" Maintag entries that I would like to read, while accepting the truncated entry (and obviously everything that should have come after it) as an unrecoverable failure.

A simple but incomplete way to deal with this might be to pre-process the XML: find the last occurrence of </Maintag> in the file and replace whatever follows it (which will be broken at some point) with the closing tags that match the file's opening tags. This would at least let me process what is still there and valid.

If anyone wants to help me with that sort of string replacement then, fwiw, the opening tags are:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
    <Source FileName="myfile">
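A rough sketch of that string replacement, using only the standard library. It assumes the closing tags simply mirror the two opening tags listed above (</Source> and </FirstTag>), that </Maintag> never appears inside a comment or CDATA section, and that the file fits comfortably in memory. The names repair_truncated_xml and parse_truncated_file are hypothetical helpers, not part of any library.

```python
import xml.etree.ElementTree as ET

# Assumption: these closing tags mirror the opening tags listed above.
CLOSING_TAGS = b"\n    </Source>\n</FirstTag>\n"
RECORD_END = b"</Maintag>"


def repair_truncated_xml(raw: bytes) -> bytes:
    """Cut the data after the last complete record and re-close the document."""
    last = raw.rfind(RECORD_END)
    if last == -1:
        raise ValueError("no complete Maintag record found")
    return raw[: last + len(RECORD_END)] + CLOSING_TAGS


def parse_truncated_file(path):
    """Hypothetical helper: read a damaged file and return the parsed root."""
    # Read bytes so the parser can honour the encoding in the XML declaration.
    with open(path, "rb") as fh:
        return ET.fromstring(repair_truncated_xml(fh.read()))
```

Calling parse_truncated_file on each damaged file and then iterating with root.iter("Maintag") should yield only the complete records, since everything after the last </Maintag> is discarded before parsing.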

      

I'm hoping there is an even easier way, perhaps something in ElementTree or BeautifulSoup, to handle this situation ... I've done a decent amount of searching and nothing seems easy / obvious.

Thanks



1 answer


For dealing with unclosed elements (or an unclosed token, as the title of this question puts it), I recommend trying lxml. lxml's XMLParser has a recover parameter, which is documented as:

recover - try hard to parse through broken XML

For example, given the broken XML:



from lxml import etree

xml = """
<root>
    <Maintag>
        <Change_type>NQ</Change_type>
        <Name>Atlas</Name>
        <Test>ATLS</Test>
        <Other>NYSE</Other>
        <Scheduled_E
"""
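# recover=True asks the parser to keep going past well-formedness errors
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
# tostring() returns bytes by default; decode for readable output
print(etree.tostring(doc).decode())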

      

The recovered XML printed with the above code looks like this:

<root>
    <Maintag>
        <Change_type>NQ</Change_type>
        <Name>Atlas</Name>
        <Test>ATLS</Test>
        <Other>NYSE</Other>
        <Scheduled_E/></Maintag></root>
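Note that the recovered tree above still contains the half-written Maintag record. If only fully closed records should be kept, one alternative (a different technique from lxml's recover mode) is the standard library's XMLPullParser, which reports each element as it closes and raises ParseError once the data breaks off. The following is a minimal sketch under that assumption; complete_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def complete_records(data: bytes, tag: str = "Maintag"):
    """Return every fully closed <tag> element seen before the data breaks off."""
    parser = ET.XMLPullParser(events=("end",))
    records = []
    try:
        parser.feed(data)
        # "end" events fire only for elements whose closing tag was parsed.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                records.append(elem)
        parser.close()  # raises ParseError when the document is cut short
    except ET.ParseError:
        pass  # truncation reached; keep what was collected so far
    return records
```

Because the truncated record never produces an "end" event, it is skipped without any string manipulation. For very large inputs the data could be fed in chunks rather than all at once.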

      
