Problem creating nested xml from flat xml

Question

I am trying to create nested xml from flat XML using XSLT, however I found that it only creates one nest and ignores the rest of the entries in the original XML.

My XML input looks like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Data -->
<table name="ecatalogue">
  <!-- Row 1 -->
  <tuple>
    <atom name="irn">2470</atom>
    <atom name="EADUnitID">da.01</atom>
    <atom name="EADUnitTitle">Some title</atom>
    <tuple name="AssParentObjectRef" />
  </tuple>
    <!-- Row 2 -->
  <tuple>
    <atom name="irn">5416</atom>
    <atom name="EADUnitID">da.01.01</atom>
    <atom name="EADUnitTitle">Child of Some title</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Some Title</atom>
    <atom name="irn">2470</atom>
    </tuple>
  </tuple>
    <!-- Row 3 -->
  <tuple>
    <atom name="irn">6</atom>
    <atom name="EADUnitID">da.01.02</atom>
    <atom name="EADUnitTitle">Child of Some title 2</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Some Title</atom>
    <atom name="irn">2470</atom>
    </tuple>
  </tuple>
    <!-- Row 4 -->
  <tuple>
    <atom name="irn">8</atom>
    <atom name="EADUnitID">da.01.02.01</atom>
    <atom name="EADUnitTitle">3rd Generation</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Child of Some Title 2</atom>
    <atom name="irn">6</atom>
    </tuple>
  </tuple>
    <!-- Row 5 -->
  <tuple>
    <atom name="irn">1130</atom>
    <atom name="EADUnitID">da.02</atom>
    <atom name="EADUnitTitle">Another title</atom>
    <tuple name="AssParentObjectRef" />
  </tuple>
    <!-- Row 6 -->
  <tuple>
    <atom name="irn">54</atom>
    <atom name="EADUnitID">da.02.01</atom>
    <atom name="EADUnitTitle">Child of Another title</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Another Title</atom>
    <atom name="irn">1130</atom>
    </tuple>
  </tuple>
    <!-- Row 7 -->
  <tuple>
    <atom name="irn">16</atom>
    <atom name="EADUnitID">da.02.02</atom>
    <atom name="EADUnitTitle">Child of Another Title 2</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Another Title</atom>
    <atom name="irn">1130</atom>
    </tuple>
  </tuple>
    <!-- Row 8 -->
  <tuple>
    <atom name="irn">22</atom>
    <atom name="EADUnitID">da.02.02.01</atom>
    <atom name="EADUnitTitle">3rd Generation</atom>
    <tuple name="AssParentObjectRef">
    <atom name="EADUnitTitle">Child of Another Title 2</atom>
    <atom name="irn">1130</atom>
    </tuple>
  </tuple>
</table>

XSLT needs to identify the top-level record and then add children. For the top entry, it must duplicate its irn and EADUnitTitle as TopID and TopTitle, respectively. For each child, it must include the immediate ParentID and ParentTitle, as well as the TopID and TopTitle. The result should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<table name="ecatalogue">
   <collection>
      <tuple>
         <atom name="irn">2470</atom>
         <atom name="EADUnitID">da.01</atom>
         <atom name="EADUnitTitle">Some title</atom>
         <atom name="TopTitle">Some title</atom>
         <atom name="TopID">2470</atom>
         <tuple name="children">
            <tuple>
               <atom name="irn">5416</atom>
               <atom name="EADUnitID">da.01.01</atom>
               <atom name="EADUnitTitle">Child of Some title</atom>
               <atom name="ParentTitle">Some title</atom>
               <atom name="ParentID">2470</atom>
               <atom name="TopTitle">Some title</atom>
               <atom name="TopID">2470</atom>
            </tuple>
            <tuple>
                <atom name="irn">6</atom>
               <atom name="EADUnitID">da.01.02</atom>
               <atom name="EADUnitTitle">Child of Some title 2</atom>
               <atom name="ParentTitle">Some title</atom>
               <atom name="ParentID">2470</atom>
               <atom name="TopTitle">Some title</atom>
               <atom name="TopID">2470</atom>
               <tuple name="children">
                  <tuple>
                    <atom name="irn">8</atom>
                    <atom name="EADUnitID">da.01.02.01</atom>
                    <atom name="EADUnitTitle">3rd Generation</atom>
                    <atom name="ParentTitle">Child of Some title 2</atom>
                    <atom name="ParentID">6</atom>
                    <atom name="TopTitle">Some title</atom>
                    <atom name="TopID">2470</atom>
                  </tuple>
               </tuple>
            </tuple>
         </tuple>
      </tuple>
   </collection>
   <collection>
      <tuple>
         <atom name="irn">1130</atom>
         <atom name="EADUnitID">da.02</atom>
         <atom name="EADUnitTitle">Another title</atom>
         <atom name="TopTitle">Another title</atom>
         <atom name="TopID">1130</atom>
         <tuple name="children">
            <tuple>
               <atom name="irn">54</atom>
               <atom name="EADUnitID">da.02.01</atom>
               <atom name="EADUnitTitle">Child of Another title</atom>
               <atom name="ParentTitle">Another title</atom>
               <atom name="ParentID">1130</atom>
               <atom name="TopTitle">Another title</atom>
               <atom name="TopID">1130</atom>
            </tuple>
            <tuple>
                <atom name="irn">16</atom>
               <atom name="EADUnitID">da.02.02</atom>
               <atom name="EADUnitTitle">Child of Another title 2</atom>
               <atom name="ParentTitle">Another title</atom>
               <atom name="ParentID">1130</atom>
               <atom name="TopTitle">Another title</atom>
               <atom name="TopID">1130</atom>
               <tuple name="children">
                  <tuple>
                    <atom name="irn">22</atom>
                    <atom name="EADUnitID">da.02.02.01</atom>
                    <atom name="EADUnitTitle">3rd Generation</atom>
                    <atom name="ParentTitle">Child of Another title 2</atom>
                    <atom name="ParentID">16</atom>
                    <atom name="TopTitle">Another title</atom>
                    <atom name="TopID">1130</atom>
                  </tuple>
               </tuple>
            </tuple>
         </tuple>
      </tuple>

....

   </collection>
</table>

I have XSLT:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:key name="child" match="tuple" use="tuple[@name='AssParentObjectRef']/atom[@name='irn']" />

<xsl:template match="/table">
    <table name="ecatalogue">
        <collection>
            <xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
        </collection>
    </table>
</xsl:template>

<xsl:template match="tuple">
    <tuple>
        <xsl:copy-of select="atom"/>
        <xsl:if test="key('child', atom[@name='irn'])">
            <tuple name="children">
                <xsl:apply-templates select="key('child', atom[@name='irn'])"/>
             </tuple>
        </xsl:if>
    </tuple>
</xsl:template>

</xsl:stylesheet>

And while this will group the records, the output is just one of those collections. Therefore, from a file of 3524 records, I get one collection of 24 records.

I experimented with replacing this part of the XSLT:

<xsl:template match="/table">
    <table name="ecatalogue">
        <collection>
            <xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
        </collection>
    </table>
</xsl:template>

with:

<xsl:template match="node()|@*">
   <xsl:copy>
     <xsl:apply-templates select="node()|@*"/>
 </xsl:copy>
</xsl:template>

And while this returns all nested structures, it also duplicates the entries in the nests, so they become collections on their own.

Any ideas on where I am going wrong?

EDIT 06/06/17

When I use:

<xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
    </xsl:template>

I am getting duplicates (note: the "id" in the example below is added for illustration):

 <record id='1'>
   <children>
        <record id='2'>
            <children>
                <record id='3'>
                    <children>
                        <record id='4'></record>
                    </children>
                </record>
            </children>
        </record>
    </children>
</record>

<record id='2'>
        <children>
            <record id='3'>
                <children>
                    <record id='4'></record>
                </children>
            </record>
        </children>
</record>

<record id='3'>
    <children>
        <record id='4'></record>
    </children>
</record>

<record id='2'></record>
<record id='3'></record>
<record id='4'></record>

Is there a way to remove duplicates so that I am left with nested records?

EDIT - problematic tuples

 <!-- Row 3378 -->
  <tuple>
    <atom name="irn">115024</atom>
    <atom name="ObjectType">Archives</atom>
    <atom name="EADLevelAttribute">Series</atom>
    <atom name="EADUnitID">D42.PL.05</atom>
    <atom name="EADUnitTitle">Correspondence and Company Administration: Box Files</atom>
    <atom name="EADScopeAndContent">Box files of Port Line official company correspondence and administrative papers. These papers were collected towards historical research and include correspondence from earlier periods c.1890 although the bulk of the papers relate to the two periods 1937-1939 and 1949-1951.</atom>
    <atom name="EADBiographyOrHistory"></atom>
    <tuple name="AssParentObjectRef">
    </tuple>
    <atom name="EADArrangement">The papers in this series have been retained in the original order as stored by Port Line Ltd. The contents of each box file are listed as a typescript paper and have been listed in this catalogue. Box file titles have been listed in the title field of each item in this series.</atom>
    <atom name="EADUnitDate">1890-1952</atom>
    <table name="EADExtent_tab">
      <tuple>
        <atom name="EADExtent">7 boxes.</atom>
      </tuple>
    </table>
    <atom name="EADAccruals"></atom>
    <atom name="EADOtherFindingAid"></atom>
    <atom name="EADRelatedMaterial"></atom>
    <tuple name="EADAcquisitionInformationRef">
    </tuple>
    <atom name="EADAppraisalInformation"></atom>
    <atom name="EADSeparatedMaterial"></atom>
    <atom name="EADTitleProper"></atom>
    <atom name="EADPublicationStatement"></atom>
    <atom name="EADCustodialHistory"></atom>
    <atom name="EADSource"></atom>
    <atom name="EADNote"></atom>
    <atom name="EADAccessRestrictions">Some items in this series are closed access.</atom>
    <atom name="EADUseRestrictions"></atom>
  </tuple>

  <!-- Row 3379 -->
  <tuple>
    <atom name="irn">115025</atom>
    <atom name="ObjectType">Archives</atom>
    <atom name="EADLevelAttribute">Item</atom>
    <atom name="EADUnitID">D42.PL.05.01</atom>
    <atom name="EADUnitTitle">File: Australian Homeward Trade</atom>
    <atom name="EADScopeAndContent">Various papers relating to Australian Homeward Trade and includes the following:For proof copies of the Australian Homeward Agreement see D42/PL5/6.</atom>
    <atom name="EADBiographyOrHistory"></atom>
    <tuple name="AssParentObjectRef">
      <atom name="EADUnitTitle">Correspondence and Company Administration: Box Files</atom>
      <atom name="irn">115024</atom>
    </tuple>
    <atom name="EADArrangement"></atom>
    <atom name="EADUnitDate">1920-1936</atom>
    <table name="EADExtent_tab">
      <tuple>
        <atom name="EADExtent">1 file.</atom>
      </tuple>
    </table>
    <atom name="EADAccruals"></atom>
    <atom name="EADOtherFindingAid"></atom>
    <atom name="EADRelatedMaterial"></atom>
    <tuple name="EADAcquisitionInformationRef">
    </tuple>
    <atom name="EADAppraisalInformation"></atom>
    <atom name="EADSeparatedMaterial"></atom>
    <atom name="EADTitleProper"></atom>
    <atom name="EADPublicationStatement"></atom>
    <atom name="EADCustodialHistory"></atom>
    <atom name="EADSource"></atom>
    <atom name="EADNote"></atom>
    <atom name="EADAccessRestrictions"></atom>
    <atom name="EADUseRestrictions"></atom>
  </tuple>

xml xslt

joesch 05 May '17 at 9:34

2 answers

Tim C · Answer 1 · 2017-05-05T09:51:36+0000

If you want a collection element for each top-level parent tuple, I think all you have to do is use an xsl:for-each to select the parents, and move the creation of the collection element inside it.

<xsl:template match="/table">
    <table name="ecatalogue">
        <xsl:for-each select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]">
            <collection>
                <xsl:apply-templates select="." />
            </collection>
        </xsl:for-each>
    </table>
</xsl:template>

Eiríkr Útlendi · Answer 2 · 2017-06-09T00:56:43+0000

This is a bit long; I try to address everything that caught my attention.

Existing templates

First, let's break down what's going on with your own XSL code.

<xsl:template match="/table">
    <table name="ecatalogue">
        <collection>
            <xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
        </collection>
    </table>
</xsl:template>

You have <collection> here, in the template that matches /table. Since there is only one match for /table, you will only get one <collection> in the output.

Also, your selection of top-level tuple elements (those that do not reference a parent) can be simplified. Instead of:

select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"

you could just say:

select="tuple[not(tuple/*)]"

since your example shows that a top-level tuple contains only an empty, self-closing <tuple> element.

Next bit:

<xsl:key name="child" match="tuple" use="tuple[@name='AssParentObjectRef']/atom[@name='irn']" />

So the key matches tuple elements and indexes each one by the irn of its parent tuple.

<xsl:template match="tuple">
    <tuple>
        <xsl:copy-of select="atom"/>

        <!-- If this `tuple` is a parent (i.e. if its `irn` is included
             in the list of parent IDs in the key), then we add a
             wrapper for the children and process the children.  -->
        <xsl:if test="key('child', atom[@name='irn'])">
            <tuple name="children">
                <!-- Now we apply templates to the `tuple`s 
                     in the key -->
                <xsl:apply-templates select="key('child', atom[@name='irn'])"/>
            </tuple>
        </xsl:if>
    </tuple>
</xsl:template>

It basically works. Comparing its results with your sample of the desired output, the bits you're missing are the <collection> wrapper (see above) and the titles and IDs of the parent and the top-level ancestor (for which you don't have any XSL code).

You say:

"And while this will group the records, the output is just one of those collections. Therefore, from a file of 3524 records, I get one collection of 24 records."

I can only assume that the XML structure of the rest of your actual input differs from what your XSL is targeting. Without seeing your actual input, I cannot tell why this might be the case.

Your edit

You describe adding the following template:

<xsl:template match="node()|@*">
    <xsl:copy>
        <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
</xsl:template>

This is the "identity" template, so called because it copies nodes through unchanged. The duplication you see is expected: the XSLT processor walks through the flat list of tuple elements in the input file and outputs every one of them, and it also processes some of them again from inside the dedicated tuple template, which applies templates to the tuple elements that match the key (the nested children).

An alternative approach

The template for table is not too different:

<xsl:template match="table">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- Target only parent-level `tuple`s.  This excludes
            child-level tuples, helping to prevent duplicates.-->
        <xsl:apply-templates select="tuple[not(tuple/*)]" mode="top"/>
    </xsl:copy>
</xsl:template>

Notably, we are not doing anything with <collection> here. We want to wrap each top-level tuple (and its children) in a <collection>, so we need to add <collection> at the tuple level.

<!-- Add the `collection` wrapper only to top-level tuples -->
<xsl:template match="tuple" mode="top">
    <collection>
        <!-- Pass on this tuple to the main `tuple` template -->
        <xsl:apply-templates select="."/>
    </collection>
</xsl:template>

I'm matching tuple, but using a special mode, because we only want to add a <collection> wrapper for the top-level tuples. Then I have a separate template that handles all tuples:

<!-- This is the main template for processing `tuple` elements.
    Most of the changes needed are common to all `tuple`s, so it
    makes sense to keep all the logic in one place. -->
<xsl:template match="tuple">
    <tuple>
        <!-- Copy each existing `atom` child -->
        <xsl:copy-of select="atom"/>
        <!-- Add in metadata about parent and top-level ancestor titles and IDs -->
            <xsl:choose>
                <!-- If this is a top-level item, just use its own values -->
                <xsl:when test="not(tuple/*)">
                    <atom name="TopTitle"><xsl:value-of select="atom[@name='EADUnitTitle']"/></atom>
                    <atom name="TopID"><xsl:value-of select="atom[@name='irn']"/></atom>
                </xsl:when>
                <!-- If this is a descendant, we need to find its parent and its top-level ancestor -->
                <xsl:when test="tuple/*">
                    <atom name="ParentTitle"><xsl:value-of select="tuple/atom[@name='EADUnitTitle']"/></atom>
                    <atom name="ParentID"><xsl:value-of select="tuple/atom[@name='irn']"/></atom>
                <!-- For convenience, grab the top-level ancestor `tuple` and stuff it in a variable.
                    This is vaguely analogous to your use of `key`. -->

                <!-- Finding the top-level `tuple` is complicated by the fact that the ID values in 
                    `<atom name="EADUnitID">` do not have a standardized format, other than that the whole
                    strings appear to consist of atomic values separated by single periods, with 
                    descendant `EADUnitID` values appending to the ancestor values.  Examples:
                      Top:        `da.04`
                      Descendant: `da.04.11.02`
                      Top:        `D42.PL.05`
                      Descendant: `D42.PL.05.01`
                    So chunking the ID values is a problematic approach, since we don't know how many
                    chunks comprise the initial non-numeric portion: `da`, or `D42.PL`, or ... ???.
                    Top-level elements *do* also have empty `<tuple name="AssParentObjectRef">` elements.
                    So we _can_ find all the top-level elements, and then look in those for the one that
                    has an `EADUnitID` value that matches the start of the `EADUnitID` value of this
                    current `tuple`.
                    The first predicate below grabs all the `tuple`s that have an empty
                    `tuple[@name='AssParentObjectRef']`.  The second predicate then keeps only those
                    whose `EADUnitID` value matches the start of the current `tuple`'s `EADUnitID`. -->
                <xsl:variable name="top" select="/table/tuple[tuple[@name='AssParentObjectRef'][not(*)]]
                    [starts-with(current()/atom[@name='EADUnitID'], atom[@name='EADUnitID'])]"/>
                    <!-- Now we can reference that variable to get the top-level ancestor values -->
                    <atom name="TopTitle"><xsl:value-of select="$top/atom[@name='EADUnitTitle']"/></atom>
                    <atom name="TopID"><xsl:value-of select="$top/atom[@name='irn']"/></atom>
                </xsl:when>
            </xsl:choose>
        <!-- Process any children of this tuple, based on `irn` values.
            Basically, we look for any other `tuple`s in the `table`
            that point to this current `tuple` `irn` value. -->
        <xsl:if test="/table/tuple[tuple/atom[@name='irn'] = current()/atom[@name='irn']]">
            <tuple name="children">
                <xsl:apply-templates select="/table/tuple[tuple/atom[@name='irn'] = current()/atom[@name='irn']]"></xsl:apply-templates>
            </tuple>
        </xsl:if>
    </tuple>
</xsl:template>

I cannot speak to your complete dataset of 3524 records, but when I run the above against your sample XML input, my output is identical to your desired output (except for one error in your XML input example involving a parent-reference irn value, which was mentioned in a comment on your original post).
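
One possible variation, in case the prefix lookup turns out to be slow or ambiguous on the full dataset, is to hand the top-level record down through a template parameter while recursing, so it never has to be searched for. The following is a minimal XSLT 1.0 sketch rather than a tested drop-in: the parameter name top and the variable names are my own, and it assumes that every record is a direct child of /table and that the parent references never form a cycle. Note that ParentTitle is taken from the copy of the title stored in the parent reference, so it reflects whatever spelling the input has there.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index each record by the irn of the parent it points to -->
  <xsl:key name="child" match="/table/tuple"
           use="tuple[@name='AssParentObjectRef']/atom[@name='irn']"/>

  <xsl:template match="/table">
    <table name="{@name}">
      <!-- One collection per record that has no parent irn -->
      <xsl:for-each select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]">
        <collection>
          <xsl:apply-templates select="."/>
        </collection>
      </xsl:for-each>
    </table>
  </xsl:template>

  <xsl:template match="tuple">
    <!-- Hypothetical parameter: defaults to the current record,
         which is the right value for a top-level record -->
    <xsl:param name="top" select="."/>
    <xsl:variable name="parent" select="tuple[@name='AssParentObjectRef']"/>
    <xsl:variable name="children" select="key('child', atom[@name='irn'])"/>
    <tuple>
      <xsl:copy-of select="atom"/>
      <xsl:if test="$parent/atom[@name='irn']">
        <atom name="ParentTitle"><xsl:value-of select="$parent/atom[@name='EADUnitTitle']"/></atom>
        <atom name="ParentID"><xsl:value-of select="$parent/atom[@name='irn']"/></atom>
      </xsl:if>
      <atom name="TopTitle"><xsl:value-of select="$top/atom[@name='EADUnitTitle']"/></atom>
      <atom name="TopID"><xsl:value-of select="$top/atom[@name='irn']"/></atom>
      <xsl:if test="$children">
        <tuple name="children">
          <!-- Pass the same top-level record down to every descendant -->
          <xsl:apply-templates select="$children">
            <xsl:with-param name="top" select="$top"/>
          </xsl:apply-templates>
        </tuple>
      </xsl:if>
    </tuple>
  </xsl:template>
</xsl:stylesheet>
```

Because the children are found through the key and the top-level record is passed along explicitly, this version does not depend on the format of the EADUnitID values at all.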

Suggestions for the desired XML output

As a data format, the desired output XML has a number of aspects that strike me as a bit odd.

<collection> wrapper

This seems like overkill; it is easy enough to tell whether a given tuple is 1) top level and 2) has children.

Parent and Top titles and IDs

These seem like overkill too. As long as your data is structured hierarchically, all of this is clear without any need to include it explicitly. Including this metadata simply duplicates information you already have.

<tuple name="children"> wrapper

This can also be omitted without losing information. The mere presence of one tuple element nested within another is enough to indicate children.

I don't know if you have any control over, or influence on, the design of the output XML format, but if you do, I would suggest a more concise and cleaner structure such as:

<table name="ecatalogue">
    <tuple>
        <atom name="irn">2470</atom>
        <atom name="EADUnitID">da.01</atom>
        <atom name="EADUnitTitle">Some title</atom>
        <tuple>
            <atom name="irn">5416</atom>
            <atom name="EADUnitID">da.01.01</atom>
            <atom name="EADUnitTitle">Child of Some title</atom>
        </tuple>
        <tuple>
            <atom name="irn">6</atom>
            <atom name="EADUnitID">da.01.02</atom>
            <atom name="EADUnitTitle">Child of Some title 2</atom>
            <tuple>
                <atom name="irn">8</atom>
                <atom name="EADUnitID">da.01.02.01</atom>
                <atom name="EADUnitTitle">3rd Generation</atom>
            </tuple>
        </tuple>
    </tuple>
</table>

In this structure, a tuple can only contain atom elements or other tuple elements. We can identify a collection by simply finding any top-level tuple that contains other tuples. We can identify children by simply finding any tuple that has a parent tuple. We can find the top-level and parent titles and IDs by simply selecting a tuple and looking further up the element tree.

This structure is simpler, avoids data duplication, and is arguably clearer and easier to process. However, you know your needs! Do what works for you. :)
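
If you do go with the simpler structure, the stylesheet shrinks accordingly. Here is a minimal XSLT 1.0 sketch, assuming the same input layout as your sample; the XPath expressions in the trailing comment are only illustrations of how a consumer of the output could recover the parent and top-level values.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index each record by the irn of the parent it points to -->
  <xsl:key name="child" match="/table/tuple"
           use="tuple[@name='AssParentObjectRef']/atom[@name='irn']"/>

  <xsl:template match="/table">
    <table name="{@name}">
      <!-- Start from the records that have no parent irn -->
      <xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
    </table>
  </xsl:template>

  <xsl:template match="tuple">
    <tuple>
      <xsl:copy-of select="atom"/>
      <!-- Children are nested directly, with no wrapper element -->
      <xsl:apply-templates select="key('child', atom[@name='irn'])"/>
    </tuple>
  </xsl:template>

  <!-- In the resulting document, with a nested tuple as context node:
         parent ID:        parent::tuple/atom[@name='irn']
         top-level title:  ancestor::tuple[last()]/atom[@name='EADUnitTitle']
       (ancestor:: is a reverse axis, so [last()] is the outermost tuple) -->
</xsl:stylesheet>
```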

Please go through the code and comments, and let me know if you have any lingering questions.

Update 2017-06-15: issues

I added these to your previous XML data sample and tried to apply my previous XSL code. Looking at the problematic tuples, two things became apparent.

First, my previous code used the tokenize function, which is only available in XSLT 2.0 and later. Your post tags didn't specify XSLT 1.0, and I didn't notice that you had specified it in your example stylesheet header.

I reworked the XSL code above (the template that matches tuple) to rely only on XSLT 1.0 features.

Second, the EADUnitID values in the original XML input sample are not representative, so any attempt to code specifically for that sample will fail when applied to your full input XML, which I cannot see.

Your original sample includes only EADUnitID values of the form da.XX, where XX are digits and the .XX pattern can be repeated. I made some assumptions about how to split this string and compare the pieces.

However, your problematic tuples have EADUnitID values in a very different format, which looks like DXX.PL.XX, where again XX are digits and the .XX pattern at the end may be repeated. This means that relying on chunk matching between periods is not a safe approach.

I reworked the XSL code to match on the start of the whole ID string instead, which works reliably.
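
One caveat with plain prefix matching is that an ID such as da.01 is also a prefix of da.010.01. A possible hardening, sketched here under the assumption that periods always delimit the ID segments, is to append a period to both strings before comparing them. The stylesheet below is a hypothetical diagnostic rather than part of the transformation: it prints each record's EADUnitID next to the irn of the top-level record it resolves to, so you can check the matching against your full data. If several top-level IDs are prefixes of one another, more than one record will match and only the first is reported.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/table">
    <xsl:for-each select="tuple">
      <!-- Current ID with a trailing period, e.g. "da.01.02." -->
      <xsl:variable name="id" select="concat(atom[@name='EADUnitID'], '.')"/>
      <!-- Top-level records (empty parent reference) whose ID, also with
           a trailing period, is a prefix of the current ID -->
      <xsl:variable name="top" select="/table/tuple
          [not(tuple[@name='AssParentObjectRef']/*)]
          [starts-with($id, concat(atom[@name='EADUnitID'], '.'))]"/>
      <xsl:value-of select="concat(atom[@name='EADUnitID'], ' -> ',
          $top/atom[@name='irn'], '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```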

Take a look at the code and comments, and let me know if anything remains unclear or non-functional.
