Problem creating nested xml from flat xml
I am trying to create nested xml from flat XML using XSLT, however I found that it only creates one nest and ignores the rest of the entries in the original XML.
My XML input looks like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Data -->
<table name="ecatalogue">
<!-- Row 1 -->
<tuple>
<atom name="irn">2470</atom>
<atom name="EADUnitID">da.01</atom>
<atom name="EADUnitTitle">Some title</atom>
<tuple name="AssParentObjectRef" />
</tuple>
<!-- Row 2 -->
<tuple>
<atom name="irn">5416</atom>
<atom name="EADUnitID">da.01.01</atom>
<atom name="EADUnitTitle">Child of Some title</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Some Title</atom>
<atom name="irn">2470</atom>
</tuple>
</tuple>
<!-- Row 3 -->
<tuple>
<atom name="irn">6</atom>
<atom name="EADUnitID">da.01.02</atom>
<atom name="EADUnitTitle">Child of Some title 2</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Some Title</atom>
<atom name="irn">2470</atom>
</tuple>
</tuple>
<!-- Row 4 -->
<tuple>
<atom name="irn">8</atom>
<atom name="EADUnitID">da.01.02.01</atom>
<atom name="EADUnitTitle">3rd Generation</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Child of Some Title 2</atom>
<atom name="irn">6</atom>
</tuple>
</tuple>
<!-- Row 5 -->
<tuple>
<atom name="irn">1130</atom>
<atom name="EADUnitID">da.02</atom>
<atom name="EADUnitTitle">Another title</atom>
<tuple name="AssParentObjectRef" />
</tuple>
<!-- Row 6 -->
<tuple>
<atom name="irn">54</atom>
<atom name="EADUnitID">da.02.01</atom>
<atom name="EADUnitTitle">Child of Another title</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Another Title</atom>
<atom name="irn">1130</atom>
</tuple>
</tuple>
<!-- Row 7 -->
<tuple>
<atom name="irn">16</atom>
<atom name="EADUnitID">da.02.02</atom>
<atom name="EADUnitTitle">Child of Another Title 2</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Another Title</atom>
<atom name="irn">1130</atom>
</tuple>
</tuple>
<!-- Row 8 -->
<tuple>
<atom name="irn">22</atom>
<atom name="EADUnitID">da.02.02.01</atom>
<atom name="EADUnitTitle">3rd Generation</atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Child of Another Title 2</atom>
<atom name="irn">1130</atom>
</tuple>
</tuple>
</table>
The XSLT needs to identify each top-level record and then nest its children under it. For a top-level entry, it must duplicate the entry's own irn and EADUnitTitle as TopID and TopTitle, respectively. For each child, it must include the immediate ParentID and ParentTitle, as well as the TopID and TopTitle. The result should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<table name="ecatalogue">
<collection>
<tuple>
<atom name="irn">2470</atom>
<atom name="EADUnitID">da.01</atom>
<atom name="EADUnitTitle">Some title</atom>
<atom name="TopTitle">Some title</atom>
<atom name="TopID">2470</atom>
<tuple name="children">
<tuple>
<atom name="irn">5416</atom>
<atom name="EADUnitID">da.01.01</atom>
<atom name="EADUnitTitle">Child of Some title</atom>
<atom name="ParentTitle">Some title</atom>
<atom name="ParentID">2470</atom>
<atom name="TopTitle">Some title</atom>
<atom name="TopID">2470</atom>
</tuple>
<tuple>
<atom name="irn">6</atom>
<atom name="EADUnitID">da.01.02</atom>
<atom name="EADUnitTitle">Child of Some title 2</atom>
<atom name="ParentTitle">Some title</atom>
<atom name="ParentID">2470</atom>
<atom name="TopTitle">Some title</atom>
<atom name="TopID">2470</atom>
<tuple name="children">
<tuple>
<atom name="irn">8</atom>
<atom name="EADUnitID">da.01.02.01</atom>
<atom name="EADUnitTitle">3rd Generation</atom>
<atom name="ParentTitle">Child of Some title 2</atom>
<atom name="ParentID">6</atom>
<atom name="TopTitle">Some title</atom>
<atom name="TopID">2470</atom>
</tuple>
</tuple>
</tuple>
</tuple>
</tuple>
</collection>
<collection>
<tuple>
<atom name="irn">1130</atom>
<atom name="EADUnitID">da.02</atom>
<atom name="EADUnitTitle">Another title</atom>
<atom name="TopTitle">Another title</atom>
<atom name="TopID">1130</atom>
<tuple name="children">
<tuple>
<atom name="irn">54</atom>
<atom name="EADUnitID">da.02.01</atom>
<atom name="EADUnitTitle">Child of Another title</atom>
<atom name="ParentTitle">Another title</atom>
<atom name="ParentID">1130</atom>
<atom name="TopTitle">Another title</atom>
<atom name="TopID">1130</atom>
</tuple>
<tuple>
<atom name="irn">16</atom>
<atom name="EADUnitID">da.02.02</atom>
<atom name="EADUnitTitle">Child of Another title 2</atom>
<atom name="ParentTitle">Another title</atom>
<atom name="ParentID">1130</atom>
<atom name="TopTitle">Another title</atom>
<atom name="TopID">1130</atom>
<tuple name="children">
<tuple>
<atom name="irn">22</atom>
<atom name="EADUnitID">da.02.02.01</atom>
<atom name="EADUnitTitle">3rd Generation</atom>
<atom name="ParentTitle">Child of Another title 2</atom>
<atom name="ParentID">16</atom>
<atom name="TopTitle">Another title</atom>
<atom name="TopID">1130</atom>
</tuple>
</tuple>
</tuple>
</tuple>
</tuple>
....
</collection>
</table>
I have XSLT:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="child" match="tuple" use="tuple[@name='AssParentObjectRef']/atom[@name='irn']" />
<xsl:template match="/table">
<table name="ecatalogue">
<collection>
<xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
</collection>
</table>
</xsl:template>
<xsl:template match="tuple">
<tuple>
<xsl:copy-of select="atom"/>
<xsl:if test="key('child', atom[@name='irn'])">
<tuple name="children">
<xsl:apply-templates select="key('child', atom[@name='irn'])"/>
</tuple>
</xsl:if>
</tuple>
</xsl:template>
</xsl:stylesheet>
While this does group the records, the output contains only one collection. From a file of 3524 records, I get a single collection of 24 records.
I experimented with replacing this template:
<xsl:template match="/table">
<table name="ecatalogue">
<collection>
<xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
</collection>
</table>
</xsl:template>
with:
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
And while this returns all nested structures, it also duplicates the entries in the nests, so they become collections on their own.
Any ideas on where I am going wrong?
EDIT 06/06/17
When I use:
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
I am getting duplicates (note: the "id" in the example below is added for illustration):
<record id='1'>
<children>
<record id='2'>
<children>
<record id='3'>
<children>
<record id='4'></record>
</children>
</record>
</children>
</record>
</children>
</record>
<record id='2'>
<children>
<record id='3'>
<children>
<record id='4'></record>
</children>
</record>
</children>
</record>
<record id='3'>
<children>
<record id='4'></record>
</children>
</record>
<record id='2'></record>
<record id='3'></record>
<record id='4'></record>
Is there a way to remove duplicates so that I am left with nested records?
EDIT - problematic tuples
<!-- Row 3378 -->
<tuple>
<atom name="irn">115024</atom>
<atom name="ObjectType">Archives</atom>
<atom name="EADLevelAttribute">Series</atom>
<atom name="EADUnitID">D42.PL.05</atom>
<atom name="EADUnitTitle">Correspondence and Company Administration: Box Files</atom>
<atom name="EADScopeAndContent">Box files of Port Line official company correspondence and administrative papers. These papers were collected towards historical research and include correspondence from earlier periods c.1890 although the bulk of the papers relate to the two periods 1937-1939 and 1949-1951.</atom>
<atom name="EADBiographyOrHistory"></atom>
<tuple name="AssParentObjectRef">
</tuple>
<atom name="EADArrangement">The papers in this series have been retained in the original order as stored by Port Line Ltd. The contents of each box file are listed as a typescript paper and have been listed in this catalogue. Box file titles have been listed in the title field of each item in this series.</atom>
<atom name="EADUnitDate">1890-1952</atom>
<table name="EADExtent_tab">
<tuple>
<atom name="EADExtent">7 boxes.</atom>
</tuple>
</table>
<atom name="EADAccruals"></atom>
<atom name="EADOtherFindingAid"></atom>
<atom name="EADRelatedMaterial"></atom>
<tuple name="EADAcquisitionInformationRef">
</tuple>
<atom name="EADAppraisalInformation"></atom>
<atom name="EADSeparatedMaterial"></atom>
<atom name="EADTitleProper"></atom>
<atom name="EADPublicationStatement"></atom>
<atom name="EADCustodialHistory"></atom>
<atom name="EADSource"></atom>
<atom name="EADNote"></atom>
<atom name="EADAccessRestrictions">Some items in this series are closed access.</atom>
<atom name="EADUseRestrictions"></atom>
</tuple>
<!-- Row 3379 -->
<tuple>
<atom name="irn">115025</atom>
<atom name="ObjectType">Archives</atom>
<atom name="EADLevelAttribute">Item</atom>
<atom name="EADUnitID">D42.PL.05.01</atom>
<atom name="EADUnitTitle">File: Australian Homeward Trade</atom>
<atom name="EADScopeAndContent">Various papers relating to Australian Homeward Trade and includes the following:For proof copies of the Australian Homeward Agreement see D42/PL5/6.</atom>
<atom name="EADBiographyOrHistory"></atom>
<tuple name="AssParentObjectRef">
<atom name="EADUnitTitle">Correspondence and Company Administration: Box Files</atom>
<atom name="irn">115024</atom>
</tuple>
<atom name="EADArrangement"></atom>
<atom name="EADUnitDate">1920-1936</atom>
<table name="EADExtent_tab">
<tuple>
<atom name="EADExtent">1 file.</atom>
</tuple>
</table>
<atom name="EADAccruals"></atom>
<atom name="EADOtherFindingAid"></atom>
<atom name="EADRelatedMaterial"></atom>
<tuple name="EADAcquisitionInformationRef">
</tuple>
<atom name="EADAppraisalInformation"></atom>
<atom name="EADSeparatedMaterial"></atom>
<atom name="EADTitleProper"></atom>
<atom name="EADPublicationStatement"></atom>
<atom name="EADCustodialHistory"></atom>
<atom name="EADSource"></atom>
<atom name="EADNote"></atom>
<atom name="EADAccessRestrictions"></atom>
<atom name="EADUseRestrictions"></atom>
</tuple>
If you want a <collection> element for each top-level parent tuple, I think all you have to do is use an xsl:for-each to select the parents, and move the creation of the <collection> element inside it.
<xsl:template match="/table">
<table name="ecatalogue">
<xsl:for-each select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]">
<collection>
<xsl:apply-templates select="." />
</collection>
</xsl:for-each>
</table>
</xsl:template>
This is a bit long; I have tried to address everything that caught my attention.
Existing templates
First, let's break down what's going on with your own XSL code.
<xsl:template match="/table">
<table name="ecatalogue">
<collection>
<xsl:apply-templates select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"/>
</collection>
</table>
</xsl:template>
- You have <collection> here in the template that matches /table. Since there is only one match for /table, you will only get one <collection> in the output.
- Also, your selection of the top-level tuple items (tuples that do not reference a parent) can be simplified. Instead of:
select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]"
you could just say:
select="tuple[not(tuple/*)]"
since, judging from your example, a top-level tuple only contains an empty, self-closing <tuple> element.
Next bit:
<xsl:key name="child" match="tuple" use="tuple[@name='AssParentObjectRef']/atom[@name='irn']" />
So we have a key that matches tuple elements and indexes each one by the irn of the parent it references.
<xsl:template match="tuple">
<tuple>
<xsl:copy-of select="atom"/>
<!-- If this `tuple` is a parent (i.e. if it is included in
the list of parent IDs in the key), then we add a
wrapper for the children and process the children. -->
<xsl:if test="key('child', atom[@name='irn'])">
<tuple name="children">
<!-- Now we apply templates to the `tuple`s
in the key -->
<xsl:apply-templates select="key('child', atom[@name='irn'])"/>
</tuple>
</xsl:if>
</tuple>
</xsl:template>
It basically works. Comparing the results of this to your sample of the desired output, the bits you're missing are the <collection> wrapper element for each top-level record (see above) and the parent and top-level ancestor titles and IDs (for which you don't have any XSL code yet).
You indicate
"And while this is grouping records, the output will only be one of those collections. So from a file of 3524 records, I get one set of 24 records."
I can only assume that the XML structure of the rest of your actual input differs from what your XSL is targeting. But without seeing your actual input, I cannot tell why this might be the case.
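One way to test that assumption is to run a small diagnostic stylesheet over the full file before changing anything else. The following is only a sketch: it assumes the records are direct children of a root table element, as in your sample, and the key name byIrn is my own invention.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper key: look up a record by its own irn -->
  <xsl:key name="byIrn" match="/table/tuple" use="atom[@name='irn']"/>

  <xsl:template match="/table">
    <xsl:text>records: </xsl:text>
    <xsl:value-of select="count(tuple)"/>
    <!-- Records that reference no parent irn: each would start a collection -->
    <xsl:text>&#10;top-level (no parent irn): </xsl:text>
    <xsl:value-of select="count(tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])])"/>
    <!-- Records whose parent irn does not exist in this file: these are
         never reached from any top-level record, so they vanish from the output -->
    <xsl:text>&#10;orphans (parent irn not found): </xsl:text>
    <xsl:value-of select="count(tuple[tuple[@name='AssParentObjectRef']/atom[@name='irn']]
                                     [not(key('byIrn', tuple[@name='AssParentObjectRef']/atom[@name='irn']))])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

If the top-level count is much lower than you expect, or the orphan count is high, the problem lies in the parent references in the data rather than in the grouping logic.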
Your edit
You describe adding the following template:
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
This is the "identity" template, so called because it copies nodes through unchanged. The duplication you see is expected: without your /table template, the identity template applies templates to every tuple in the flat input, so each one is output at the top level, and the specifically matched tuple template then also applies templates to the tuples that match the key (the nested children), so those are output a second time inside their parents.
An alternative approach
The template for table is not too different:
<xsl:template match="table">
<xsl:copy>
<xsl:copy-of select="@*"/>
<!-- Target only parent-level `tuple`s. This excludes
child-level tuples, helping to prevent duplicates.-->
<xsl:apply-templates select="tuple[not(tuple/*)]" mode="top"/>
</xsl:copy>
</xsl:template>
Notably, we are not doing anything with <collection> here. We want to wrap each top-level tuple (and its children) in a <collection>, so we need to add <collection> at the tuple level.
<!-- Add the `collection` wrapper only to top-level tuples -->
<xsl:template match="tuple" mode="top">
<collection>
<!-- Pass on this tuple to the main `tuple` template -->
<xsl:apply-templates select="."/>
</collection>
</xsl:template>
I'm matching tuple, but using a special mode, because we only want to add a <collection> wrapper for the top-level tuples. Then I have a separate template that handles all tuples:
<!-- This is the main template for processing `tuple` elements.
Most of the changes needed are common to all `tuple`s, so it
makes sense to keep all the logic in one place. -->
<xsl:template match="tuple">
<tuple>
<!-- Copy each existing `atom` child -->
<xsl:copy-of select="atom"/>
<!-- Add in metadata about parent and top-level ancestor titles and IDs -->
<xsl:choose>
<!-- If this is a top-level item, just use its own values -->
<xsl:when test="not(tuple/*)">
<atom name="TopTitle"><xsl:value-of select="atom[@name='EADUnitTitle']"/></atom>
<atom name="TopID"><xsl:value-of select="atom[@name='irn']"/></atom>
</xsl:when>
<!-- If this is a descendant, we need to find its parent and its top-level ancestor -->
<xsl:when test="tuple/*">
<atom name="ParentTitle"><xsl:value-of select="tuple/atom[@name='EADUnitTitle']"/></atom>
<atom name="ParentID"><xsl:value-of select="tuple/atom[@name='irn']"/></atom>
<!-- For convenience, grab the top-level ancestor `tuple` and store it in a variable.
This is vaguely analogous to your use of `key`. -->
<!-- Finding the top-level `tuple` is complicated by the fact that the ID values in
`<atom name="EADUnitID">` do not have a standardized format, other than that the whole
strings appear to consist of chunks separated by single periods, with
descendant values appending to the ancestor's value. Examples:
Top: `da.04`
Descendant: `da.04.11.02`
Top: `D42.PL.05`
Descendant: `D42.PL.05.01`
So chunking the ID values is a problematic approach, since we don't know how many
chunks comprise the initial non-numeric portion: `da`, or `D42.PL`, or ... ???.
Top-level elements *do* also have empty `<tuple name="AssParentObjectRef">` elements.
So we _can_ find all the top-level elements, and then look in those for the one whose
`EADUnitID` value matches the start of the `EADUnitID` value of this current `tuple`.
The first predicate below selects every `tuple` with an empty `AssParentObjectRef`;
the second keeps those whose `EADUnitID` is a prefix of the current one.
Caveat: a plain prefix test would also match e.g. `da.1` against `da.10.02`. -->
<xsl:variable name="top" select="/table/tuple[tuple[@name='AssParentObjectRef'][not(*)]]
[starts-with(current()/atom[@name='EADUnitID'], atom[@name='EADUnitID'])]"/>
<!-- Now we can reference that variable to get the top-level ancestor values -->
<atom name="TopTitle"><xsl:value-of select="$top/atom[@name='EADUnitTitle']"/></atom>
<atom name="TopID"><xsl:value-of select="$top/atom[@name='irn']"/></atom>
</xsl:when>
</xsl:choose>
<!-- Process any children of this tuple, based on `irn` values.
Basically, we look for any other `tuple`s in the `table`
that point to this current `tuple` `irn` value. -->
<xsl:if test="/table/tuple[tuple/atom[@name='irn'] = current()/atom[@name='irn']]">
<tuple name="children">
<xsl:apply-templates select="/table/tuple[tuple/atom[@name='irn'] = current()/atom[@name='irn']]"></xsl:apply-templates>
</tuple>
</xsl:if>
</tuple>
</xsl:template>
I cannot speak to your complete dataset of 3524 records, but running the above against your sample XML input gives output identical to your desired output (except for one error in your sample input, a parent irn reference, that was mentioned in a comment on your original post).
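If matching on the EADUnitID prefix turns out to be too fragile for the full dataset, another option is to carry the top-level title and ID down the recursion as template parameters, so no string matching is needed at all. This is a minimal XSLT 1.0 sketch and not a drop-in replacement: the parameter names topTitle and topID are my own, and it assumes every record is a direct child of the root table element and that the parent irn references are correct.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index each record by the irn of the parent it references -->
  <xsl:key name="child" match="/table/tuple"
           use="tuple[@name='AssParentObjectRef']/atom[@name='irn']"/>

  <xsl:template match="/table">
    <table name="{@name}">
      <!-- One collection per record that references no parent irn -->
      <xsl:for-each select="tuple[not(tuple[@name='AssParentObjectRef']/atom[@name='irn'])]">
        <collection>
          <xsl:apply-templates select=".">
            <!-- Hypothetical parameter names; the top record passes its own values -->
            <xsl:with-param name="topTitle" select="atom[@name='EADUnitTitle']"/>
            <xsl:with-param name="topID" select="atom[@name='irn']"/>
          </xsl:apply-templates>
        </collection>
      </xsl:for-each>
    </table>
  </xsl:template>

  <xsl:template match="/table/tuple">
    <xsl:param name="topTitle"/>
    <xsl:param name="topID"/>
    <xsl:variable name="parent" select="tuple[@name='AssParentObjectRef']"/>
    <tuple>
      <xsl:copy-of select="atom"/>
      <!-- Only records that actually reference a parent get Parent* atoms -->
      <xsl:if test="$parent/atom[@name='irn']">
        <atom name="ParentTitle"><xsl:value-of select="$parent/atom[@name='EADUnitTitle']"/></atom>
        <atom name="ParentID"><xsl:value-of select="$parent/atom[@name='irn']"/></atom>
      </xsl:if>
      <atom name="TopTitle"><xsl:value-of select="$topTitle"/></atom>
      <atom name="TopID"><xsl:value-of select="$topID"/></atom>
      <!-- Recurse into the children, handing the top-level values down unchanged -->
      <xsl:variable name="children" select="key('child', atom[@name='irn'])"/>
      <xsl:if test="$children">
        <tuple name="children">
          <xsl:apply-templates select="$children">
            <xsl:with-param name="topTitle" select="$topTitle"/>
            <xsl:with-param name="topID" select="$topID"/>
          </xsl:apply-templates>
        </tuple>
      </xsl:if>
    </tuple>
  </xsl:template>
</xsl:stylesheet>
```

Because the nesting is driven purely by the parent irn references, a wrong reference in the input puts the record under the wrong parent, and a record whose parent irn does not exist in the file is never output.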
Suggestions for the desired XML output
As a data format, the desired output XML has a number of aspects that strike me as a bit odd.
- The <collection> wrapper: This seems like overkill; it is easy enough to tell whether a given tuple is 1) top level and 2) has children.
- The Parent and Top titles and IDs: These seem like overkill too. As long as your data is structured hierarchically, all of this is evident without any need to include it explicitly. Including this metadata simply duplicates information you already have.
- The <tuple name="children"> wrapper: This can also be omitted without losing information. The mere presence of one tuple element nested within another is enough to indicate children.
I don't know whether you have any control or influence over the design of the output XML format, but if you do, I would suggest a shorter and cleaner structure such as:
<table name="ecatalogue">
<tuple>
<atom name="irn">2470</atom>
<atom name="EADUnitID">da.01</atom>
<atom name="EADUnitTitle">Some title</atom>
<tuple>
<atom name="irn">5416</atom>
<atom name="EADUnitID">da.01.01</atom>
<atom name="EADUnitTitle">Child of Some title</atom>
</tuple>
<tuple>
<atom name="irn">6</atom>
<atom name="EADUnitID">da.01.02</atom>
<atom name="EADUnitTitle">Child of Some title 2</atom>
<tuple>
<atom name="irn">8</atom>
<atom name="EADUnitID">da.01.02.01</atom>
<atom name="EADUnitTitle">3rd Generation</atom>
</tuple>
</tuple>
</tuple>
</table>
In this structure, a tuple can only contain atoms or other tuples. We can identify a collection by simply finding any top-level tuple that contains other tuples. We can identify children by simply finding any tuple that has a parent tuple. We can find the top-level and parent-level titles and IDs by simply selecting a tuple and looking further up the element tree.
This structure is simpler, avoids data duplication, and is arguably clearer and easier to process. However, you know your needs! Do what works for you. :)
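To illustrate how the parent and top-level values could be recovered from that simpler structure, here is a small sketch that lists each record with its parent irn and top-level irn as plain text. It assumes the nested format shown above, where every tuple is a record; if your real data keeps other nested tuple elements (such as those inside EADExtent_tab), the //tuple selection would need narrowing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//tuple">
      <xsl:value-of select="atom[@name='irn']"/>
      <!-- The parent is simply the enclosing tuple, if there is one -->
      <xsl:text> parent=</xsl:text>
      <xsl:value-of select="parent::tuple/atom[@name='irn']"/>
      <!-- ancestor-or-self is a reverse axis, so [last()] picks the outermost tuple -->
      <xsl:text> top=</xsl:text>
      <xsl:value-of select="ancestor-or-self::tuple[last()]/atom[@name='irn']"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For the nested sample above, the record with irn 8 should come out with parent=6 and top=2470, and the top-level record 2470 with an empty parent and top=2470.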
Please go through the code and comments, and let me know if you have any lingering questions.
Update 2017-06-15: issues
I added the problematic tuples to your earlier XML data sample and tried applying my previous XSL code. Looking at the problematic tuples brought two things to light:
- My previous code used the tokenize function, which is only available in XSLT 2.0 and later. Your post's tags didn't specify XSLT 1.0, and I hadn't noticed that you declare it in the header of your example XSL. I reworked the XSL code above (the template that matches tuple) to rely only on XSLT 1.0 features.
- The EADUnitID values in the original XML input sample are not representative, so any attempt to code specifically for that sample will fail when applied to your full, unseen input XML. Your original sample includes only EADUnitID values of the form da.XX, where XX are digits and the .XX pattern can be repeated. I had made some assumptions about how to chunk this string and compare the pieces. However, your problematic tuples have EADUnitID values in a very different format, which looks like DXX.PL.XX, where again XX are digits and the .XX pattern at the end may be repeated. This means that relying on chunk matching between periods is not a safe approach. I reworked the XSL code to match on the start of the whole EADUnitID string instead, which works on both formats.
Take a look at the code and comments, and let me know if anything remains unclear or non-functional.