Perl libXML find node by attribute value
I have a very large XML document that I am iterating over. The XML uses mostly attributes, not node values. I may need to find many nodes in the file to collect one group of information; they are related to each other through different ref attribute values. Currently, every time I need to find one of the nodes to retrieve data, I iterate over the entire XML and compare the attribute to find the correct node. Is there a more efficient way to select the node with a given attribute value, instead of a brute-force loop and comparison? My current code is so slow that it is almost useless.
I am currently doing something like this many times in the same file for many different nodes and attribute combinations.
my $searchID = "1234";
foreach my $nodes ($xc->findnodes('/plm:PLMXML/plm:ExternalFile')) {
my $ID = $nodes->findvalue('@id');
my $File = $nodes->findvalue('@locationRef');
if ( $searchID eq $ID ) {
print "The File Name = $File\n";
}
}
In the above example, I am looping and using an "if" comparison to match ids. I was hoping I could do something like the code below to match a node by attribute value directly. Would that be more efficient than a loop?
my $searchID = "1234";
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
my $File = $nodes->findvalue('@locationRef');
print "The File Name = $File\n";
Make a single pass over the document to extract the information you need into a more convenient format, or to build an index.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
$nodes_by_id{ $node->getAttribute('id') } = $node;
}
Then each of your loops becomes a simple hash lookup:
my $node = $nodes_by_id{'1234'};
(And use getAttribute instead of findvalue.)
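As a minimal sketch of how such an index might be used to follow references between nodes: the sample document, the placeholder namespace URI, and the memberRefs attribute with its "#id" format are assumptions made for illustration, not taken from the real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample data only: the namespace URI and the memberRefs attribute
# (with its "#id" reference format) are hypothetical.
my $xml = <<'XML';
<PLMXML xmlns="http://example.com/plm">
  <ExternalFile id="1234" locationRef="part_a.jt"/>
  <ExternalFile id="5678" locationRef="part_b.pdf"/>
  <DataSet id="9012" memberRefs="#1234 #5678"/>
</PLMXML>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xc  = XML::LibXML::XPathContext->new($doc);

# One pass over the document: index every element that has an id attribute.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
    $nodes_by_id{ $node->getAttribute('id') } = $node;
}

# Follow the references with hash lookups instead of rescanning the tree.
my $set = $nodes_by_id{'9012'};
my @files;
for my $ref (split ' ', $set->getAttribute('memberRefs')) {
    (my $id = $ref) =~ s/^#//;               # strip the leading "#"
    my $target = $nodes_by_id{$id} or next;  # skip dangling references
    push @files, $target->getAttribute('locationRef');
}
print "The File Name = $_\n" for @files;
```

The tree is walked once, and every later lookup costs a hash access rather than another XPath evaluation over the whole document.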
If you will be doing this for a lot of identifiers, then ikegami's answer is worth reading.
I was hoping I could do something like this below to just match node by attribute
...
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
Sort of.
For a given id, yes, you can do
$nodes = $xc->findnodes("/plm:PLMXML/plm:ExternalFile[\@id=$searchID]");
... provided that $searchID is known to be numeric. Note that double quotes in Perl mean that variables are interpolated, so you must escape @id, because it is part of the literal string and not a Perl array. The value of $searchID, on the other hand, should become part of the XPath string, so it is not escaped.
Note that because you are calling it in scalar context here, the result is an XML::LibXML::NodeList object, not an actual node and not an array ref. For an array ref, use square brackets instead of parentheses, as in the example below.
Alternatively, if your search ID might not be numeric, but you know for sure that it is safe to embed in an XPath string (for example, it contains no quotes), you can do the following:
$nodes = [ $xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]') ];
print $nodes->[0]->getAttribute('locationRef'); # if you're 100% sure it exists
Note that the resulting XPath string encloses the value in quotes.
Finally, you can skip straight to the attribute value:
print $xc->findvalue('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]/@locationRef');
... if you know there is only one node with this id.
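If the ID cannot be trusted to be free of quotes, one option is to build the XPath string literal with a small helper. The sketch below is an assumption-laden illustration: xpath_literal is a hypothetical function, not part of XML::LibXML, and the sample ID is made up. It relies on the XPath 1.0 rule that a string literal may be delimited by either kind of quote, and falls back to concat() when the value contains both.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::LibXML): render a Perl string
# as an XPath 1.0 string literal.
sub xpath_literal {
    my ($s) = @_;
    return "'$s'"   if $s !~ /'/;   # no single quote inside: wrap in single quotes
    return qq{"$s"} if $s !~ /"/;   # no double quote inside: wrap in double quotes
    # Both quote characters occur: quote the pieces between single quotes
    # and rejoin them with a literal "'" through concat().
    my @parts = map { "'$_'" } split /'/, $s, -1;
    return 'concat(' . join(q{, "'", }, @parts) . ')';
}

my $searchID = q{it's "1234"};   # made-up ID containing both quote characters
my $xpath = '/plm:PLMXML/plm:ExternalFile[@id='
          . xpath_literal($searchID)
          . ']/@locationRef';
print "$xpath\n";
```

The resulting string can then be passed to findnodes or findvalue in the same way as the hand-built expressions above.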
If you have a DTD for your document that declares the id attribute as type ID, and you make sure the DTD is read when the document is parsed, you can efficiently look up the element with a given ID via $doc->getElementById($id).
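A minimal sketch of the DTD approach, assuming an internal DTD subset and a document without namespaces; both are simplifications for illustration, and the sample data is invented. Note that values of type ID must be XML names, so a purely numeric id such as "1234" is not a valid ID value; the sample uses "id1234" instead.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document with an internal DTD subset that declares
# ExternalFile/@id as type ID. The data is invented for illustration.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE PLMXML [
  <!ATTLIST ExternalFile id ID #REQUIRED>
]>
<PLMXML>
  <ExternalFile id="id1234" locationRef="part_a.jt"/>
  <ExternalFile id="id5678" locationRef="part_b.pdf"/>
</PLMXML>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Direct lookup through the parser's ID table; no XPath evaluation needed.
my $node = $doc->getElementById('id5678');
if ($node) {
    print "The File Name = ", $node->getAttribute('locationRef'), "\n";
}
else {
    print "No element with that ID was registered\n";
}
```

If the DTD lives in an external file instead, the parser has to be told to load it, otherwise no IDs are registered and the lookup returns nothing.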
I think you just need to learn a little bit about XPath expressions. For example, you can do something like this:
my $search_id = "1234";
my $query = "/plm:PLMXML/plm:ExternalFile[\@id = '$search_id']";
foreach my $node ($xc->findnodes($query)) {
# ...
}
In an XPath expression, you can also combine multiple attribute checks, for example:
[@id = '$search_id' and contains(@pathname, '.pdf')]
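A runnable sketch of such a combined predicate, under stated assumptions: the sample document, its placeholder namespace URI, and the pathname values are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample data only: the namespace URI and attribute values are made up.
my $xml = <<'XML';
<PLMXML xmlns="http://example.com/plm">
  <ExternalFile id="1234" pathname="docs/spec.pdf"/>
  <ExternalFile id="5678" pathname="models/part.jt"/>
</PLMXML>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xc  = XML::LibXML::XPathContext->new($doc);
$xc->registerNs(plm => 'http://example.com/plm');  # bind the plm: prefix

my $search_id = '1234';

# Inside a double-quoted Perl string, @ must be escaped so that Perl
# does not try to interpolate an array.
my $query = "/plm:PLMXML/plm:ExternalFile"
          . "[\@id = '$search_id' and contains(\@pathname, '.pdf')]";

my @matches = $xc->findnodes($query);
print $_->getAttribute('pathname'), "\n" for @matches;
```

Both conditions are evaluated by the XPath engine, so only nodes that satisfy them are handed back to Perl.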
An XPath tutorial will help a lot.