XSLT XML search using regex, word boundaries
Can I use a regular expression to search XML content using XSLT? I can search for nodes using contains(), but I need word boundaries (for example /\bmy phrase\b/i) to search for a phrase, not just a single word.
When searching for "blood pressure" using the following, all nodes with "blood", "pressure" and "blood pressure" are returned.
I want to return only the nodes containing "blood pressure". Using PHP preg_match I can achieve this using: /\b$keywords\b/i
<xsl:template match="//item">
<xsl:choose>
<xsl:when test="contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword) or contains(translate(content, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword)">
<item>
<title><xsl:value-of select="title"/></title>
<content><xsl:value-of select="content"/></content>
<date><xsl:value-of select="date"/></date>
<author><xsl:value-of select="author"/></author>
</item>
</xsl:when>
</xsl:choose>
</xsl:template>
I. You can do something like this in XSLT 2.0:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="s">
<xsl:variable name="vWords" select=
"tokenize(lower-case(string(.)),
'[\s.?!,;—:\-]+'
) [.]
"/>
<xsl:sequence select=
" for $current in .,
$i in 1 to count($vWords)
return
if($vWords[$i] eq 'blood'
and
$vWords[$i+1] eq 'pressure'
)
then .
else ()
"/>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied to the following XML document (no such document was provided in the question):
<t>
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
<s>He was coldblood Pressured.</s>
</t>
the desired, correct result is produced (only the elements containing "blood" and "pressure", case-insensitively and as two adjacent words):
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
Explanation
- Use lower-case() and tokenize() to split the lower-cased string value on runs of whitespace and punctuation; the [.] predicate drops any empty tokens.
- Iterate over the result of tokenize() to find a word "blood" immediately followed by a word "pressure".
II. XSLT 1.0 solution:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vUpper" select=
"'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="vLower" select=
"'abcdefghijklmnopqrstuvwxyz'"/>
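<!-- A long run of spaces: it must be at least as long as the list of non-alphabetic characters being replaced -->
<xsl:variable name="vSpaaaceeees" select=
"'                                                            '
"/>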
<xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>
<xsl:template match="s">
<xsl:variable name="vallLower" select="translate(., $vUpper, $vLower)"/>
<xsl:copy-of select=
"self::*
[contains
(concat
(' ',
normalize-space
(translate($vallLower, translate($vallLower, $vAlpha, ''), $vSpaaaceeees)),
' '
),
' blood pressure '
)
]
"/>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied to the same XML document (above), the same correct result is produced:
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
Explanation
- Convert to lowercase.
- Use the double-translate method to replace any non-alphabetic character with a space.
- Then use normalize-space() to replace any group of contiguous spaces with a single space.
- Then surround this result with spaces.
- Finally, check whether the result contains the string " blood pressure ".
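The same steps can be applied to the item elements and the $keyword parameter from the question. The following is only a sketch under some assumptions: the parameter name keyword and the variables vPunct, vSpaces, vNeedle, vText and vWords are illustrative, the keyword is assumed to be a plain phrase, and a fixed punctuation list is used instead of the double-translate method (a deliberate swap, so that the padding string only has to be as long as that list, however long the searched text is). Characters missing from vPunct, such as quotes, are not treated as word separators.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumed parameter: the phrase to search for -->
 <xsl:param name="keyword" select="'blood pressure'"/>

 <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
 <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>

 <!-- Punctuation treated as word separators (illustrative list) -->
 <xsl:variable name="vPunct" select="'.,;:!?()[]{}-/'"/>
 <!-- Must contain at least as many spaces as vPunct has characters -->
 <xsl:variable name="vSpaces" select="'                    '"/>

 <!-- Lower-cased keyword, surrounded by single spaces -->
 <xsl:variable name="vNeedle" select=
  "concat(' ',
          normalize-space(translate($keyword, $vUpper, $vLower)),
          ' ')"/>

 <xsl:template match="item">
  <!-- Lower-case the title and content together -->
  <xsl:variable name="vText" select=
   "translate(concat(title, ' ', content), $vUpper, $vLower)"/>
  <!-- Punctuation to spaces, collapse whitespace, pad with spaces -->
  <xsl:variable name="vWords" select=
   "concat(' ',
           normalize-space(translate($vText, $vPunct, $vSpaces)),
           ' ')"/>
  <xsl:if test="contains($vWords, $vNeedle)">
   <xsl:copy-of select="."/>
  </xsl:if>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>
```

Because both the text and the keyword are padded with spaces, "coldblood Pressured" does not match " blood pressure ", while "Blood" and "Pressure" separated by a line break does.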
XSLT and XPath 2.0 have a matches() function that supports regular expressions. XSLT and XPath 1.0 have no such function, so you will need to use an extension function supported by your XSLT processor: http://www.exslt.org/regexp/functions/match/index.html. However, even with XSLT/XPath 2.0, I think the supported regex language does not support the "word boundary" pattern.
http://www.w3.org/TR/xslt20/#regular-expressions
"The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators] (see Section 7.6.1 Regular Expression Syntax), which itself is based on the syntax defined in [XML Schema Part 2]."
The first link in the quote shows that \b is absent.
The same goes for the second link (see its Single Character Escape production).
But if we scroll a little further in the latter document, we find character classes (Category Escape). A combination of the punctuation and separator (space) classes, [\p{P}\p{Z}], can be used to achieve a similar effect.
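As a rough sketch of how that could look in XSLT 2.0 with matches() and the 'i' flag: the pattern below treats the start or end of the string, or any punctuation, separator or whitespace character, as a stand-in for \b. The parameter name keyword and the variable vPattern are illustrative assumptions, the keyword is assumed to contain no regex metacharacters, and \s is added to the class because \p{Z} does not cover line breaks or tabs.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumed parameter: a plain phrase without regex metacharacters -->
 <xsl:param name="keyword" select="'blood pressure'"/>

 <!-- Emulated word boundaries around the phrase; spaces inside
      the phrase are allowed to match any run of whitespace -->
 <xsl:variable name="vPattern" select=
  "concat('(^|[\p{P}\p{Z}\s])',
          replace(normalize-space($keyword), ' ', '\\s+'),
          '([\p{P}\p{Z}\s]|$)')"/>

 <xsl:template match="s">
  <!-- The 'i' flag makes the match case-insensitive -->
  <xsl:if test="matches(., $vPattern, 'i')">
   <xsl:copy-of select="."/>
  </xsl:if>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>
```

Applied to the sample document from the first answer, this should select the first three s elements and skip "He was coldblood Pressured.", since "blood" there is preceded by a letter and "Pressure" is followed by one.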