XSLT XML search using regex, word boundaries
Can I use a regular expression to search XML content using XSLT? I can search for nodes using contains(), but I need word boundaries (for example /\bmy phrase\b/i) so that I can search for a phrase, not just a single word.
When I search for "blood pressure" using the following, all nodes containing "blood", "pressure" or "blood pressure" are returned.
I want to return only the nodes containing the phrase "blood pressure". With PHP's preg_match I can achieve this using: /\b$keywords\b/i
<xsl:template match="//item">
  <xsl:choose>
    <xsl:when test="contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword) or contains(translate(content, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword)">
      <title><xsl:value-of select="title"/></title>
      <content><xsl:value-of select="content"/></content>
      <date><xsl:value-of select="date"/></date>
      <author><xsl:value-of select="author"/></author>
    </xsl:when>
  </xsl:choose>
</xsl:template>
I. You can do something like this in XSLT 2.0:
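<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="s">
  <!-- Assumed reconstruction of the tokenizing step: lower-case the text,
       split on runs of non-letters; [.] drops empty tokens -->
  <xsl:variable name="vWords" select=
  "tokenize(lower-case(.), '[^a-z]+'
           ) [.]
  "/>
  <xsl:sequence select=
   " for $current in .,
         $i in 1 to count($vWords)
      return
        if($vWords[$i] eq 'blood'
          and
           $vWords[$i+1] eq 'pressure'
           )
         then .
         else ()
   "/>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>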
When this XSLT 2.0 transformation is applied to the following XML document (no such document is provided in the question!):
<t>
 <s>He has high blood pressure.</s>
 <s>He has high Blood Pressure.</s>
 <s>He has high Blood
    Pressure.</s>
 <s>He was coldblood Pressured.</s>
</t>
the desired, correct result is produced (only the elements containing "blood" and "pressure", case-insensitively and as two adjacent words):
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
    Pressure.</s>
Use the tokenize() function to split the text into words on runs of non-alphabetic characters, treating the text case-insensitively and across line breaks.
Iterate through the resulting sequence of words to find a word "blood" immediately followed by a word "pressure".
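The stylesheet above hard-codes the two words. The following is only a sketch of how the same token-comparison idea might be generalized to an arbitrary phrase. It assumes the phrase arrives in a stylesheet parameter named keyword (as in the question) and that the data uses the question's item, title and content elements, each item having at most one title and one content. The f: namespace URI and the helper functions f:words and f:has-phrase are hypothetical names introduced for illustration.

```xslt
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:f="urn:example:functions"
 exclude-result-prefixes="xs f">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumed parameter name, taken from the question's $keyword -->
 <xsl:param name="keyword" select="'blood pressure'"/>

 <!-- Hypothetical helper: lower-case, then split on runs of non-letters;
      [.] drops empty tokens -->
 <xsl:function name="f:words" as="xs:string*">
  <xsl:param name="pText" as="xs:string?"/>
  <xsl:sequence select="tokenize(lower-case($pText), '[^a-z]+')[.]"/>
 </xsl:function>

 <!-- Hypothetical helper: true if the words of $pPhrase occur in $pText
      as a run of adjacent whole words -->
 <xsl:function name="f:has-phrase" as="xs:boolean">
  <xsl:param name="pText" as="xs:string?"/>
  <xsl:param name="pPhrase" as="xs:string"/>
  <xsl:variable name="vWords" select="f:words($pText)"/>
  <xsl:variable name="vKey" select="f:words($pPhrase)"/>
  <xsl:variable name="vN" select="count($vKey)"/>
  <!-- Slide a window of $vN words over the text and compare it to the phrase -->
  <xsl:sequence select=
   "some $i in 1 to count($vWords) - $vN + 1
      satisfies deep-equal(subsequence($vWords, $i, $vN), $vKey)"/>
 </xsl:function>

 <xsl:template match="item">
  <xsl:if test="f:has-phrase(title, $keyword) or f:has-phrase(content, $keyword)">
   <xsl:copy-of select="."/>
  </xsl:if>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>
```

Note that an empty keyword makes every item match, and that '[^a-z]+' treats non-ASCII letters as separators; '\P{L}+' may be a better splitter for such text.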
II. XSLT 1.0 solution:
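<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vUpper" select=
  "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
 <xsl:variable name="vLower" select=
  "'abcdefghijklmnopqrstuvwxyz'"/>
 <!-- Must be at least as long as the run of non-letters in any s element -->
 <xsl:variable name="vSpaaaceeees" select=
  "'                                                            '
  "/>
 <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>

 <xsl:template match="s">
  <xsl:variable name="vallLower" select="translate(., $vUpper, $vLower)"/>
  <xsl:copy-of select=
   "self::*
     [contains(concat
         (' ',
          normalize-space
           (translate($vallLower, translate($vallLower, $vAlpha, ''), $vSpaaaceeees)),
          ' '
          ),
        ' blood pressure '
        )
     ]
   "/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>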
When this transformation is applied to the same XML document (above), the same correct result is produced:
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
    Pressure.</s>
Convert the text to lowercase.
Use the double-translate method to replace every non-alphabetic character with a space.
Then use normalize-space() to replace each group of contiguous spaces with a single space.
Then surround this result with spaces.
Finally, check whether the result contains the string " blood pressure ".
XSLT and XPath 2.0 have a matches() function that supports regular expressions. XSLT and XPath 1.0 have no such function, so you will need to use an extension function supported by your XSLT processor, such as EXSLT's regexp:match: http://www.exslt.org/regexp/functions/match/index.html. However, even with XSLT/XPath 2.0, I think the supported regex language does not support the "word boundary" pattern.
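As a rough sketch of the extension-function route in XSLT 1.0: the same EXSLT regular-expressions module also defines regexp:test(), which returns a boolean and so fits a test attribute directly. The example assumes that the processor actually implements this module (many do not, hence the function-available() guard), that its regex dialect accepts \b, and that $keyword contains no regex metacharacters. The item, title and content names come from the question; vRegex is a hypothetical variable name.

```xslt
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:regexp="http://exslt.org/regular-expressions"
 exclude-result-prefixes="regexp">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumed parameter name, taken from the question's $keyword -->
 <xsl:param name="keyword" select="'blood pressure'"/>

 <!-- \b on both sides restricts the match to whole words
      (assumes the processor's regex dialect supports \b) -->
 <xsl:variable name="vRegex" select="concat('\b', $keyword, '\b')"/>

 <xsl:template match="item">
  <xsl:choose>
   <!-- Fail loudly if the extension function is missing -->
   <xsl:when test="not(function-available('regexp:test'))">
    <xsl:message terminate="yes">regexp:test() is not available in this processor</xsl:message>
   </xsl:when>
   <!-- The third argument 'i' requests case-insensitive matching -->
   <xsl:when test="regexp:test(title, $vRegex, 'i') or regexp:test(content, $vRegex, 'i')">
    <xsl:copy-of select="."/>
   </xsl:when>
  </xsl:choose>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>
```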
The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators] (see Section 7.6.1 Regular Expression Syntax), which itself is based on the syntax defined in [XML Schema Part 2].
The first link in the quote shows that \b is absent. The same goes for the second link (Single Escape character). But if we scroll down a little in that last document, we find the character class escapes (Category Escape). A combination of the punctuation and space classes, [\p{P}\p{Z}], can be used to achieve a similar effect.
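A minimal sketch of how that class could stand in for \b in XSLT 2.0, applied to the question's item, title and content elements. It assumes the phrase is passed in a parameter named keyword (as in the question) and contains no regex metacharacters; vEdge, vPhrase and vRegex are hypothetical variable names. \s is added to the class because \p{Z} does not cover tabs or line breaks.

```xslt
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumed parameter name (the question's $keyword); assumed to hold
      no regex metacharacters -->
 <xsl:param name="keyword" select="'blood pressure'"/>

 <!-- Stand-in for \b: punctuation, separators, or other whitespace -->
 <xsl:variable name="vEdge" select="'[\p{P}\p{Z}\s]'"/>

 <!-- Let each space in the phrase match any run of whitespace in the text;
      '\\s+' in the replacement string yields the regex fragment \s+ -->
 <xsl:variable name="vPhrase"
      select="replace(normalize-space($keyword), ' ', '\\s+')"/>

 <!-- The phrase must sit between string start/end or edge characters -->
 <xsl:variable name="vRegex" select=
  "concat('(^|', $vEdge, ')', $vPhrase, '(', $vEdge, '|$)')"/>

 <xsl:template match="item">
  <!-- The 'i' flag makes the match case-insensitive -->
  <xsl:if test="matches(title, $vRegex, 'i') or matches(content, $vRegex, 'i')">
   <xsl:copy-of select="."/>
  </xsl:if>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>
```

With the default keyword the pattern becomes (^|[\p{P}\p{Z}\s])blood\s+pressure([\p{P}\p{Z}\s]|$), so "coldblood Pressured" is rejected because "blood" is preceded by a letter. Symbols such as + or $ are in category S, not P, so they do not count as edges here.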