XSLT XML search using regex, word boundaries

Question

Can I use a regular expression to search XML content using XSLT? I can search for nodes using contains, but I need word boundaries (for example /\bmy phrase\b/i) to search for a phrase, not just a single word.

When searching for "blood pressure" using the following, all nodes with "blood", "pressure" and "blood pressure" are returned.

I want to return only the nodes containing the phrase "blood pressure". With PHP's preg_match I can achieve this using: /\b$keywords\b/i

<xsl:template match="//item">
    <xsl:choose>
        <xsl:when test="contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword) or contains(translate(content, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword)">
            <item>
                <title><xsl:value-of select="title"/></title>
                <content><xsl:value-of select="content"/></content>
                <date><xsl:value-of select="date"/></date>
                <author><xsl:value-of select="author"/></author>
            </item>
        </xsl:when>
    </xsl:choose>
</xsl:template>

+3

xml php xslt

rossjha 11 Mar 12 at 15:08

3 answers

XSLT and XPath 2.0 have a matches function that supports regular expressions. XSLT and XPath 1.0 have no such function, so there you will need to use an extension function supported by your XSLT processor: http://www.exslt.org/regexp/functions/match/index.html. However, even with XSLT/XPath 2.0, I think the supported regex language does not support the "word boundary" pattern.

0

Martin Honnen 11 Mar 12 at 15:21

http://www.w3.org/TR/xslt20/#regular-expressions

The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators] (see Section 7.6.1 Regular Expression Syntax), which itself is based on the syntax defined in [XML Schema Part 2].

The first link from the quote shows that \b is absent.

The same goes for the second link (see Single Character Escape).

But if we scroll down a little in the latter document, we find character classes (Category Escape). We can use a combination of the punctuation and space classes, [\p{P}\p{Z}], to achieve a similar effect.
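As a rough illustration of that idea, here is a minimal XSLT 2.0 sketch that emulates \b with the [\p{P}\p{Z}] class. It assumes the item, title and content elements from the question, and a hypothetical keyword parameter holding a plain phrase with no regex metacharacters. Because \p{Z} does not cover tabs or line breaks, the text is passed through normalize-space() first so that all whitespace becomes single spaces.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumption: a plain phrase without regex metacharacters -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <!-- Emulated word boundaries: the phrase must follow the start of the
       string or a punctuation/separator character, and must be followed
       by a punctuation/separator character or the end of the string -->
  <xsl:variable name="vPattern"
      select="concat('(^|[\p{P}\p{Z}])', $keyword, '([\p{P}\p{Z}]|$)')"/>

  <xsl:template match="item">
    <!-- The 'i' flag makes the match case-insensitive -->
    <xsl:if test="matches(normalize-space(title), $vPattern, 'i')
                  or matches(normalize-space(content), $vPattern, 'i')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <!-- Suppress text that is not inside a matching item -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

With this pattern, "high blood pressure." should match while "coldblood pressured" should not, since neither end of the phrase sits next to a punctuation or separator character there.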

0

kirilloid 11 Mar 12 at 15:36

Dimitre Novatchev · Accepted Answer · 2012-03-11T16:07:15+0000

I. You can do something like this in XSLT 2.0:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="s">
  <xsl:variable name="vWords" select=
  "tokenize(lower-case(string(.)),
            '[\s.?!,;—:\-]+'
            ) [.]
  "/>
  <!-- Return the element once if any two adjacent words are 'blood', 'pressure' -->
  <xsl:sequence select=
   " if (some $i in 1 to count($vWords) - 1
           satisfies ($vWords[$i] eq 'blood'
                      and
                      $vWords[$i+1] eq 'pressure')
        )
        then .
        else ()
  "/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When this XSLT 2.0 transformation is applied to the following XML document (the question does not provide one):

<t>
 <s>He has high blood pressure.</s>
 <s>He has high Blood Pressure.</s>
 <s>He has high Blood
 Pressure.</s>

  <s>He was  coldblood Pressured.</s>

</t>

the desired, correct result is produced (only the elements containing "blood" and "pressure", case-insensitively and as two adjacent words):

<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
 Pressure.</s>

Explanation

Use the tokenize() function on the lower-cased string to split it into words at whitespace and punctuation characters, discarding any empty tokens.
Iterate through the result of tokenize() to find the word "blood" immediately followed by the word "pressure".
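The same tokenizing idea can be generalized from the hard-coded words to an arbitrary phrase. The following is only a sketch under stated assumptions: the keyword parameter, the my: namespace (urn:example:my) and the functions my:words and my:contains-phrase are hypothetical names introduced for illustration. The phrase is tokenized with the same separator pattern as the text, and each window of adjacent words is compared with it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my"
    exclude-result-prefixes="xs my">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumption: the phrase to search for is passed in as a parameter -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <!-- Hypothetical helper: lower-cased words of a string, empty tokens dropped -->
  <xsl:function name="my:words" as="xs:string*">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:sequence select=
      "tokenize(lower-case(string($pText)), '[\s.?!,;—:\-]+')[.]"/>
  </xsl:function>

  <!-- Hypothetical helper: true if the words of $pPhrase occur as
       adjacent words, in order, somewhere in $pText -->
  <xsl:function name="my:contains-phrase" as="xs:boolean">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pPhrase" as="xs:string"/>
    <xsl:variable name="vWords" select="my:words($pText)"/>
    <xsl:variable name="vPhrase" select="my:words($pPhrase)"/>
    <xsl:sequence select=
      "exists($vPhrase)
       and
       (some $i in 1 to count($vWords) - count($vPhrase) + 1
          satisfies deep-equal(subsequence($vWords, $i, count($vPhrase)),
                               $vPhrase))"/>
  </xsl:function>

  <xsl:template match="s">
    <xsl:if test="my:contains-phrase(., $keyword)">
      <xsl:sequence select="."/>
    </xsl:if>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Applied to the sample document above with the default parameter value, this should select the same three s elements. To use it on the question's data, the template could match item and test title and content instead.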

II. XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vUpper" select=
 "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

 <xsl:variable name="vLower" select=
 "'abcdefghijklmnopqrstuvwxyz'"/>

 <xsl:variable name="vSpaaaceeees" select=
 "'                                                                               '
 "/>

 <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>

 <xsl:template match="s">
   <xsl:variable name="vallLower" select="translate(., $vUpper, $vLower)"/>
     <xsl:copy-of select=
     "self::*
       [contains
        (concat
         (' ',
          normalize-space
           (translate($vallLower, translate($vallLower, $vAlpha, ''), $vSpaaaceeees)),
          ' '
          ),

         ' blood pressure '
         )
       ]
  "/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When this transformation is applied to the same XML document (above), the same correct result is obtained:

<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
 Pressure.</s>

Explanation

Convert to lowercase.
Use the double-translate method to replace any non-alphabetic character with a space.
Then use normalize-space() to replace any group of contiguous spaces with a single space.
Then surround this result with spaces.
Finally, check whether the result contains the string " blood pressure ".
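For the item elements in the question, a similar XSLT 1.0 check could look roughly like the sketch below. It is a variant, not the stylesheet above: instead of the double-translate step it replaces an explicit, assumed list of punctuation characters ($vPunct, which you would extend as needed) with spaces, and it takes the phrase from a hypothetical keyword parameter. The explicit list is used here because translate() drops, rather than replaces, any character whose position in its second argument lies beyond the length of the third, which can matter for long content.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumption: the phrase to search for is passed in as a parameter -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <!-- Assumption: the separators to treat as word breaks; the space
       string is at least as long as this list -->
  <xsl:variable name="vPunct" select="'.,;:!?()[]-'"/>
  <xsl:variable name="vSpaces" select="'                    '"/>

  <!-- Lower-cased phrase, wrapped in single spaces -->
  <xsl:variable name="vNeedle" select=
    "concat(' ',
            normalize-space(translate($keyword, $vUpper, $vLower)),
            ' ')"/>

  <xsl:template match="item">
    <!-- Lower-case, turn punctuation into spaces, collapse whitespace,
         then surround with spaces so the contains() test sees whole words -->
    <xsl:variable name="vTitle" select=
      "concat(' ',
              normalize-space(translate(translate(title, $vUpper, $vLower),
                                        $vPunct, $vSpaces)),
              ' ')"/>
    <xsl:variable name="vContent" select=
      "concat(' ',
              normalize-space(translate(translate(content, $vUpper, $vLower),
                                        $vPunct, $vSpaces)),
              ' ')"/>
    <xsl:if test="contains($vTitle, $vNeedle)
                  or contains($vContent, $vNeedle)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Title and content are tested separately so that a phrase cannot match across the boundary between the two fields.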
