Regular expression for invalid characters in XML

I am trying to find all invalid characters in an XML file. According to the W3C recommendation, these are the valid characters in XML:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

      

Converting it to decimal:

9
10
13
32-55295
57344-65533
65536-1114111

      

These are the valid XML characters.

I am trying to search in Notepad++ using an appropriate regex for invalid characters.

An excerpt from my XML:

        <custom-attribute attribute-id="isContendFeed">fal &#11; se</custom-attribute>
        <custom-attribute attribute-id="pageNoFollow">fal &#3; se</custom-attribute>
        <custom-attribute attribute-id="pageNoIndex">fal &#13; se</custom-attribute>
        <custom-attribute attribute-id="rrRecommendable">false</custom-attribute>

      

In the above example, I want my regular expression to find &#11; and &#3;, because they are forbidden in XML.

I cannot build a regex for this.

The regex I made for numeric ranges:

32-55295 : (3[2-9]|[4-9][0-9]|[1-9][0-9]{2,3}|[1-4][0-9]{4}|5[0-4][0-9]{3}|55[01][0-9]{2}|552[0-8][0-9]|5529[0-5])
57344-65533 : (5734[4-9]|573[5-9][0-9]|57[4-9][0-9]{2}|5[89][0-9]{3}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-3])
65536-1114111 : (6(5(5(3[6-9]|[4-9][0-9])|[6-9][0-9]{2})|[6-9][0-9]{3})|[7-9][0-9]{4}|[1-9][0-9]{5}|1(0[0-9]{5}|1(0[0-9]{4}|1([0-3][0-9]{3}|4(0[0-9]{2}|1(0[0-9]|1[01]))))))

      

Each regex works when used alone, but I can't get the combined regex to run.

Is there any way other than a regular expression to find invalid characters? If not, please help me construct a regex that finds the invalid characters present in my XML.

1 answer


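First, the literal text &#3; consists only of characters that are allowed in XML; what is not allowed (if the list is correct) is the character with ASCII code 3 itself. Hope I got it right. Note, though, that XML 1.0 also requires a character reference to refer to a valid character, so a conforming parser still rejects &#3;.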
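If the goal is to flag references such as &#11; and &#3; rather than raw characters, one option is to let the regex only locate the references and do the range check numerically. The following is a minimal Python sketch under that assumption; the names `VALID_RANGES`, `CHAR_REF`, and `find_invalid_char_refs` are made up for illustration, and the ranges are copied from the list in the question.

```python
import re

# Valid XML 1.0 code point ranges, copied from the list quoted in the question.
VALID_RANGES = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
                (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# Matches decimal (&#11;) and hexadecimal (&#xB;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml_char(code_point):
    """True if the code point falls inside one of the valid ranges."""
    return any(lo <= code_point <= hi for lo, hi in VALID_RANGES)


def find_invalid_char_refs(text):
    """Return (offset, reference) for each reference to a forbidden character."""
    hits = []
    for m in CHAR_REF.finditer(text):
        hex_digits, dec_digits = m.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_valid_xml_char(code_point):
            hits.append((m.start(), m.group(0)))
    return hits


sample = (
    '<custom-attribute attribute-id="isContendFeed">fal &#11; se</custom-attribute>\n'
    '<custom-attribute attribute-id="pageNoFollow">fal &#3; se</custom-attribute>\n'
    '<custom-attribute attribute-id="pageNoIndex">fal &#13; se</custom-attribute>\n'
)

for offset, ref in find_invalid_char_refs(sample):
    print(offset, ref)
```

On the excerpt from the question this reports &#11; and &#3; but not &#13;, since 13 (carriage return) is in the valid list.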

Second, most regex flavors let you specify characters with \x00 (two hex digits) and \u0000 (four hex digits). Some flavors also allow something like \x{...}, but this varies from flavor to flavor.

Let's start with

[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]

[^...] defines a negated character class, which can contain single characters and character ranges. Just fill it with all valid characters and ranges.

If your flavor understands \x{...}, it is easy to extend:

[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]
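As an illustration outside Notepad++, the same negated class can be tried in Python, whose `re` module works on code points and accepts `\xhh`, `\uhhhh`, and `\Uhhhhhhhh` escapes in string patterns instead of `\x{...}`. This is only a sketch; `INVALID_XML_CHAR` and `find_invalid_chars` are illustrative names, and the sample text is invented.

```python
import re

# Negated class listing every valid XML 1.0 character; whatever it matches is invalid.
# \U00010000-\U0010FFFF plays the role of \x{10000}-\x{10FFFF} in other flavors.
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Return (offset, hex code point) for each invalid character in text."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHAR.finditer(text)]


# Raw control characters (not character references) embedded in the text.
sample = "fal \x0b se, fal \x03 se, fal \r se, emoji \U0001F600 ok"
print(find_invalid_chars(sample))
```

Note that this finds raw characters only; a reference written out as text, such as &#3;, consists of valid characters and is not matched.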

      



Otherwise, you will have to match surrogate pairs explicitly: \x{10000} corresponds to \uD800\uDC00, and \x{10FFFF} corresponds to \uDBFF\uDFFF.

This cannot be done in a single character class. Not fun ;) It would be a kind of negated version of:

[\uD800-\uDBFF][\uDC00-\uDFFF]|
[\uD800-\uDBFF](?![\uDC00-\uDFFF])|
(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]

      

(from https://mathiasbynens.be/notes/javascript-unicode#matching-code-points )
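The correspondence between supplementary code points and surrogate pairs follows from the UTF-16 encoding rule. A small Python sketch of that arithmetic, with `to_surrogate_pair` as an illustrative helper name, shows where the two boundary pairs come from:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> \\u{high:04X}\\u{low:04X}")
```

Because every valid supplementary character is a high surrogate followed by a low surrogate, a flavor limited to \uXXXX has to treat that two-unit sequence as a unit, which is why the pattern needs alternation rather than one class.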
