Different behavior of the same regex in Python and Java

First, my apologies: I'm not very good with regexes.

I am using regex to match a string. I tested it in the Python command line interface, but when I ran it in Java, it gave a different result.

Python execution:

import re
re.search("[0-9]*[\\.[0-9]+]?[^0-9]*D\\([M|W]\\)\\s*US", "9.5 D(M) US")

      

gives the result as:

<_sre.SRE_Match object; span=(0, 11), match='9.5 D(M) US'>

      

But this Java code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTest {
    private static final Pattern FALLBACK_MEN_SIZE_PATTERN =
            Pattern.compile("[0-9]*[\\.[0-9]+]?[^0-9]*D\\([M|W]\\)\\s*US");

    public static void main(String[] args) {
        String strTest = "9.5 D(M) US";
        Matcher matcher = FALLBACK_MEN_SIZE_PATTERN.matcher(strTest);
        // find() looks for the first substring that matches the pattern
        if (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}

      

gives the result as:

5 D(M) US

I don’t understand why it behaves differently.



2 answers


Here's a pattern that will work the same in Java and Python:

"[0-9]*(?:\\.[0-9]+)?[^0-9]*D\\([MW]\\)\\s*US"

      

See Python and Java demos.
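As a quick check on the Python side, here is a minimal sketch that runs the corrected pattern against the sample string from the question. The names FIXED and fixed_match are made up for illustration, and a raw string is used so the backslashes do not need doubling.

```python
import re

# Corrected pattern from the answer, written as a raw string.
FIXED = r"[0-9]*(?:\.[0-9]+)?[^0-9]*D\([MW]\)\s*US"

fixed_match = re.search(FIXED, "9.5 D(M) US")
print(fixed_match.group(0))  # 9.5 D(M) US
```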



In Python, [\\.[0-9]+]? is read as 2 subpatterns: [\.[0-9]+ (1 or more .s, [s, or digits) and ]? (0 or 1 ]). See how your regular expression works in Python here. Or, in more detail with capture groups, here.
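One way to see this split in Python is to wrap each piece of the original pattern in a capture group and inspect what each one consumed. This is only a sketch: the groups and the names GROUPED and grouped_match are added here for inspection and are not part of the original pattern.

```python
import re

# Original pattern with each piece wrapped in a capture group:
# 1: [0-9]*   2: [\.[0-9]+   3: ]?   4: [^0-9]*
GROUPED = r"([0-9]*)([\.[0-9]+)(]?)([^0-9]*)D\([M|W]\)\s*US"

grouped_match = re.search(GROUPED, "9.5 D(M) US")
print(grouped_match.groups())  # ('9', '.5', '', ' ')
```

Group 2 takes ".5" because its class is repeated by the +, and group 3 (the optional literal ]) matches the empty string.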

In Java, it is read as a single character class (i.e., the inner [ and ] are not matched as literal characters; Java treats the nested brackets as part of the same class), so the entire subpattern stands for 0 or 1 digit, ., or +. Since it is optional and matches at most one character, it cannot consume .5 and ends up matching nothing here (you can get a visual hint in Visual Regex Tester: enter 123.+[] as the input and [\.[0-9]+]? as the regex).
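Python cannot run Java's parser, but the effect can be approximated by writing out by hand the class that Java ends up with. The sketch below assumes that class is equivalent to [.0-9+] (a dot, a digit, or a plus); that substitution is an assumption made for illustration, not something Python derives itself. With it, Python also returns a shortened match that starts at the 5.

```python
import re

# Hand-written stand-in for how Java reads [\.[0-9]+]? (an assumption):
# one optional character out of ".", 0-9, or "+".
JAVA_LIKE = r"[0-9]*[.0-9+]?[^0-9]*D\([M|W]\)\s*US"

java_like_match = re.search(JAVA_LIKE, "9.5 D(M) US")
print(java_like_match.group(0))  # 5 D(M) US
```

The match cannot start at 9, because after the optional class takes at most one character, [^0-9]* is still blocked by the digit 5.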

And one last touch: [M|W] means M, |, or W, whereas I think you meant [MW] = M or W.
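A small Python sketch of the difference; the variable names are hypothetical. Inside a character class the pipe is an ordinary character, so [M|W] also accepts a literal |.

```python
import re

# Inside [...] the pipe is a literal character, not alternation.
pipe_in_wide_class = re.fullmatch(r"[M|W]", "|") is not None
pipe_in_narrow_class = re.fullmatch(r"[MW]", "|") is not None

print(pipe_in_wide_class)    # True
print(pipe_in_narrow_class)  # False
```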



I'm not a Python expert, so I can't tell why it worked in Python, but in Java, your problem is the part [\\.[0-9]+]?. You probably meant (\\.[0-9]+)?.

Be that as it may, this is a list of characters within [], followed by ?. That is, this part of the expression matches either a single character or nothing at all, so it cannot match .5.
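The difference between an optional class and an optional group can be sketched in Python on the fragment .5 alone. The class [.0-9+] below is a hand-written stand-in for how Java reads the bracketed part (an assumption for illustration), and the variable names are hypothetical.

```python
import re

sample = ".5"

# Optional character class: consumes at most one character.
class_hit = re.match(r"[.0-9+]?", sample).group(0)

# Optional group: consumes the dot and all following digits.
group_hit = re.match(r"(\.[0-9]+)?", sample).group(0)

print(repr(class_hit), repr(group_hit))  # '.' '.5'
```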

Here's an illustration of the matching attempts:

[Image: graphical demonstration of matching in Java]

Now, if your pattern used () instead of [], this would be the result:

[Image not preserved]
