Regex \b in Java and JavaScript

Is there any difference in how the \b regex behaves in Java and JavaScript?
I tried the test below.
In JavaScript:

console.log(/\w+\b/.test("test中文"));//true  

      

In Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}

      

Why is the result of the two examples above not the same?



2 answers


This is because, by default, Java supports Unicode for \b but not for \w, while JavaScript does not support Unicode for either.

So \w can only match characters in [a-zA-Z0-9_] (test in our case), but \b cannot accept the position marked with | in

test|中文

as a boundary between an alphabetic and a non-alphabetic character, since Unicode considers both t and 中 alphabetic.
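
As a rough illustration of that mismatch, the sketch below compares the default \w class with Java's Unicode letter-or-digit test for t and 中. The class and method names are made up for this example, and Character.isLetterOrDigit is used only as a stand-in for "alphabetic by Unicode", not as a claim about how \b is implemented internally. The escape \u4e2d spells 中 so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class WordCharCheck {
    // True if the default (ASCII-only) \w class accepts the single character.
    static boolean matchesAsciiWord(String ch) {
        return Pattern.matches("\\w", ch);
    }

    // True if Unicode classifies the character as a letter or a digit.
    static boolean isUnicodeLetterOrDigit(String ch) {
        return Character.isLetterOrDigit(ch.codePointAt(0));
    }

    public static void main(String[] args) {
        // "t" and the CJK character U+4E2D from the question's test string.
        for (String ch : new String[] {"t", "\u4e2d"}) {
            System.out.println(ch + "  \\w=" + matchesAsciiWord(ch)
                    + "  letterOrDigit=" + isUnicodeLetterOrDigit(ch));
        }
    }
}
```

Both characters count as letters for Unicode, but only t is accepted by the default \w.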

If you want a \b that ignores Unicode, you can use look-around and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), or, in the case of this example, a simple (?!\\w) instead of \\b will do.
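
A minimal sketch of both rewrites, applied to the string from the question. The class name, the ASCII_BOUNDARY constant and the firstMatch helper are hypothetical names introduced here; \u4e2d\u6587 spells 中文.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class AsciiBoundaryDemo {
    // Word boundary built from look-arounds on the default, ASCII-only \w.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587";
        // Full look-around replacement for \b.
        System.out.println(firstMatch("\\w+" + ASCII_BOUNDARY, text)); // test
        // Shorter variant that is enough for this particular pattern.
        System.out.println(firstMatch("\\w+(?!\\w)", text));           // test
    }
}
```

The short form only checks the right-hand side of the boundary, which is all that \w+ followed by \b needs here.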

If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the flag expression (?U)).
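
For instance, something along these lines should make \w cover 中文 too, so the whole string matches. The class and helper names are invented for the example; \u4e2d\u6587 spells 中文.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class UnicodeWordDemo {
    // Returns the first match of the compiled pattern in text, or null.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587";

        // Variant 1: pass the flag to Pattern.compile.
        Pattern byFlag = Pattern.compile("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
        // Variant 2: embed the same flag in the expression itself.
        Pattern byInline = Pattern.compile("(?U)\\w+\\b");

        // Both print the full string, since \w now accepts the CJK characters
        // and \b is satisfied at the end of the input.
        System.out.println(firstMatch(byFlag, text));
        System.out.println(firstMatch(byInline, text));
    }
}
```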



The Java regex looks for a sequence of word characters, i.e. [a-zA-Z_0-9]+, followed by a word boundary. But 中文 does not match \w. If you use only \\b, you will find two matches: the beginning and the end of the string.
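
To check this yourself, a small sketch like the following lists the offsets where a bare \b matches. The class and method names are hypothetical, and \u4e2d\u6587 spells 中文. The first offset is 0; where the second one falls depends on whether the running JDK treats 中文 as word characters for \b, so no exact output is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class BareBoundaryDemo {
    // Collects the start offset of every zero-width \b match in text.
    static List<Integer> boundaryOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Prints a list with two offsets for the question's test string.
        System.out.println(boundaryOffsets("test\u4e2d\u6587"));
    }
}
```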



As georg pointed out, JavaScript does not interpret characters in the same way as the Java regex engine.
