Regex \b in Java and JavaScript
Is there any difference in how the \b regex behaves in Java and JavaScript?
I've tried the test below:
In JavaScript:
console.log(/\w+\b/.test("test中文")); // true
In Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}
Why is the result of the two examples above not the same?
This is because, by default, Java supports Unicode for \b but not for \w, while JavaScript does not support Unicode for either.
So \w can only match the characters [a-zA-Z0-9_] (in our case, test), but \b cannot match at the position marked | in test|中文, because by Unicode standards that position is not between a word character and a non-word character: both t and 中 are considered alphabetic by Unicode.
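To make the mismatch concrete, here is a minimal sketch that checks single characters against \w with and without Unicode support. The class and method names (WordCharCheck, isAsciiWordChar, isUnicodeWordChar) are made up for illustration, and the CJK characters are written as Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

class WordCharCheck {
    // True if the whole string s matches \w under Java's default (ASCII-only) rules.
    static boolean isAsciiWordChar(String s) {
        return Pattern.matches("\\w", s);
    }

    // True if the whole string s matches \w with UNICODE_CHARACTER_CLASS enabled.
    static boolean isUnicodeWordChar(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d"; // first CJK character of the question's test string

        System.out.println(isAsciiWordChar("t"));          // true
        System.out.println(isAsciiWordChar(cjk));          // false: not in [a-zA-Z0-9_]
        System.out.println(Character.isLetter(cjk.charAt(0))); // true: a letter by Unicode
        System.out.println(isUnicodeWordChar(cjk));        // true
    }
}
```

So the same character is a letter as far as Unicode is concerned, yet it is not a word character for the default \w.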
If you want a \b that ignores Unicode, you can use lookarounds and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), or, in the case of this example, a simple (?!\\w) in place of \\b will do.
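A minimal sketch of the lookaround rewrite, applied to the string from the question. The names AsciiBoundary, ASCII_B and firstMatch are hypothetical helpers, not part of any API; the boundary is built from \w itself, so it should not depend on how the engine defines \b.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiBoundary {
    // \b spelled out with lookarounds, so the boundary uses the same
    // definition of "word character" as \w does.
    static final String ASCII_B = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";

        System.out.println(firstMatch("\\w+" + ASCII_B, text)); // test
        System.out.println(firstMatch("\\w+(?!\\w)", text));    // test
    }
}
```

Both variants stop after test, because the character that follows is not matched by the default \w.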
If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the embedded flag expression (?U)).
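A short sketch of both spellings, assuming the same test string as in the question. UnicodeWordMatch and firstMatch are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordMatch {
    // Returns the first match of the pattern in text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";

        // Flag passed to Pattern.compile
        Pattern byFlag = Pattern.compile("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
        // Same flag embedded in the expression
        Pattern inline = Pattern.compile("(?U)\\w+\\b");

        // With Unicode-aware \w, the whole string is one word.
        System.out.println(text.equals(firstMatch(byFlag, text))); // true
        System.out.println(text.equals(firstMatch(inline, text))); // true
    }
}
```

With the flag on, \w+ consumes all six characters and \b then matches at the end of the input.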
The Java regex searches for a sequence of word characters, i.e. [a-zA-Z_0-9]+, preceding a word boundary. But 中文 does not match \w. If you use only \\b, you will find two matches: the beginning and the end of the line.
As georg pointed out, JavaScript does not interpret characters in the same way as the Java regex engine.
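To see where \b actually matches, a small sketch can list the boundary positions. BoundaryPositions and positions are hypothetical names. As far as I know, JDK 19 and later changed the default \b to use the same ASCII-only definition as \w, so the default-mode result for the mixed string may differ between Java versions; for that reason it is printed here without a stated expected value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Collects the start index of every (zero-width) match of the pattern in text.
    static List<Integer> positions(Pattern p, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";

        // Default \b: the result depends on the Java version, so no expected value is given.
        System.out.println(positions(Pattern.compile("\\b"), text));

        // Unicode-aware \b: only the start and the end of the string are boundaries.
        System.out.println(positions(Pattern.compile("(?U)\\b"), text)); // [0, 6]

        // Pure ASCII input behaves the same in either mode.
        System.out.println(positions(Pattern.compile("\\b"), "test"));   // [0, 4]
    }
}
```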