Regex \b in Java and JavaScript

Question

Is there any difference in how the \b regex behaves in Java and JavaScript?
I tried the test below.
In JavaScript:

console.log(/\w+\b/.test("test中文"));//true

In Java:

// requires java.util.regex.Pattern and java.util.regex.Matcher
String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}

Why do the two examples above give different results?

+3

java javascript regex

Gary chen May 24 '15 at 15:29

2 answers

The Java regex searches for a sequence of word characters, i.e. [a-zA-Z_0-9]+, followed by a word boundary. But 中文 does not match \w. If you use only \\b, you will find two matches: the beginning and the end of the line.
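To check the \\b claim on your own JDK, here is a minimal sketch that collects the start index of every zero-width \\b match. The class and method names (BoundaryPositions, boundaries) are made up for illustration. The exact positions may depend on the JDK version: as far as I know, the default behaviour of \b was changed to ASCII-only in JDK 19, so the second boundary may show up after test rather than at the end of the line. For that reason no expected output is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class BoundaryPositions {
    // Returns the start index of every zero-width \b match in the input.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\b").matcher(input);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Prints the list of boundary indexes found by this JDK.
        System.out.println(boundaries("test中文"));
    }
}
```

Either way there should be two boundaries, the first one at index 0.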

As georg pointed out, JavaScript does not interpret these characters in the same way as the Java regex engine.

+1

laune May 24 '15 at 15:43

Pshemo · Accepted Answer · 2015-05-24T16:10:27+0000

This is because, by default, Java supports Unicode for \b but not for \w, while JavaScript does not support Unicode for either.

So \w can only match the characters [a-zA-Z0-9_] (in our case test), but \b cannot match at the position marked with | in

test|中文

because by Unicode standards that is not a place between an alphabetic and a non-alphabetic character: both t and 中 are considered alphabetic characters by Unicode.

If you want a \b that ignores Unicode, you can use look-around and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), or in the case of this example a simple (?!\\w) instead of \\b will also do.
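A minimal sketch of the look-around rewrite, using the question's input. The class name AsciiBoundary, the constant ASCII_B and the helper firstMatch are made up for illustration; only the two patterns come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AsciiBoundary {
    // \b rewritten with look-around, so both sides are tested with the ASCII-only \w.
    static final String ASCII_B = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Full replacement for \b
        System.out.println(firstMatch("\\w+" + ASCII_B, "test中文")); // test
        // Shorter variant that is enough for this example
        System.out.println(firstMatch("\\w+(?!\\w)", "test中文"));    // test
    }
}
```

Both patterns stop after test because 中 is not matched by the default \w, which is the same result the JavaScript test reports.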

If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the flag expression (?U)).
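A minimal sketch of both spellings of the flag, again on the question's input. The class name UnicodeWord and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class UnicodeWord {
    // Returns the first match of the pattern in input, or null when there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Flag passed to Pattern.compile
        Pattern withFlag = Pattern.compile("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
        // Same flag as an embedded flag expression
        Pattern inline = Pattern.compile("(?U)\\w+\\b");

        System.out.println(firstMatch(withFlag, "test中文")); // test中文
        System.out.println(firstMatch(inline, "test中文"));   // test中文
    }
}
```

With the flag set, \w also matches 中 and 文, so the whole string is one word and the boundary is found at its end.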
