Split string with regex \w, \w*?, \w+?

I am learning regex and thought I was starting to get the hang of it, but then ...

I tried to split a string, and I need help understanding something as simple as:

String input = "abcde";
System.out.println("[a-z] " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? " + Arrays.toString(input.split("\\w+?")));

The output is
[a-z] - []
\w    - []
\w*?  - [, a, b, c, d, e]
\w+?  - []

      

Why does neither of the first two expressions split the String on any character? The third expression, \w*? (the question mark prevents greediness), works as I expected, splitting the String at each character. The fourth, with a plus sign (one or more matches), returns an empty array.

I tried the expression \w in Notepad++ and in a program, and it shows 5 matches, as in:

Scanner ls = new Scanner(input);
while(ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));

Output is: a b c d e

      

It really puzzles me

+3




3 answers


If you split a string with a regex, you are essentially telling Java where the string should be cut. This necessarily cuts away whatever you match with the regex. This means that if you split on \w, every character is a split point and the substrings between them (all empty) are returned. Java automatically removes trailing empty strings, as described in the documentation.

This also explains why the lazy match \w*? gives you every character: it matches at every position between (and before and after) the characters, and each match is zero-width. What remains are the characters of the string themselves.



Let's break it down:

  • [a-z], \w, \w+?

    Your string is

    abcde
    
          

    And the matches look like this:

     a  b  c  d  e
    └─┘└─┘└─┘└─┘└─┘
    
          

    which leaves you with substrings between matches, all of which are empty.

    The above three regexes behave the same in this regard, as they all match exactly one character at a time. \w+? does this because nothing after it in the pattern forces +? to match more than the bare minimum (it is lazy, after all).

  • \w*?

      a  b  c  d  e
    └┘ └┘ └┘ └┘ └┘ └┘
    
          

    In this case, the matches are between characters, leaving you with the following substrings:

    "", "a", "b", "c", "d", "e", ""
    
          

    Java discards the trailing empty string.
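To see the diagrams above as actual numbers, here is a minimal sketch that lists where each regex matches in "abcde". The class name MatchSpans and the helper spans are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Hypothetical helper: returns every match of regex in input
    // as a half-open "[start,end)" index range.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add("[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "abcde";
        for (String regex : new String[] {"[a-z]", "\\w", "\\w+?", "\\w*?"}) {
            System.out.println(regex + " -> " + spans(regex, input));
        }
    }
}
```

The first three patterns should each report five one-character ranges, while \w*? should report six zero-width ranges, one at every position from 0 to 5. split then returns whatever lies between those ranges.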

+9




Let's break down each of these calls to String#split(String). A key thing to note from the Java docs is that "This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array."

"abcde".split("[a-z]"); // => []

      

This matches every character (a, b, c, d, e), leaving only the empty strings in between, which are omitted.

"abcde".split("\\w"); // => []

Again, every character in the string is a word character (\w), so the result is only empty strings, which are omitted.



"abcde".split("\\w*?"); // => ["", "a", "b", "c", "d", "e"]

In this case, * means "zero or more of the previous item" (\w), and the lazy ? makes it prefer zero, so it matches the empty string six times (once at the beginning of the string, once between each pair of characters, and once at the end). So we get the leading empty string and then each character.

"abcde".split("\\w+?"); // => []

Here + means "one or more of the previous item" (\w), but the lazy ? makes each match consume only a single character, so this behaves like \w: every character is a match, leaving only empty strings, which are omitted.

Repeat these examples with input.split(regex, -1) and you should see all the empty strings.
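Here is a small sketch of that experiment, comparing the default limit of 0 with a limit of -1. The class name SplitLimitDemo and the helper show are invented for this example. One caveat, as far as I know: since Java 8 a zero-width match at the very start of the input no longer produces a leading empty string, so on a current JDK \w*? gives [a, b, c, d, e] rather than the [, a, b, c, d, e] shown in the question.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: formats the result of splitting "abcde"
    // with the given regex and limit.
    static String show(String regex, int limit) {
        return Arrays.toString("abcde".split(regex, limit));
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"[a-z]", "\\w", "\\w+?", "\\w*?"}) {
            // limit 0 drops trailing empty strings, limit -1 keeps them
            System.out.println(regex + "  limit 0: " + show(regex, 0)
                    + "  limit -1: " + show(regex, -1));
        }
    }
}
```

With a limit of -1, the first three patterns should each yield six empty strings (five matches cut the input into six pieces), and \w*? should keep its trailing empty string after "e".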

+2




String.split slices the string at every pattern match:

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string.

So, whenever a pattern like [a-z] is matched, the string is cut at that match. Since every character in your string matches the pattern, the resulting array is empty (trailing empty strings are removed).

The same applies to \w and \w+? (one or more \w, but as few repetitions as possible). The reason \w*? results in what you expected is the quantifier *?, which matches zero repetitions whenever possible, that is, an empty string. And an empty string is found at every position in the given string.
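If what you actually want is the matched characters themselves (which is what the Scanner and Notepad++ experiments in the question show), finding matches is a better fit than splitting. A minimal sketch, assuming a made-up helper named findAll in a made-up class FindAllDemo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Hypothetical helper: collects the text of every match,
    // instead of the text between matches as split does.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\w", "abcde"));          // the matches
        System.out.println("abcde".split("\\w").length);      // pieces left over
    }
}
```

Here findAll("\\w", "abcde") should give the five letters, while split with the same regex leaves nothing, because the letters are the separators.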

+1

