java.util.Scanner: useDelimiter(" ") or useDelimiter(Pattern.compile("\\s")) differs from the default behavior

Given the code below, it outputs:

Feed a chunk of data here:           
I have found:   0 words; 0 ints; 0 booleans;

      

if I type 10 spaces and leave both useDelimiter calls commented out, and it outputs:

Feed a chunk of data here:           
I have found:   9 words; 0 ints; 0 booleans;
sssssssss

      

if I type the same 10 spaces but enable one of the two useDelimiter calls. Why is that? Shouldn't the result be the same? Here is the code, thanks:

package com.riccardofinazzi.regex;

import java.io.Console;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.ArrayList;

class ScanNext {
    public static void main(String[] args) {

        /* match counters */
        int hits_s = 0, hits_i = 0, hits_b = 0;

        /* current token value */
        String  s;
        Integer i;
        Boolean b;

        ArrayList<Object> al = new ArrayList<Object>();

        Scanner s1 = new Scanner(System.console().readLine("Feed a chunk of data here: "));

        /* not needed as this is the default behaviour; I put it here so I don't forget the method */

        //s1.useDelimiter(Pattern.compile("\\s"));
        //s1.useDelimiter(" ");

        while(s1.hasNext()) {
            if (        s1.hasNextInt()) {
                        al.add(s1.nextInt());       hits_i++;

            } else if ( s1.hasNextBoolean()) {
                        al.add(s1.nextBoolean());   hits_b++;

            } else {    al.add(s1.next());          hits_s++;
            }
        }

        System.out.println("I have found:\t"+hits_s+" words; "+hits_i+" ints; "+hits_b+" booleans;");

        for (Object in : al) {
            if (in instanceof String)
                System.out.print("s");
            if (in instanceof Integer)
                System.out.print("i");
            if (in instanceof Boolean)
                System.out.print("b");
        }
        System.out.print("\n");
    }
}

      



3 answers


Let's assume that X is a delimiter.

If we scan a text such as "aXbXc", it is clear that there are 3 tokens: "a", "b" and "c".

If we scan a text such as "aXXc", there are still 3 tokens, but this time they are "a", "" and "c". This is because we limited the delimiter to exactly one X at a time, so the scanner does not see the second X as a continuation of the already matched delimiter, but as a separate delimiter.
(This is very useful when the delimiter is , and we are scanning data such as 1,2,,,3 because it is supposed to represent the items 1, 2, noData, noData, 3.)
If you want the delimiter to represent one or more X, you need to use X+, because + is a quantifier meaning "one or more times". With it, aXXc yields only the tokens "a" and "c", since XX is treated as a single delimiter.

Another interesting case is aXbX. As you can see, there is no c here; the text ends with a delimiter. In this case the Scanner does not assume that there is an empty token after the last delimiter, so it only sees "a" and "b" as tokens, not "a", "b", "".

The same applies to XbXc, where the text starts with a delimiter. The scanner does not assume there is an empty token in front of it.
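The cases above can be reproduced with a short sketch. The class name DelimiterCases and the helper tokens are made up for illustration; only Scanner, useDelimiter, hasNext and next come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterCases {

    /** Collects every token the Scanner returns for the given text and delimiter regex. */
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("aXbXc", "X"));    // [a, b, c]
        System.out.println(tokens("aXXc", "X"));     // [a, , c]  (empty token in the middle)
        System.out.println(tokens("aXXc", "X+"));    // [a, c]
        System.out.println(tokens("aXbX", "X"));     // [a, b]    (no empty token at the end)
        System.out.println(tokens("XbXc", "X"));     // [b, c]    (no empty token at the start)
        System.out.println(tokens("1,2,,,3", ","));  // [1, 2, , , 3]
    }
}
```

The empty strings in the printed lists are the empty tokens that a single-character delimiter produces between two adjacent delimiters.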




Now back to your case.

If you print the default delimiter of the scanner (using code like System.out.println(s1.delimiter());), you can see that it is \p{javaWhitespace}+. Thus, the default delimiter is one or more whitespace characters. But later you change it to a single space or a single whitespace character. This means that for the string

"          "

      

  • if the delimiter is \p{javaWhitespace}+, the whole string matches as a single delimiter, so there is nothing before, after or between delimiters, and there are 0 tokens (not even empty ones)
  • but if we use " " or "\\s" as the delimiter, the scanner finds 10 delimiters (each space is one of them). Since there are 10 delimiters, there are 9 tokens between them (even though they are empty strings). The text also starts and ends with a delimiter, so there is no token before the first delimiter or after the last one.
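The two bullets can be checked with a small sketch like the one below. The names TenSpaces, spaces and countTokens are hypothetical helpers, and passing null is just this sketch's way of saying "keep the Scanner default".

```java
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Pattern;

class TenSpaces {

    /** Builds a string made of n space characters. */
    static String spaces(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, ' ');
        return new String(chars);
    }

    /** Counts the tokens in text; a null delimiter leaves the Scanner default in place. */
    static int countTokens(String text, Pattern delimiter) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            if (delimiter != null) {
                scanner.useDelimiter(delimiter);
            }
            while (scanner.hasNext()) {
                scanner.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String input = spaces(10);
        System.out.println(countTokens(input, null));                   // 0
        System.out.println(countTokens(input, Pattern.compile(" ")));   // 9
        System.out.println(countTokens(input, Pattern.compile("\\s"))); // 9
    }
}
```

This matches the question's output: 0 words with the default delimiter, 9 (empty) words with either custom delimiter.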


I read part of the Scanner documentation, which says, among other things:

Depending on the type of delimiting pattern, empty tokens may be returned. For example, the pattern "\\s+" will return no empty tokens since it matches multiple instances of the delimiter. The delimiting pattern "\\s" could return empty tokens since it only passes one space at a time.



The reason for the observed behavior is the default delimiter, which is \\p{javaWhitespace}+, as you can see in Scanner.WHITESPACE_PATTERN (code from OpenJDK) and Scanner.reset() (which resets the delimiter to this pattern). Because of the +, it matches your whole input as one delimiter.

If you change your custom delimiters by appending a + to them, they will also treat consecutive spaces as a single delimiter.
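A minimal sketch of that fix, assuming a made-up input line "10   true  foo" with runs of spaces between the values; PlusQuantifierDemo and tokens are illustrative names, not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class PlusQuantifierDemo {

    /** Returns all tokens produced for text with the given delimiter regex. */
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "10   true  foo"; // three spaces, then two spaces

        // Single-character delimiters leave empty tokens inside each run of spaces.
        System.out.println(tokens(input, " "));     // [10, , , true, , foo]
        System.out.println(tokens(input, "\\s"));   // [10, , , true, , foo]

        // With the + quantifier each run of spaces is one delimiter.
        System.out.println(tokens(input, " +"));    // [10, true, foo]
        System.out.println(tokens(input, "\\s+"));  // [10, true, foo]
    }
}
```

With " +" or "\\s+" the loop in the question would count one int, one boolean and one word for this input instead of several empty words.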



Neither of the two whitespace patterns you tried matches the default delimiter, which is "\\p{javaWhitespace}+". The documentation doesn't make this clear: it just says, "A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace." Here "whitespace" means any number of consecutive whitespace characters.

The precise specification of the default delimiter is given only in the documentation of Scanner.reset(), which resets the delimiter to the default.
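To see this for yourself, the following sketch prints the delimiter of a fresh Scanner and of one that was customised and then reset. The class and method names are hypothetical; delimiter(), useDelimiter(String) and reset() are the real Scanner methods.

```java
import java.util.Scanner;

class DefaultDelimiter {

    /** Returns the regex of the delimiter a freshly created Scanner uses. */
    static String defaultDelimiter() {
        try (Scanner scanner = new Scanner("")) {
            return scanner.delimiter().pattern();
        }
    }

    /** Sets a custom delimiter, calls reset(), and returns the resulting delimiter regex. */
    static String delimiterAfterReset(String customRegex) {
        try (Scanner scanner = new Scanner("")) {
            scanner.useDelimiter(customRegex);
            scanner.reset();
            return scanner.delimiter().pattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultDelimiter());          // \p{javaWhitespace}+
        System.out.println(delimiterAfterReset("\\s"));  // \p{javaWhitespace}+
    }
}
```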
