How to delimit on both "=" and "==" in Java when reading a file

I want to be able to output both "==" and "=" as tokens.

For example, the input text file:

biscuit==cookie apple=fruit+-()

      

Current output:

biscuit
=
=
cookie
apple
=
fruit
+
-
(
)

      

I want the output to be:

biscuit
==
cookie
apple
=
fruit
+
-
(
)

      

Here is my code:

    Scanner s = null;
    try {
        s = new Scanner(new BufferedReader(new FileReader("input.txt")));
        s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");

        while (s.hasNext()) {

            String next = s.next();
            System.out.println(next);
       }
    } finally {
        if (s != null) {
            s.close();
        }
    }

      

Thanks.

Edit: I want to keep the behavior of the current regex.



4 answers


You may be able to qualify the punctuation lookarounds with some additional assertions.

 # "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"

   \s 
|  (?<= == )
|  (?<= \p{Punct} )
   (?!
        (?<= = )
        (?= = )
   )
|  (?= \p{Punct} )
   (?!
        (?<= = )
        (?= = )
   )
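
As a rough check of the pattern above, here is a minimal sketch that applies it with String.split instead of a Scanner. The class and method names (PunctSplitDemo, tokenize) are made up for illustration; the delimiter is the escaped Java string from the comment line above.

```java
import java.util.Arrays;
import java.util.List;

final class PunctSplitDemo {
    // The pattern from this answer, written as a Java string literal.
    static final String DELIMITER =
        "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))";

    // Hypothetical helper: splits one line of input on the delimiter above.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(DELIMITER));
    }

    public static void main(String[] args) {
        // Prints one token per line for the sample input from the question.
        for (String token : tokenize("biscuit==cookie apple=fruit+-()")) {
            System.out.println(token);
        }
    }
}
```

With split, input where whitespace sits directly next to punctuation (for example "a =b") can produce empty strings, so filter those out if your input looks like that.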

      

Update:

If some characters are not included in \p{Punct}, just add them as a separate class to the punctuation subexpressions.



For engines that do not support properties inside character classes, use this:

 #  Raw:   \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))

    \s 
 |  (?<= == )
 |  (?<= \p{Punct} | [=+] )
    (?!
         (?<= = )
         (?= = )
    )
 |  (?= \p{Punct} | [=+] )
    (?!
         (?<= = )
         (?= = )
    )

      

For engines that handle properties inside character classes well, this is the best option:

 #  Raw:   \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))

    \s 
 |  (?<= == )
 |  (?<= [\p{Punct}=+] )
    (?!
         (?<= = )
         (?= = )
    )
 |  (?= [\p{Punct}=+] )
    (?!
         (?<= = )
         (?= = )
    )

      



Just split the input line on the expression below.

String s = "biscuit==cookie apple=fruit"; 
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));

      

Output:

[biscuit, ==, cookie, apple, =, fruit]

      

Explanation:

  • \\s+ matches one or more whitespace characters.
  • | means OR.
  • \\b(?==+) matches a word boundary only if it is followed by one or more = characters.
  • | means OR.
  • (?<==) is a lookbehind asserting that the preceding character is =.
  • (?!=) then matches that position only if it is not followed by another =.
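
To see what the \\b alternative does and does not cover, here is a small sketch that runs the expression on two inputs. The class name BoundarySplitDemo and the second input f()=1 are illustrative assumptions, not taken from the question.

```java
import java.util.Arrays;
import java.util.List;

final class BoundarySplitDemo {
    // The split expression explained in the list above.
    static final String DELIMITER = "\\s+|\\b(?==+)|(?<==)(?!=)";

    // Hypothetical helper: splits one line on the expression.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(DELIMITER));
    }

    public static void main(String[] args) {
        // Word characters on both sides of the operators: each operator
        // becomes its own token.
        System.out.println(tokenize("biscuit==cookie apple=fruit"));
        // ')' and '=' are both non-word characters, so \b does not match
        // between them and no split happens in front of '='.
        System.out.println(tokenize("f()=1"));
    }
}
```

Because \\b needs a word character on one side, an = that follows punctuation stays glued to the preceding text. Replacing \\b with (?<!=) avoids depending on what kind of character comes before the =.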

Update:

String s = "biscuit==cookie apple=fruit+-()"; 
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));

      

Output:

[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]

      



In other words, you want to split on

  • one or more whitespace characters
  • a position that has = after it and no = before it (for example foo|=, where | represents this position)
  • a position that has = before it and no = after it (for example =|foo, where | represents this position)

As a regex:

s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
//             ^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases:         1)        2)        3)
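
A minimal sketch of how this delimiter could be tried out without a file, by pointing the Scanner at a String. The class and method names (ScannerDelimiterDemo, tokenize) are hypothetical. Note that this delimiter only covers whitespace and =, so the +-() part of the question's input would need further alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerDelimiterDemo {
    // Hypothetical helper: collects the Scanner tokens of a string.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the Scanner, so no finally block is needed.
        try (Scanner s = new Scanner(text)) {
            // Cases 1), 2) and 3) from the explanation above.
            s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("biscuit==cookie apple=fruit")) {
            System.out.println(token);
        }
    }
}
```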

      

Since it looks like you are building a parser, I would suggest using a tool that lets you define a proper grammar, such as http://www.antlr.org/ . But if you have to stick with regex, another improvement that would make the regex easier to write is to use Matcher#find instead of a Scanner delimiter. Your regex and code might then look like this:

    String data = "biscuit==cookie apple=fruit+-()";

    String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(data);

    while (m.find())
        System.out.println(m.group());

      

Output (the last alternative does not exclude whitespace, so "cookie apple" comes out as one token):

biscuit
==
cookie apple
=
fruit
+
-
(
)

      

You can make this regex more general with

String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
//                       ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
//              ^^ ^^ ^^------------------------- special cases
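
If whitespace should only separate tokens, one possible variation (a sketch, not the exact regex from this answer) is to exclude whitespace from the last alternative as well. Matcher#find then skips over it because no alternative matches there. The names FindTokenizerDemo and tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindTokenizerDemo {
    // Multi-character operators first, then any single punctuation character,
    // then a run of characters that are neither punctuation nor whitespace.
    private static final Pattern TOKEN =
        Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    // Hypothetical helper: returns every match of TOKEN in order.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        // find() moves past characters that no alternative matches,
        // which here means whitespace.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("biscuit==cookie apple=fruit+-()")) {
            System.out.println(token);
        }
    }
}
```

The order of the alternatives matters: == has to come before \\p{Punct}, otherwise the first = would be matched on its own.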

      

This approach also requires you to read the data from the file first and store it in a single string that you then parse. There are many ways to read text from a file, for example in this question: Reading a Plain Text File in Java. You could use something like

String data = new String(Files.readAllBytes(Paths.get("input.txt")));

      

You can specify the encoding that String should use when decoding the bytes from the file with the constructor String(bytes, encoding). So you can write it as new String(bytes, "UTF-8") or, to avoid typos when choosing an encoding, use one of the constants in the StandardCharsets class, for example new String(bytes, StandardCharsets.UTF_8).
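
Put together, reading the file with an explicit charset might look like the sketch below. The class and method names (ReadFileDemo, readAll) are hypothetical, and UTF-8 is an assumption about how input.txt is encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ReadFileDemo {
    // Hypothetical helper: reads the whole file and decodes it as UTF-8
    // (assumed encoding; pick the constant that matches your file).
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is the file name used in the question.
        String data = readAll(Paths.get("input.txt"));
        System.out.println(data);
    }
}
```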



(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)

      

You can try this demo.

http://regex101.com/r/wQ1oW3/18
