How to delimit both "=" and "==" in Java when reading
I want to be able to output both "==" and "=" as tokens.
For example, the input text file:
biscuit==cookie apple=fruit+-()
Current output:
biscuit = = cookie apple = fruit + - ( )
I want the output to be:
biscuit == cookie apple = fruit + - ( )
Here is my code:
Scanner s = null;
try {
    s = new Scanner(new BufferedReader(new FileReader("input.txt")));
    s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");
    while (s.hasNext()) {
        String next = s.next();
        System.out.println(next);
    }
} finally {
    if (s != null) {
        s.close();
    }
}
Thanks.
Edit: I would like to keep my current regex as the starting point.
You may be able to qualify the punctuation lookarounds with some additional assertions.
# "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"
   \s
|  (?<= == )
|  (?<= \p{Punct} )
   (?!
      (?<= = )
      (?= = )
   )
|  (?= \p{Punct} )
   (?!
      (?<= = )
      (?= = )
   )
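As a quick check, the delimiter above can be tried on the sample line from the question. This is only a minimal sketch: the class name DelimiterDemo and the method tokenize are made up for illustration, and String.split stands in for the Scanner so that the tokens are easy to inspect.

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterDemo {
    // Delimiter from the answer: a whitespace character, the position right
    // after "==", or a position next to punctuation unless that position
    // lies between two '=' signs.
    static final String DELIM =
        "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))";

    // Hypothetical helper: splits one line into tokens using DELIM.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(DELIM));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("biscuit==cookie apple=fruit+-()"));
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
    }
}
```

The same string can be passed to Scanner.useDelimiter in the question's code.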
Update
If some characters are not included in \p{Punct}, just add them as a separate class to the punctuation subexpressions.
For engines that do not support properties inside character classes, use this:
# Raw: \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))
   \s
|  (?<= == )
|  (?<= \p{Punct} | [=+] )
   (?!
      (?<= = )
      (?= = )
   )
|  (?= \p{Punct} | [=+] )
   (?!
      (?<= = )
      (?= = )
   )
For engines that do handle properties inside character classes, this is the better option:
# Raw: \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))
   \s
|  (?<= == )
|  (?<= [\p{Punct}=+] )
   (?!
      (?<= = )
      (?= = )
   )
|  (?= [\p{Punct}=+] )
   (?!
      (?<= = )
      (?= = )
   )
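In Java, \p{Punct} matches only ASCII punctuation by default, so = and + are already covered, but a character such as the section sign (U+00A7) is not. The sketch below adds that one character to the class, purely as an illustration. The class name ExtraPunctDemo, the method tokenize, and the choice of extra character are assumptions, not part of the answer.

```java
import java.util.Arrays;
import java.util.List;

public class ExtraPunctDemo {
    // \p{Punct} widened with one extra character (U+00A7, the section sign).
    static final String PUNCT = "[\\p{Punct}\u00A7]";

    // Same structure as the answer's regex, with PUNCT in place of \p{Punct}.
    static final String DELIM =
        "\\s|(?<===)"
        + "|(?<=" + PUNCT + ")(?!(?<==)(?==))"
        + "|(?=" + PUNCT + ")(?!(?<==)(?==))";

    // Hypothetical helper: splits one line into tokens using DELIM.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(DELIM));
    }

    public static void main(String[] args) {
        // The section sign now becomes its own token and "==" still stays whole.
        System.out.println(tokenize("a\u00A7b==c"));
    }
}
```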
Just split the input line on the expression below.
String s = "biscuit==cookie apple=fruit";
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit]
Explanation:
- \\s+ matches one or more whitespace characters.
- | OR
- \\b(?==+) matches a word boundary only if it is followed by one or more = characters.
- | OR
- (?<==) asserts that the position is preceded by a = character.
- (?!=) then matches that position only if it is not followed by another = character.
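To see why the trailing (?!=) matters, the sketch below compares the expression above with a variant that drops that lookahead. The class and method names are hypothetical and exist only for this comparison.

```java
import java.util.Arrays;
import java.util.List;

public class LookaheadDemo {
    // Expression from the answer.
    static final String WITH_LOOKAHEAD = "\\s+|\\b(?==+)|(?<==)(?!=)";
    // Same expression without the final negative lookahead (illustration only).
    static final String WITHOUT_LOOKAHEAD = "\\s+|\\b(?==+)|(?<==)";

    // Hypothetical helper: splits text on the given regex.
    static List<String> split(String text, String regex) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(split("a==b", WITH_LOOKAHEAD));    // prints [a, ==, b]
        System.out.println(split("a==b", WITHOUT_LOOKAHEAD)); // prints [a, =, =, b]
    }
}
```

Without the lookahead, the position between the two = signs also counts as a split point, which is exactly the behaviour the question is trying to avoid.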
Update:
String s = "biscuit==cookie apple=fruit+-()";
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
In other words, you want to split on
- one or more whitespace characters
- a position which has = after it and no = before it (for example foo|=, where | represents this position)
- a position which has = before it and no = after it (for example =|foo, where | represents this position)
Expressed as a delimiter:
s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
//              ^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases:        1)   2)          3)
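A minimal sketch of this delimiter in use is shown here. It reads from a String instead of input.txt so that it is self-contained, and the class name ScannerDelimiterDemo and method tokenize are made up for the example. Note that this delimiter only handles whitespace and =, so +, -, ( and ) would still need their own alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerDelimiterDemo {
    // Hypothetical helper: collects every token the Scanner produces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the Scanner automatically
        try (Scanner s = new Scanner(text)) {
            s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("biscuit==cookie apple=fruit"));
        // prints [biscuit, ==, cookie, apple, =, fruit]
    }
}
```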
Since it looks like you are building a parser, I would suggest using a tool that lets you define a proper grammar, such as http://www.antlr.org/ . But if you have to stick with regex, another improvement that makes the regex easier to write is to use Matcher#find instead of a Scanner delimiter. Your regex and code might then look like
String data = "biscuit==cookie apple=fruit+-()";
String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group());
Output:
biscuit
==
cookie apple
=
fruit
+
-
(
)
You can make this regex more general with
String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
//                       ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
//              ^^ ^^ ^^------------------------- special cases
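With \\P{Punct}+ whitespace ends up inside tokens, as in the "cookie apple" line of the output above. One possible variation, sketched below, swaps the last alternative for [^\\p{Punct}\\s]+ so that find() simply skips whitespace. That change, the class name FindTokenizerDemo, and the method tokenize are assumptions for illustration and not part of the answer's regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTokenizerDemo {
    // Special cases first, then any single punctuation character, then a run
    // of characters that are neither punctuation nor whitespace.
    static final Pattern TOKEN =
        Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    // Hypothetical helper: returns every match of TOKEN, in order.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("biscuit==cookie apple=fruit+-()"));
        // prints [biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
    }
}
```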
This approach also requires you to read the data from the file first and store it in a single String that you then parse. There are many ways to read text from a file, for example those in this question: Reading a Plain Text File in Java
so that you can use something like
String data = new String(Files.readAllBytes(Paths.get("input.txt")));
You can specify the encoding that String should use when decoding the bytes with the constructor String(bytes, encoding). So you can write new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, for example new String(bytes, StandardCharsets.UTF_8).
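Put together, reading the file with an explicit charset might look like the sketch below. The class name ReadFileDemo and the helper readUtf8 are hypothetical, and a temporary file stands in for input.txt so that the example runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadFileDemo {
    // Hypothetical helper: reads the whole file and decodes it as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file is used here in place of input.txt.
        Path tmp = Files.createTempFile("input", ".txt");
        try {
            Files.write(tmp, "biscuit==cookie apple=fruit+-()".getBytes(StandardCharsets.UTF_8));
            String data = readUtf8(tmp);
            System.out.println(data); // the String that the Matcher would then parse
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```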