Tokenize C++ string with regex having special characters
I am trying to find tokens in a string that has words, numbers and special characters. I tried the following code:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    regex reg("[\\s,.!\"]+");
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (auto a : vec) {
        cout << a << ":";
    }
    cout << endl;
}
And got the following output:
The:quick:brown:fox:99:named:quick_joe:
But I want the result:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
What regex should I use for this? I would like to stick with standard C++ if possible, i.e. I would rather not use a Boost solution.
(See 43594465 for the Java version of this question; here I'm looking for a C++ solution. Essentially, the question is how to map Java's Matcher and Pattern to C++.)
You are asking to interleave the unmatched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;
This gives:
The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
Since you just want to drop the whitespace, change the regex to consume the surrounding whitespace and add a capture group for the non-space characters. Then specify submatch 1 in the iterator instead of submatch 0:
regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
Output:
The:,:quick brown:.:fox:":99:":named quick_joe:!:
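To experiment with how the submatch list changes the result, the variants above can be wrapped in a small helper. This is only a sketch: the name collect_tokens and its parameters are hypothetical and not part of the original answer. It relies on the std::sregex_token_iterator constructor that accepts a std::vector<int> of submatch indices.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption, not from the answer):
// runs a token iterator over `text` and returns everything it yields.
// -1 selects the text between matches, 0 the whole match,
// and N > 0 the N-th capture group.
std::vector<std::string> collect_tokens(const std::string& text,
                                        const std::string& pattern,
                                        const std::vector<int>& submatches) {
    const std::regex re(pattern);  // must outlive the iterators below
    std::sregex_token_iterator it(text.begin(), text.end(), re, submatches), end;
    return std::vector<std::string>(it, end);
}
```

Calling collect_tokens(str, "\\s*([,.!\"]+)\\s*", {-1, 1}) on the question's string should reproduce the ten tokens shown above, while {-1, 0} with the original "[\\s,.!\"]+" pattern should give the variant that still contains the spaces.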
Splitting on the spaces between adjacent words as well requires an extra alternative that matches "just spaces":
regex reg("\\s*\\s|([,.!\"]+)\\s*");
However, you end up with empty submatches:
The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:
Easy enough to omit:
regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
// copy_if and back_inserter need <algorithm> and <iterator>
copy_if(iter, end, back_inserter(vec), [](const string& x) { return !x.empty(); });
Finally:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
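Put together, the final version might look like the sketch below. The function name split_keep_punct is an assumption made for illustration; the regex and the {-1, 1} submatch list are the ones from the answer.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption): splits on whitespace
// and keeps each punctuation run as a token of its own.
std::vector<std::string> split_keep_punct(const std::string& text) {
    // Alternative 1 consumes plain whitespace and leaves group 1 unmatched;
    // alternative 2 captures a punctuation run in group 1 and also
    // swallows any whitespace that follows it.
    static const std::regex re("\\s*\\s|([,.!\"]+)\\s*");
    std::sregex_token_iterator it(text.begin(), text.end(), re, {-1, 1}), end;

    std::vector<std::string> tokens;
    // Skip the empty submatches produced by the whitespace-only alternative
    // and by matches that directly follow one another.
    std::copy_if(it, end, std::back_inserter(tokens),
                 [](const std::string& t) { return !t.empty(); });
    return tokens;
}
```

For the string from the question this should return the twelve tokens listed above, from "The" through "!".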
If you want to use the approach from the related Java question, you can match the tokens here as well instead of splitting:
regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);
Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
Note that this will not match Unicode letters, because \w (like \d and \s) is not Unicode-aware in std::regex.
Pattern details:
- \d+ - 1 or more digits
- | - or
- [^\W\d]+ - 1 or more ASCII letters or _
- | - or
- [^\w\s] - 1 char other than an ASCII letter/digit, _ and whitespace.
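For readers coming from Java, std::regex plays roughly the role of Pattern, and a loop over std::sregex_iterator is a rough counterpart of calling Matcher.find() repeatedly. The sketch below uses the pattern above; the function name match_tokens is an assumption made for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption): collects every match
// of the token pattern instead of splitting on separators.
std::vector<std::string> match_tokens(const std::string& text) {
    // digits | letters and underscores | a single other non-space character
    static const std::regex re(R"(\d+|[^\W\d]+|[^\w\s])");

    std::vector<std::string> tokens;
    // Each step of the iterator finds the next match, much like
    // a while (matcher.find()) loop in Java.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        tokens.push_back(it->str());  // it->str() is the whole match
    }
    return tokens;
}
```

One design difference from the splitting approach: because digits and letters are separate alternatives, an input such as abc123 would come back as two tokens, abc and 123.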