Tokenize C++ string with regex having special characters

Question

I am trying to find tokens in a string that has words, numbers and special characters. I tried the following code:

#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    // Split on runs of whitespace and punctuation; -1 selects the text between matches.
    regex reg("[\\s,.!\"]+");
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (const auto& a : vec) {
        cout << a << ":";
    }
    cout << endl;
}

And got the following output:

The:quick:brown:fox:99:named:quick_joe:

But I want the result:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

What regex should I use for this? I would like to stick with standard C++ if possible, i.e. I would rather not use a Boost solution.

(See question 43594465 for the Java version of this question; here I am looking for a C++ solution. Essentially, the question is how to map Java's Matcher and Pattern to C++.)

+3

c++ regex

R71, Apr 26 '17 at 7:18

2 answers

If you want to follow the approach from the related Java question, use the same idea here: match the tokens themselves instead of splitting on the delimiters.

regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);

See the C++ demo. Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

Note that this will not match Unicode letters, since \w (like \d and \s) is not Unicode-aware in std::regex.

Pattern details:

\d+ - 1 or more digits
| - or
[^\W\d]+ - 1 or more ASCII letters or _
| - or
[^\w\s] - 1 char other than an ASCII letter/digit, _ and whitespace.
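
The matching approach can be wrapped up as a small function. This is only a sketch: the name tokenize_by_matching is made up for illustration, and it assumes ASCII input, since std::regex character classes are not Unicode-aware.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): collect every regex match as a token.
std::vector<std::string> tokenize_by_matching(const std::string& text) {
    // A run of digits, or a run of letters/underscores, or one punctuation character.
    static const std::regex token_re(R"(\d+|[^\W\d]+|[^\w\s])");
    // With no submatch argument the iterator yields each whole match (submatch 0).
    std::sregex_token_iterator first(text.begin(), text.end(), token_re), last;
    return std::vector<std::string>(first, last);
}
```

Because the pattern describes the tokens rather than the separators, whitespace is never matched and so never shows up in the result.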

+1

Wiktor Stribiżew, Apr 26 '17 at 7:47

Jeff gilbert · Accepted Answer · 2017-04-26T07:47:43+0000

You are asking to interleave the unmatched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:

sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;

This gives:

The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
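
As a rough illustration of the {-1, 0} interleaving, the call can be placed in a helper. The function name split_keeping_delimiters is an assumption made for this sketch, not part of the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split on delim_re but keep the delimiters as tokens.
std::vector<std::string> split_keeping_delimiters(const std::string& text,
                                                  const std::regex& delim_re) {
    // -1 selects the text between matches, 0 selects each whole match;
    // for every match the iterator yields them in the order listed.
    std::sregex_token_iterator first(text.begin(), text.end(), delim_re, {-1, 0}), last;
    return std::vector<std::string>(first, last);
}
```

Called with the question's string and the regex [\s,.!"]+, this should return the same tokens as the output above, with the whitespace still attached to the delimiter tokens.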

Since you just want to drop the whitespace, change the regex to consume the surrounding whitespace and add a capturing group for the non-space characters. Then specify submatch 1 in the iterator instead of submatch 0:

regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;

Output:

The:,:quick brown:.:fox:":99:":named quick_joe:!:

Splitting on the whitespace between adjacent words requires also splitting on "just whitespace":

regex reg("\\s*\\s|([,.!\"]+)\\s*");

However, you end up with empty submatches:

The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:

These are easy enough to filter out:

regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
copy_if(iter, end, back_inserter(vec), [](const string& x) { return x.size(); });

Finally:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
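
Putting the steps of this answer together, a self-contained version might look like the sketch below. The name tokenize_by_splitting is hypothetical, and an explicit loop stands in for copy_if here purely as a stylistic choice.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper combining the splitting steps described above.
std::vector<std::string> tokenize_by_splitting(const std::string& text) {
    // Either plain whitespace (nothing captured), or punctuation captured in
    // group 1 with any trailing whitespace swallowed.
    static const std::regex delim_re("\\s*\\s|([,.!\"]+)\\s*");
    std::vector<std::string> tokens;
    // -1 yields the text between matches, 1 yields capture group 1 of each match.
    std::sregex_token_iterator it(text.begin(), text.end(), delim_re, {-1, 1}), last;
    for (; it != last; ++it) {
        // Group 1 is empty when the whitespace alternative matched; skip those.
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

One difference from a pattern that matches tokens directly: because of the + in [,.!"]+, a run of adjacent punctuation characters comes back as a single token here.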
