Tokenize C++ string with regex having special characters

I am trying to find tokens in a string that has words, numbers and special characters. I tried the following code:

#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    regex reg("[\\s,.!\"]+");
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (auto a : vec) {
        cout << a << ":";
    }
    cout << endl;
}

      

And got the following output:

The:quick:brown:fox:99:named:quick_joe:

      

But I want the result:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

      

What regex should I use for this? I would like to stick with standard C++ if possible, i.e. I would rather not use a Boost solution.

(See 43594465 for the Java version of this question; now I'm looking for a C++ solution. So essentially the question is how to map Java's Matcher and Pattern to C++.)

2 answers


What you are asking for is to interleave the non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:

sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;

      

This gives:

The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
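As a self-contained sketch of the interleaving shown above, the call can be wrapped in a small function. The name split_keep_delims is hypothetical and not part of the original answer; the only library feature assumed is the sregex_token_iterator constructor that takes a list of submatch indices.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): splits `text` on
// `delims` and keeps every delimiter run as a token of its own.
std::vector<std::string> split_keep_delims(const std::string& text,
                                           const std::regex& delims) {
    // Submatch -1 yields the text between matches, submatch 0 the match
    // itself, so the iterator alternates between the two.
    std::sregex_token_iterator iter(text.begin(), text.end(), delims, {-1, 0});
    std::sregex_token_iterator end;
    return std::vector<std::string>(iter, end);
}
```

Calling split_keep_delims(str, reg) with the str and reg from the question should give the tokens printed above, whitespace included.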

      




Since you just want to drop the whitespace, change the regex to consume the surrounding whitespace and add a capture group for the non-space characters. Then specify submatch 1 in the iterator instead of submatch 0:

regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;

      

Output:

The:,:quick brown:.:fox:":99:":named quick_joe:!:

      




Splitting on the spaces between adjacent words as well requires the regex to also match "just whitespace":

regex reg("\\s*\\s|([,.!\"]+)\\s*");

      

However, you end up with empty submatches:

The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:

      




These are easy enough to filter out:

// needs <algorithm> (copy_if) and <iterator> (back_inserter)
regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
// keep only the non-empty tokens
copy_if(iter, end, back_inserter(vec), [](const string& x) { return !x.empty(); });

      

Finally:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
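Putting the pieces of this answer together, a minimal sketch of the whole approach as one function could look like the following. The function name tokenize_split is hypothetical; the regex and the copy_if filter are the ones from the snippets above.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper name; regex and filter are taken from the answer.
std::vector<std::string> tokenize_split(const std::string& str) {
    // Either a run of whitespace (group 1 does not participate) or a run of
    // punctuation captured in group 1, followed by optional whitespace.
    static const std::regex reg("\\s*\\s|([,.!\"]+)\\s*");
    // -1 selects the text between matches, 1 selects the captured punctuation.
    std::sregex_token_iterator iter(str.begin(), str.end(), reg, {-1, 1});
    std::sregex_token_iterator end;
    std::vector<std::string> vec;
    // Whitespace-only matches and adjacent matches produce empty tokens;
    // copy only the non-empty ones.
    std::copy_if(iter, end, std::back_inserter(vec),
                 [](const std::string& x) { return !x.empty(); });
    return vec;
}
```

Note that [,.!\"]+ matches runs of punctuation, so adjacent punctuation characters such as ",," would come out as a single token with this pattern.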

      



If you want to use the approach from the related Java question, use a matching approach (rather than a splitting one) here as well.

regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);

      

See the C++ demo. Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

Note that this will not match Unicode letters, since \w (like \d and \s) is not Unicode-aware in std::regex.
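For completeness, here is a minimal sketch of the matching approach as a standalone function. The name tokenize_match is hypothetical; the pattern is the one given above, and the iterator uses its default submatch 0, so every whole match becomes a token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper name; the pattern is the one from this answer.
std::vector<std::string> tokenize_match(const std::string& str) {
    // digits | letters and underscores | one non-word, non-space character
    static const std::regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
    // No submatch argument: the iterator yields each whole match.
    std::sregex_token_iterator iter(str.begin(), str.end(), reg);
    std::sregex_token_iterator end;
    return std::vector<std::string>(iter, end);
}
```

Because digits and letters are matched by different alternatives, an input such as "abc123" is split into "abc" and "123", and each punctuation character becomes its own token.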



Pattern details:

  • \d+ - 1 or more digits
  • | - or
  • [^\W\d]+ - 1 or more ASCII letters or _
  • | - or
  • [^\w\s] - 1 char other than an ASCII letter, digit, _ or whitespace.
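To see which of the three alternatives produced a given token, one option is to wrap each alternative in its own capture group and check which group participated in the match. This is only an illustrative extension, not part of the original answer; the TokenKind enum and the classify function are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories, one per alternative of the pattern.
enum class TokenKind { Number, Word, Punct };

std::vector<std::pair<TokenKind, std::string>> classify(const std::string& str) {
    // Same alternatives as above, each wrapped in a capture group.
    static const std::regex reg(R"((\d+)|([^\W\d]+)|([^\w\s]))");
    std::vector<std::pair<TokenKind, std::string>> out;
    for (std::sregex_iterator it(str.begin(), str.end(), reg), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one group participates in each match.
        TokenKind kind = m[1].matched ? TokenKind::Number
                       : m[2].matched ? TokenKind::Word
                                      : TokenKind::Punct;
        out.emplace_back(kind, m.str());
    }
    return out;
}
```

The tokens themselves are the same as with the ungrouped pattern; the groups only add the information about which alternative matched.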