C++: parsing a string of numbers with parentheses in it

This seems trivial, but I can't get around it. I have std::string values in the format 2013 336 (02 DEC) 04 (where 04 is the hour, but that's irrelevant). I'd like to extract the day of the month (02 in the example) and the month, as well as the hour.

I'm trying to do this cleanly and avoid, for example, splitting the string at the parentheses and then working with substrings. Ideally I'd like to use a stringstream and just redirect it to variables. The code I have now:

int year, dayOfYear, day, hour;
std::string month, leftParenthesis, rightParenthesis;
std::string ExampleString = "2013 336 (02 DEC) 04";

std::istringstream yearDayMonthHourStringStream( ExampleString );
yearDayMonthHourStringStream >> year >> dayOfYear >> leftParenthesis >> day >> month >> rightParenthesis >> hour;

      

It extracts year and dayOfYear fine, as 2013 and 336, but then things go badly: day is 0, month is an empty string, and hour is 843076624. leftParenthesis is (02, so it contains the day, but when I leave the leftParenthesis variable out of the extraction from yearDayMonthHourStringStream, day is also 0.

Any ideas on how to deal with this? I don't know regexes (yet) and, admittedly, I'm not sure I can afford to learn them right now (time-wise).

EDIT: OK, I have it. This does look like the billionth time a regex would have made my life a lot easier, so I guess it's about time I learned them. Anyway, here is what I ended up with:

int year, dayOfYear, day, month, hour, minute, revolution;
std::string dayString, monthString;

yearDayMonthHourStringStream >> year >> dayOfYear >> dayString >> monthString >> hour;
std::string::size_type sz;
day = std::stoi( dayString.substr( dayString.find("(")+1 ), &sz ); // Convert day to an int using C++11's std::stoi. Skip the ( that may be at the beginning.

      

monthString still needs handling (with this extraction it comes out as DEC) with the closing parenthesis attached), but I need to convert it to a number anyway, so that's not a huge disadvantage. It's not the best thing one can do (that would be a regex), but it works and isn't too messy. As far as I know it is also reasonably portable, and hopefully it won't stop working with newer compilers. Thanks, everyone.
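For the remaining step of turning monthString into a number, something along these lines might do. This is only a sketch: monthNumber is a made-up helper name, and it assumes three-letter English abbreviations such as DEC, possibly with the trailing ) still attached.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): turns "DEC" or "DEC)"
// into 12. Returns 0 if the name is not a recognised three-letter English
// month abbreviation.
int monthNumber(std::string token)
{
    // Drop a trailing ')' left over from whitespace-delimited extraction.
    if (!token.empty() && token.back() == ')') {
        token.pop_back();
    }
    // Normalise to upper case so "dec" and "Dec" are accepted too.
    for (char& c : token) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    static const std::array<const char*, 12> names = {{
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    }};
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (token == names[i]) {
            return static_cast<int>(i) + 1;
        }
    }
    return 0;
}
```

With the code above, month = monthNumber( monthString ) would give 12 for the example line, and 0 signals a name that was not recognised.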

+3




5 answers


The obvious solution is to use regular expressions (either std::regex in C++11, or boost::regex pre-C++11). Just capture the groups you are interested in, and use std::istringstream to convert them if necessary. In this case,

std::regex re( "\\s*\\d+\\s+\\d+\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)" );

      

should do the trick.

And regular expressions are really very simple; it will take you less time to learn them than to implement any alternative solution.

For an alternative solution, you'd probably want to read the line character by character, breaking it into tokens. Something along the lines of:

std::vector<std::string> tokens;
std::string currentToken;
char ch;
while ( source.get(ch) && ch != '\n' ) {
    if ( std::isspace( static_cast<unsigned char>( ch ) ) ) {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken = "";
        }
    } else if ( std::ispunct( static_cast<unsigned char>( ch ) ) ) {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken = "";
        }
        tokens.push_back( std::string( 1, ch ) );  //  Each punctuation character is a token of its own.
    } else if ( std::isalnum( static_cast<unsigned char>( ch ) ) ) {
        currentToken.push_back( ch );
    } else {
        //  Error: illegal character in line.  You'll probably
        //  want to throw an exception.
    }
}
if ( !currentToken.empty() ) {
    tokens.push_back( currentToken );
}

      

In this case, a sequence of alphanumeric characters is one token, as is any single punctuation character. You could go further, ensuring that a token is either all alpha or all digits, and perhaps regrouping sequences of punctuation, but this seems sufficient for your problem.

Once you have the list of tokens, you can do whatever checks you need (parentheses in the correct places, etc.) and convert the tokens you are interested in, if they need converting.
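To make the checking and converting step concrete, here is a rough sketch. It assumes a tokenizer in which every punctuation character becomes its own token, so the example line yields seven tokens. The names tokenize, toInt, DateFields and parseDate are invented for illustration, and illegal characters are simply skipped rather than reported.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Splits a line into runs of alphanumerics and single punctuation
// characters. Other non-space characters are silently skipped here.
std::vector<std::string> tokenize(const std::string& line)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char ch : line) {
        const unsigned char uch = static_cast<unsigned char>(ch);
        if (std::isalnum(uch)) {
            current.push_back(ch);
            continue;
        }
        if (!current.empty()) {          // any other character ends a token
            tokens.push_back(current);
            current.clear();
        }
        if (std::ispunct(uch)) {
            tokens.push_back(std::string(1, ch));
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Converts a token to int; fails unless the whole token is consumed.
bool toInt(const std::string& token, int& value)
{
    std::istringstream in(token);
    return (in >> value) && in.peek() == std::char_traits<char>::eof();
}

struct DateFields   // hypothetical holder for the parsed values
{
    int year, dayOfYear, day, hour;
    std::string month;
};

// Expects exactly: year dayOfYear ( day month ) hour
bool parseDate(const std::string& line, DateFields& out)
{
    const std::vector<std::string> t = tokenize(line);
    if (t.size() != 7 || t[2] != "(" || t[5] != ")") {
        return false;                    // parentheses missing or misplaced
    }
    if (!toInt(t[0], out.year) || !toInt(t[1], out.dayOfYear) ||
        !toInt(t[3], out.day) || !toInt(t[6], out.hour)) {
        return false;                    // a numeric field is not a number
    }
    out.month = t[4];
    return true;
}
```

The fixed token count is what keeps the check simple; a format with optional fields would need a small state machine instead.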

EDIT:

FWIW: I've been experimenting with using auto plus a lambda as a means of defining nested functions. I haven't made up my mind as to whether it's a good idea or not: I don't always find the results readable. But in this case:

auto pushToken = [&]() {
    if ( !currentToken.empty() ) {
        tokens.push_back( currentToken );
        currentToken = "";
    }
};

      



Put this just before the loop, then replace each of the if ( !currentToken.empty() ) blocks with pushToken(). (Or you could create a data structure holding tokens and currentToken, with a pushToken member function. This would work even pre-C++11.)
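The data-structure variant might look roughly like this. It is a sketch only; TokenCollector is an invented name, and nothing in it needs C++11.

```cpp
#include <string>
#include <vector>

// Hypothetical helper bundling the tokenizer state, so that the repeated
// "flush the pending token" logic lives in one member function.
struct TokenCollector
{
    std::vector<std::string> tokens;
    std::string currentToken;

    // Moves the pending token (if any) into the list and resets it.
    void pushToken()
    {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken = "";
        }
    }
};
```

Inside the loop you would then append characters to collector.currentToken and call collector.pushToken() wherever a token ends.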

EDIT:

One final remark, since the OP seems to want to do this exclusively with std::istream: the solution there would be to add a MustMatch manipulator:

class MustMatch
{
    char m_toMatch;
public:
    MustMatch( char toMatch ) : m_toMatch( toMatch ) {}
    friend std::istream& operator>>( std::istream& source, MustMatch const& manip )
    {
        char next;
        source >> next;
        //  or source.get( next ) if you don't want to skip whitespace.
        if ( source && next != manip.m_toMatch ) {
            source.setstate( std::ios_base::failbit );
        }
        return source;
    }
};

      

As @Angew pointed out, you'd also need a >> for the month; typically, months would be represented as a class, so you'd overload >> for it:

std::istream& operator>>( std::istream& source, Month& object )
{
    //      The sentry takes care of skipping whitespace, etc.
    std::istream::sentry guard( source );
    if ( guard ) {
        std::streambuf* sb = source.rdbuf();
        std::string monthName;
        while ( std::isalpha( sb->sgetc() ) ) {
            monthName += sb->sbumpc();
        }
        if ( !isLegalMonthName( monthName ) ) {
            source.setstate( std::ios_base::failbit );
        } else {
            object = Month( monthName );
        }
    }
    return source;
}

      

You could, of course, introduce many variants here: the month name could be limited to a maximum of 3 characters, for example (by making the loop condition monthName.size() < 3 && std::isalpha( sb->sgetc() )). But if you're dealing with months in any way in your code, writing a Month class and its >> and << operators is something you'll have to do sooner or later anyway.

Then something like:

source >> year >> dayOfYear >> MustMatch( '(' ) >> day >> month
       >> MustMatch( ')' ) >> hour;
if ( !source || ( source >> std::ws ).get() != EOF ) {
    //  Format error...
}

      

is all that is needed. (Using manipulators like this is another technique worth exploring.)
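Pulled together into something that compiles on its own, the manipulator approach could look like the sketch below. The Month here is a deliberately simplified stand-in that only stores the letters it read (a real class would validate the name), and parseLine is an invented wrapper. Trailing garbage after the hour is not checked in this version.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Manipulator: extracts one non-whitespace character and sets failbit
// if it is not the expected one.
class MustMatch
{
    char m_toMatch;
public:
    explicit MustMatch( char toMatch ) : m_toMatch( toMatch ) {}
    friend std::istream& operator>>( std::istream& source, MustMatch const& manip )
    {
        char next;
        if ( source >> next && next != manip.m_toMatch ) {
            source.setstate( std::ios_base::failbit );
        }
        return source;
    }
};

// Simplified stand-in for a real Month class (assumption for this sketch).
struct Month
{
    std::string name;
};

std::istream& operator>>( std::istream& source, Month& object )
{
    std::istream::sentry guard( source );   // skips leading whitespace
    if ( guard ) {
        std::streambuf* sb = source.rdbuf();
        std::string letters;
        // Consume letters only; the first non-letter stays in the buffer.
        for ( int c = sb->sgetc();
              c != std::char_traits<char>::eof() && std::isalpha( c );
              c = sb->snextc() ) {
            letters += static_cast<char>( c );
        }
        if ( letters.empty() ) {
            source.setstate( std::ios_base::failbit );
        } else {
            object.name = letters;
        }
    }
    return source;
}

// Hypothetical wrapper: true if the line has the expected shape.
bool parseLine( std::string const& line, int& year, int& dayOfYear,
                int& day, Month& month, int& hour )
{
    std::istringstream source( line );
    source >> year >> dayOfYear >> MustMatch( '(' ) >> day >> month
           >> MustMatch( ')' ) >> hour;
    return !source.fail();
}
```

Because Month's operator>> stops at the first non-letter, the closing parenthesis is still in the stream for MustMatch( ')' ) to consume, which is exactly what a plain std::string extraction gets wrong.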

+7




@Angew +1 for scanf(). It will do what you want in one line:



int day;
int hour;
char month[4];
int result = sscanf(ExampleString.c_str(), "%*d %*d (%d %3s) %d", &day, month, &hour);
if (result != 3)
{
    // parse error;
}
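If you also want the year and day of year, and want to reject trailing junk, a variant along these lines might work. It is a sketch: ParsedDate and parseWithSscanf are invented names, and it relies on %n, which records how many characters have been consumed so far without counting towards the return value.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

struct ParsedDate   // hypothetical holder for the extracted fields
{
    int year, dayOfYear, day, hour;
    char month[4];  // three letters plus the terminating '\0'
};

// Returns true only if all five fields matched and nothing but
// whitespace follows the hour.
bool parseWithSscanf(const std::string& line, ParsedDate& out)
{
    int consumed = 0;
    const int matched = std::sscanf(line.c_str(), "%d %d (%d %3s) %d %n",
                                    &out.year, &out.dayOfYear, &out.day,
                                    out.month, &out.hour, &consumed);
    return matched == 5
        && static_cast<std::size_t>(consumed) == line.size();
}
```

The space before %n skips any trailing whitespace, so comparing consumed with the line length catches leftovers such as extra fields.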

      

+3




Working example using regex: http://coliru.stacked-crooked.com/a/ac5a4c9269e94344 (no manual string parsing)

#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
    //int year, dayOfYear, day;
    //std::string month, leftParenthesis, rightParenthesis;
    std::string ExampleString = "2013 336 (02 DEC) 04";
    regex pattern("\\s*(\\d+)\\s+(\\d+)\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)\\s*");

    // Matching single string
    std::smatch sm;
    if (std::regex_match(ExampleString, sm, pattern)) {
        cout << "year: " << sm[1].str() << endl;
        cout << "dayOfYear: " << sm[2].str() << endl;
        cout << "day: " << sm[3].str() << endl;
        cout << "month: " << sm[4].str() << endl;
        cout << "hour: " << sm[5].str() << endl;
    }

    cout << endl;
    cout << endl;

    // If your data contains multiple lines to parse, use this version.
    // Unfortunately, it will skip any lines that do not match the pattern.
    ExampleString = "2013 336 (02 DEC) 04" "\n2014 336 (02 DEC) 04" "\n2015 336 (02 DEC) 04";
    for (sregex_iterator it(ExampleString.begin(), ExampleString.end(), pattern), end_it;
        it != end_it; ++it)
    {
        cout << "year: " << (*it)[1].str() << endl;
        cout << "dayOfYear: " << (*it)[2].str() << endl;
        cout << "day: " << (*it)[3].str() << endl;
        cout << "month: " << (*it)[4].str() << endl;
        cout << "hour: " << (*it)[5].str() << endl;
        cout << endl;
    }
}

      

Debuggex doesn't accept [[:alpha:]], so \w replaces it below, although [a-zA-Z] would be better:

\s*(\d+)\s+(\d+)\s*\((\d+)\s+(\w+)\)\s*(\d+)\s*

      


+2




If you really don't want to use regexes and you want the hack to look as similar as possible to what you already have ... you can just replace the parentheses in the string with spaces. (I'm not saying this is a good solution, but it's worth knowing about it.)

int year, dayOfYear, day, hour;
std::string month;
std::string ExampleString = "2013 336 (02 DEC) 04";

std::replace_if(ExampleString.begin(), ExampleString.end(), [](char c) { return c == '(' || c == ')'; }, ' ');

std::istringstream yearDayMonthHourStringStream( ExampleString );
yearDayMonthHourStringStream >> year >> dayOfYear >> day >> month >> hour;

      

+1




FWIW, you can get the streaming approach to work by reading the parentheses into char variables instead of strings, and having the parsing of month stop when it sees the right parenthesis... it gets a little ugly, though:

int year, dayOfYear, day, hour;
std::string month;
char leftParenthesis;
std::string ExampleString = "2013 336 (02 DEC) 04";

std::istringstream yearDayMonthHourStringStream( ExampleString );
if (yearDayMonthHourStringStream >> year >> dayOfYear >> leftParenthesis
        >> day >> std::ws &&
    leftParenthesis == '(' &&
    // getline extracts and discards the ')' delimiter itself, so there is
    // nothing left to read into a separate right-parenthesis variable.
    getline(yearDayMonthHourStringStream, month, ')') &&
    yearDayMonthHourStringStream >> hour)
{
    // ...use your variables...
}
else
{
    // ...report bad input (if ')' is missing, reading hour fails)...
}

      

(std::ws, declared in <istream>, is used so that the tolerance for whitespace is consistent throughout.)

+1








