C++: extract data from a string

What's an elegant way to extract data from a string (perhaps using the boost library)?

Content-Type: text/plain
Content-Length: 15
Content-Date: 2/5/2013
Content-Request: Save

hello world

      

Let's say I have the text above and want to extract all the fields, including the greeting text. Can anyone point me in the right direction?

+3




8 answers


Here's a pretty compact one, written in C: https://github.com/openwebos/nodejs/blob/master/deps/http_parser/http_parser.c



+4




Try



  • http://pocoproject.org/

    Comes with HTTPServer and Client implementations

  • http://cpp-netlib.github.com/

    Comes with request / response processing

  • Boost Spirit demo: http://liveworkspace.org/code/3K5TzT

    You will need to provide a simple grammar (or complex grammar if you want to catch all the intricacies of HTTP)

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    #include <boost/fusion/adapted.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/karma.hpp>
    
    typedef std::map<std::string, std::string> Headers;
    typedef std::pair<std::string, std::string> Header;
    struct Request { Headers headers; std::vector<char> content; };
    
    BOOST_FUSION_ADAPT_STRUCT(Request, (Headers, headers)(std::vector<char>, content))
    
    namespace qi    = boost::spirit::qi;
    namespace karma = boost::spirit::karma;
    
    template <typename It, typename Skipper = qi::blank_type>
        struct parser : qi::grammar<It, Request(), Skipper>
    {
        parser() : parser::base_type(start)
        {
            using namespace qi;
    
            header = +~char_(":\n") > ": " > *(char_ - eol);
            start = header % eol >> eol >> eol >> *char_;
        }
    
      private:
        qi::rule<It, Header(),  Skipper> header;
        qi::rule<It, Request(), Skipper> start;
    };
    
    bool doParse(const std::string& input)
    {
        auto f(begin(input)), l(end(input));
    
        parser<decltype(f), qi::blank_type> p;
        Request data;
    
        try
        {
            bool ok = qi::phrase_parse(f,l,p,qi::blank,data);
            if (ok)   
            {
                std::cout << "parse success\n";
                std::cout << "data: " << karma::format_delimited(karma::auto_, ' ', data) << "\n";
            }
            else      std::cerr << "parse failed: '" << std::string(f,l) << "'\n";
    
            if (f!=l) std::cerr << "trailing unparsed: '" << std::string(f,l) << "'\n";
            return ok;
        } catch(const qi::expectation_failure<decltype(f)>& e)
        {
            std::string frag(e.first, e.last);
            std::cerr << e.what() << "'" << frag << "'\n";
        }
    
        return false;
    }
    
    int main()
    {
        const std::string input = 
            "Content-Type: text/plain\n"
            "Content-Length: 15\n"
            "Content-Date: 2/5/2013\n"
            "Content-Request: Save\n"
            "\n"
            "hello world";
    
        bool ok = doParse(input);
    
        return ok? 0 : 255;
    }
    
          

+4




There are several solutions. If the format is that simple, you can just read the input line by line. If a line starts with a key, split it at the separator to get the value. If it does not, the line itself is the value. This can be done with the STL very easily and quite elegantly.

If the grammar is more complex, and since you've added boost to the tags, you might consider using Boost Spirit to parse it and extract the values.

+2




The simplest solution, I believe, is to use regular expressions. C++11 has regular expressions in the standard library, and Boost provides them as well.

+2




You can use string::find with a space to locate where each value starts, and then copy from that position until you find '\n'.

+1




If you want to write code to parse it yourself, start by looking at the HTTP spec for that. This will give you the grammar:

    generic-message = start-line
                      *(message-header CRLF)
                      CRLF
                      [ message-body ]
    start-line      = Request-Line | Status-Line

      

So the first thing I would do is use split() on CRLF to break the message into its component lines. Then you can iterate over the resulting vector. Until you reach an empty element (a bare CRLF), you are parsing headers, so you split each line at the first ":" to get the key and value.

Once you hit the empty element, you are parsing the message body.

Warning: Having done this myself in the past, I can tell you that not all web servers are consistent about line endings (you may find only CR or only LF in places), and not all browsers and other layers of abstraction agree on what they pass along to you. So you may find additional CRLFs in places you would not expect, or missing CRLFs in places where you do expect them. Good luck.

+1




If you are willing to write the loop by hand, you can also use std::istringstream and the normal overloads of the extraction operator (with appropriate manipulators, such as std::get_time() for working with dates) to extract your data in a simple way.

Another possibility is to use std::regex to match all patterns such as <string>:<string>, and iterate over all the matches (the egrep grammar seems promising if you have multiple lines to process).

Or, if you want a more heavyweight approach and your string has a specific syntax, you can use Boost.Spirit to define the grammar and generate the parser.

0




If you have access to C++11, you can use std::regex ( http://en.cppreference.com/w/cpp/regex ).

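    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::string input = "Content-Type: text/plain";
        std::regex contentTypeRegex("Content-Type: (.+)");

        std::smatch match;

        // regex_match succeeds only if the whole input matches the pattern
        if (std::regex_match(input, match, contentTypeRegex)) {
            std::ssub_match contentTypeMatch = match[1];  // first capture group
            std::string contentType = contentTypeMatch.str();
            std::cout << contentType;
        }
        // else: not found
    }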

      

A compiling, working version is here: http://ideone.com/QTJrue

This regex is a very simplified case, but the same principle applies to multiple fields.

0








