Fastest way to parse a string of numbers into a vector of ints

I am wondering what the fastest way is to parse a string of numbers into a vector of ints. I will have millions of rows of data formatted like this:

>Header-name
ID1    1    1   12
ID2    3    6   234
.
.
.
>Header-name
ID1    1    1   12
ID2    3    6   234
.
.
.

      

I would like to drop the header field (or perhaps use it to sort later), ignore the ID field, and put the remaining three ints into the vector. I understand that I could just use boost split and then lexical_cast in multiple loops, with logic to ignore certain data, but I'm not sure that would give me the fastest solution. I have looked at Boost Spirit, but I really don't understand how to use it. Boost or STL are both fine.

+3




4 answers


Do you need to use boost? I have been using this function for a while. I believe I got it from Accelerated C++ and have been using it ever since. Your separator appears to be a tab or multiple spaces. If you pass " " (or "\t") as the delimiter it might work. It will depend on what is actually in the data.

std::vector<std::string> split( const std::string& line, const std::string& del )
{
        std::vector<std::string> ret;
        size_t i = 0;

        while ( i != line.size() ) {

                while ( ( i != line.size() ) && ( line.substr(i, 1) == del ) ) {
                        ++i;
                }

                size_t j = i;

                while ( ( j != line.size() ) && ( line.substr(j, 1) != del ) ) {
                        ++j;
                }

                if ( i != j ) {
                        ret.push_back( line.substr( i, j - i ) );
                        i = j;
                }
        }

        return ret;
}

      

You can get each line like this:



int main() {
    std::string line;
    std::vector<std::string> lines; 
    while ( std::getline( std::cin, line ) ) {
        lines.push_back( line );
    }

    for ( auto it = lines.begin(); it != lines.end(); it++ ) {
        std::vector<std::string> vec = split( *it, " " );
        // Do something
    }
}

      

You can make it return a std::vector<int> with a quick modification: convert each string to an int with atoi(myString.c_str()). You will also want to add a check to skip the headers. That should be trivial.

Please note that I didn't compile the code above. ;)
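
For illustration, here is a rough sketch of the modification described above. It is not the answerer's code: split_ws and parse_ints are names made up for this example, and it assumes fields are separated by runs of spaces or tabs and that header lines start with '>'.

```cpp
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): split on runs of spaces or tabs.
std::vector<std::string> split_ws(const std::string& line)
{
    std::vector<std::string> ret;
    std::size_t i = 0;
    while (i != line.size()) {
        // Skip leading delimiters.
        while (i != line.size() && (line[i] == ' ' || line[i] == '\t')) ++i;
        std::size_t j = i;
        // Advance to the end of the current field.
        while (j != line.size() && line[j] != ' ' && line[j] != '\t') ++j;
        if (i != j) {
            ret.push_back(line.substr(i, j - i));
            i = j;
        }
    }
    return ret;
}

// Reads every line, skips '>' headers and the ID column, collects the ints.
std::vector<int> parse_ints(std::istream& in)
{
    std::vector<int> out;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '>') continue;    // blank line or header
        const std::vector<std::string> fields = split_ws(line);
        for (std::size_t k = 1; k < fields.size(); ++k)  // k == 0 is the ID
            out.push_back(std::atoi(fields[k].c_str()));
    }
    return out;
}
```

Calling parse_ints(std::cin) would read the whole input. Note that atoi returns 0 for fields that are not numbers, so malformed rows are not detected here.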

+1




For this particular problem, if you want the fastest, I would recommend parsing manually, one char at a time. Boost Spirit will probably come in a close second and save you a lot of ugly code.

Parsing one char at a time by hand is the key to high speed, because even well-optimized converters like atoi and strtol have to deal with many different numeric representations, while your example seems to imply that you are only interested in plain unsigned integers. Formatted I/O (scanf, operator>>, etc.) is very slow. Reading lines into intermediate strings will likely have a visible cost as well.



Your problem is simple enough to parse manually, assuming the header lines do not contain any "\t" (and assuming no I/O or format errors exist):

#include <iostream>
#include <sstream>
#include <vector>
#include <string>

std::vector<unsigned> parse(std::istream &is)
{
    bool skipField = true;
    char c;
    unsigned value = 0;
    std::vector<unsigned> result;
    while (is.get(c))
    {
        if (('\t' == c) || ('\n' == c))
        {
            if (!skipField)
            {
                result.push_back(value);
            }
            skipField = ('\n' == c);
            value = 0;
        }
        else if (!skipField)
        {
            value *= 10;
            value += (c - '0');
        }
    }
    return result;
}

int main()
{
    const std::string data = ">Header-name\nID1\t1\t1\t12\nID2\t3\t6\t234\n";
    std::istringstream is(data);
    const std::vector<unsigned> v = parse(is);
    for (unsigned u: v)
    {
        std::cerr << u << std::endl;
    }
}
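
As a variation on the same idea, the sketch below scans a buffer that is already in memory, which avoids the per-character stream call. It is only an illustration under stated assumptions, not part of the answer above: parse_buffer is a made-up name, header lines are assumed to start with '>', fields may be separated by spaces or tabs, and there is no overflow or error checking.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: hand-rolled scan over an in-memory buffer.
std::vector<unsigned> parse_buffer(const std::string& buf)
{
    std::vector<unsigned> result;
    const char* p = buf.data();
    const char* const end = p + buf.size();
    while (p != end) {
        if (*p == '>') {
            // Header line: skip everything up to the newline.
            while (p != end && *p != '\n') ++p;
        } else {
            // Skip the ID field (the first token on the line).
            while (p != end && *p != ' ' && *p != '\t' && *p != '\n') ++p;
            // Accumulate each run of digits on the rest of the line.
            while (p != end && *p != '\n') {
                if (*p >= '0' && *p <= '9') {
                    unsigned value = 0;
                    while (p != end && *p >= '0' && *p <= '9') {
                        value = value * 10 + static_cast<unsigned>(*p - '0');
                        ++p;
                    }
                    result.push_back(value);
                } else {
                    ++p;  // separator or any other character
                }
            }
        }
        if (p != end) ++p;  // consume the '\n'
    }
    return result;
}
```

Whether this is actually faster than the stream version depends on the platform and on how the data is loaded, so it is worth measuring on real input.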

      

+1




As always, with delightfully underspecified questions like this, there isn't much more to do than show "a way" of doing "a thing". In this case, I used Boost Spirit (because you mentioned it):

Parsing into flat containers

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::string const input(
    ">Header - name1\n"
    "ID1    1    1   12\n"
    "ID2    3    6   234\n"
    ">Header - name2\n"
    "ID3    3    3   14\n"
    "ID4    5    8   345\n"
);

using Header    = std::string;
using Container = std::vector<int>;
using Data      = std::map<Header, Container>;

int main()
{
    namespace qi = boost::spirit::qi;

    auto f(input.begin()), l(input.end());

    Data data;
    bool ok = qi::phrase_parse(f, l,
        *(
            '>' >> qi::raw[*(qi::char_ - qi::eol)] >> qi::eol
           >> *(!qi::char_('>') >> qi::omit[qi::lexeme[+qi::graph]] >> *qi::int_ >> qi::eol)
        ), qi::blank, data);

    if (ok)
    {
        std::cout << "Parse success\n";
        for (auto const& entry : data)
        {
            std::cout << "Integers read with header '" << entry.first << "':\n";
            for (auto i : entry.second)
                std::cout << i << " ";
            std::cout << "\n";
        }
    }
    else
    {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining input: '" << std::string(f, l) << "'\n";
}

      

This prints:

Parse success
Integers read with header 'Header - name1':
1 1 12 3 6 234
Integers read with header 'Header - name2':
3 3 14 5 8 345

      

Parsing into nested containers

Of course, if you want a separate vector for each line (don't expect efficiency), you can simply replace the type alias:

// additionally needs <algorithm>, <iterator> and <list>
using Container = std::list<std::vector<int> >; // or any other nested container

// to make printing work without further change:
std::ostream& operator<<(std::ostream& os, std::vector<int> const& v)
{
    os << "[";
    std::copy(v.begin(), v.end(), std::ostream_iterator<int>(os, " "));
    return os << "]";
}

      

This prints:

Parse success
Integers read with header 'Header - name1':
[1 1 12 ] [3 6 234 ]
Integers read with header 'Header - name2':
[3 3 14 ] [5 8 345 ]

      

+1




You can use something like the following. Instead of the array of strings used here, you would read the strings from the file.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

int main() 
{
    std::string s[] = { "ID1    1    1   12", "ID2    3    6   234" };
    std::vector<int> v;

    for ( const std::string &t : s )
    {
        std::istringstream is( t );
        std::string tmp;

        is >> tmp;

        v.insert( v.end(), std::istream_iterator<int>( is ), 
                           std::istream_iterator<int>() );
    }                         

    for ( int x : v ) std::cout << x << ' ';
    std::cout << std::endl;

    return 0;
}

      

Output

1 1 12 3 6 234 

      

As for the header, you can check whether tmp is a header and, if it is, skip that record.

Here is a simplified version

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

int main() 
{
    std::string s[] = 
    { 
        "ID1    1    1   12", 
        ">Header-name", 
        "ID2    3    6   234" 
    };

    std::vector<int> v;

    for ( const std::string &t : s )
    {
        std::istringstream is( t );
        std::string tmp;

        is >> tmp;

        if ( tmp[0] == '>' ) continue;

        v.insert( v.end(), std::istream_iterator<int>( is ), 
                           std::istream_iterator<int>() );
    }                         

    for ( int x : v ) std::cout << x << ' ';
    std::cout << std::endl;

    return 0;
}

      

The output will be the same as above.
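
To read the strings from a file rather than from an array, the same loop could be driven by std::getline. This is only a sketch: read_ints is a name invented for this example, and it assumes that header lines start with '>' and that every other line begins with an ID token.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper around the loop above, reading lines from any stream.
std::vector<int> read_ints(std::istream& in)
{
    std::vector<int> v;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream is(line);
        std::string tmp;

        // Skip blank lines (no token at all) and header lines.
        if (!(is >> tmp) || tmp[0] == '>') continue;

        // tmp held the ID; the remaining tokens are read as ints.
        v.insert(v.end(), std::istream_iterator<int>(is),
                          std::istream_iterator<int>());
    }
    return v;
}
```

With a std::ifstream opened on the data file, read_ints(file) would return all the integers in file order. Reading stops at the first token on a line that is not an int, so a line such as "." simply contributes nothing.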

0

