Fastest way to parse a file?

I am writing a graph library that needs to read the most common graph formats. One format contains lines like the following:

e 4 3
e 2 2
e 6 2
e 3 2
e 1 2
....


and I want to parse these lines. I looked on Stack Overflow and found a neat solution for this. I am currently using an approach like this (`file` is an fstream):

string line;
while (getline(file, line)) {
    if (line.empty()) continue; // skip empty lines
    stringstream parseline(line);
    char identifier;
    parseline >> identifier; // read the first character
    if (identifier == 'e') {
        int n, m;
        parseline >> n >> m;
        foo(n, m); // here I handle the input
    }
}


It works well for its intended purpose, but today, when I tested it with huge graph files (50 MB+), I was shocked to find that this function was the biggest bottleneck in the entire program:

The stringstream I use to parse each line takes almost 70% of the total execution time, and getline takes another 25%. The rest of the program accounts for only 5%.

Is there a faster way to read these large files, perhaps avoiding stringstream and getline?



2 answers


You can skip double-buffering your string, skip parsing a single character, and use `strtoll` to parse the integers, like this:

string line;
while (getline(file, line)) {
    if (line.empty()) continue; // skip empty lines
    if (line[0] == 'e') {
        char* ptr;
        int n = strtoll(line.c_str() + 2, &ptr, 10); // parse from just after "e "
        int m = strtoll(ptr + 1, &ptr, 10);          // continue after the space
        foo(n, m); // here I handle the input
    }
}




In C++, `strtoll` is declared in the `<cstdlib>` header.



mmap the file and treat it as one big buffer.

If you don't have mmap, you can try reading the whole file into a buffer you malloc.



Rationale: most of the time is spent crossing between user space and the kernel inside the C library calls. Reading the entire file at once eliminates almost all of those transitions.


