Fastest way to determine if a string contains a real or integer value
I am trying to write a function that can detect if a string contains a real or an integer value.
This is the simplest solution I could think of:
int containsStringAnInt(char* strg){
    for (int i = 0; i < strlen(strg); i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
But this solution is very slow when the string is long ... Any suggestions for optimization? Any help would be really appreciated!
You are using strlen, which means you are not worried about Unicode. In that case, why use strlen or strchr at all? Just check for '\0' (the null character):
int containsStringAnInt(char* strg){
    for (int i = 0; strg[i] != '\0'; i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
This walks the string only once, rather than walking it again on every loop iteration.
Is your string hundreds of characters long? If not, don't worry about potential performance issues. The only inefficiency is that you call strlen() in the loop condition, which means many extra passes over the string (inside strlen). For a simpler solution with the same time complexity (O(n)), but probably slightly faster, use strchr().
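A minimal sketch of the strchr() variant, keeping the function name and the (naive) "no dot means integer" test from the question. It does not check that the string is actually a number, and the const qualifier on the parameter is my addition:

```c
#include <string.h>

/* Returns 1 if strg contains no '.', 0 otherwise.
   Same naive test as in the question; it does not validate
   that the string is a well-formed number. */
int containsStringAnInt(const char *strg) {
    /* strchr returns NULL when the character is not found. */
    return strchr(strg, '.') == NULL;
}
```

With this, containsStringAnInt("12345") yields 1 and containsStringAnInt("3.14") yields 0. The string is scanned at most once, and the scan stops at the first '.'.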
Your function does not account for exponential notation for reals (1E7 and 1E-7 are both doubles).
Use strtol() to try to convert the string to an integer first; through its end-pointer argument it also reports the first position in the string where parsing stopped (that will be "." if the number is a real). If parsing stopped at '.', use strtod() to try to convert the string to a double. Again, the function reports the position in the string where parsing stopped.
Don't worry about performance until you have profiled the program. Otherwise, for the fastest code, construct a regular expression that describes valid number syntax, convert it first into an FSM (finite-state machine), and then into highly optimized code.
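To illustrate the FSM idea, here is a rough hand-written sketch, not generated from a regex and not tuned for speed. The grammar is my assumption: an optional sign, digits with an optional '.', and an optional exponent. It accepts no whitespace, hex, inf/nan, or locale-specific decimal separators, and does no range checking. The names classify_number, num_kind, and the state names are hypothetical.

```c
#include <ctype.h>

/* Hypothetical result type for this sketch. */
enum num_kind { KIND_INTEGER, KIND_REAL, KIND_NEITHER };

/* States of the hand-written automaton. */
enum state { S_START, S_SIGN, S_INT, S_DOT, S_FRAC,
             S_EXP, S_EXP_SIGN, S_EXP_DIGITS };

enum num_kind classify_number(const char *s) {
    enum state st = S_START;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        int digit = isdigit(c);
        int sign = (c == '+' || c == '-');
        int expo = (c == 'e' || c == 'E');
        switch (st) {
        case S_START:            /* nothing read yet */
        case S_SIGN:             /* leading sign read */
            if (digit) st = S_INT;
            else if (c == '.') st = S_DOT;
            else if (sign && st == S_START) st = S_SIGN;
            else return KIND_NEITHER;
            break;
        case S_INT:              /* integer digits read */
            if (digit) st = S_INT;
            else if (c == '.') st = S_FRAC;
            else if (expo) st = S_EXP;
            else return KIND_NEITHER;
            break;
        case S_DOT:              /* '.' with no digits before it */
            if (digit) st = S_FRAC;
            else return KIND_NEITHER;
            break;
        case S_FRAC:             /* '.' seen, at least one digit overall */
            if (digit) st = S_FRAC;
            else if (expo) st = S_EXP;
            else return KIND_NEITHER;
            break;
        case S_EXP:              /* 'e' or 'E' read */
            if (digit) st = S_EXP_DIGITS;
            else if (sign) st = S_EXP_SIGN;
            else return KIND_NEITHER;
            break;
        case S_EXP_SIGN:         /* exponent sign read */
        case S_EXP_DIGITS:       /* exponent digits read */
            if (digit) st = S_EXP_DIGITS;
            else return KIND_NEITHER;
            break;
        }
    }
    /* Only some states are accepting. */
    if (st == S_INT) return KIND_INTEGER;
    if (st == S_FRAC || st == S_EXP_DIGITS) return KIND_REAL;
    return KIND_NEITHER;
}
```

Under these assumptions, "42" is classified as an integer, "3.14" and "1E-7" as reals, and "12abc" or "1e" as neither. The string is read once, and the loop exits at the first character that cannot continue a number.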
First, the standard note: please don't worry too much about performance unless you have profiled :)
I'm not sure about looping manually and checking for a dot. Two issues:
- Depending on the locale, the decimal point may actually be a "," (as it is here in Germany :)
- As others have pointed out, there is a problem with numbers like 1e7
Previously, I had a version using sscanf, but performance measurements showed that sscanf is significantly slower for large datasets. So I'll show the faster solution first (it is also simpler: I had a few bugs in the sscanf version before I got it working, whereas the strto[ld] version worked on the first try):
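#include <stdlib.h> /* strtol, strtod */

enum {
    REAL,
    INTEGER,
    NEITHER_NOR
};

int what(char const* strg){
    char *endp;
    /* An integer if strtol consumes the whole (non-empty) string. */
    strtol(strg, &endp, 10);
    if(*strg && !*endp)
        return INTEGER;
    /* Otherwise a real if strtod consumes the whole string. */
    strtod(strg, &endp);
    if(*strg && !*endp)
        return REAL;
    return NEITHER_NOR;
}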
Just for fun, here's the sscanf version:
#include <cstdio> // std::sscanf

int what(char const* strg) {
    // test for int
    {
        int d; // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%d %n", &d, &n);
        if(!strg[n] && rd == 1) {
            return INTEGER;
        }
    }
    // test for double
    {
        double v; // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%lf %n", &v, &n);
        if(!strg[n] && rd == 1) {
            return REAL;
        }
    }
    return NEITHER_NOR;
}
I think this should work. Have fun.
The test converted randomly chosen (small) test strings 10,000,000 times in a loop:
- 6.6s for sscanf
- 1.7s for strto[dl]
- 0.5s for manual looping up to "."
A clear win for strto[ld]: considering that it parses numbers correctly, I would declare it the winner over manual looping. Anyway, 1.2s / 10000000 = 0.00000012s difference per conversion is not much in the end.
strlen walks the whole string to find its length.
You are calling strlen on every pass through the loop. Consequently, you are walking the string far more often than you need to. This tiny change should give you a huge performance improvement:
int containsStringAnInt(char* strg){
    int len = strlen(strg);
    for (int i = 0; i < len; i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
Note that all I did was compute the length of the string once, at the beginning of the function, and then reuse that value in the loop condition.
Please let us know what performance improvement it brings you.
@Aaron, with your approach you are still looping through the string twice: once inside strlen and again in the for loop. The best way to loop over an ASCII string is to check for the null character in the loop itself. Take a look at my answer, which walks the string only once and can stop early if it finds a ".". That way, if the string is something like 0.01xxx (100 characters long), you don't have to go all the way to the end just to find the length.