Find both min and max together: the algorithm should be faster, but not
I am trying to implement an algorithm to find the minimum and maximum values ββamong many lengths in a file. My test file is one billion in length.
The algorithm works as expected, but doesn't run faster than the naive version. This should be significantly faster as the naive version does roughly 2n comparisons whereas this version does 3n / 2 comparisons.
$ time ./findminmax_naive somelongs
count: 1000000000
min: 0
max: 2147483647
real 0m24.156s
user 0m4.956s
sys 0m3.896s
$ time ./findminmax_faster somelongs
count: 1000000000
min: 0
max: 2147483647
real 0m25.048s
user 0m6.948s
sys 0m3.980s
Here's the "naive" version:
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int ac, char *av[])
{
FILE *f;
long count, readcount, i, min, max;
size_t rlen;
long *n;
if (ac != 2 && ac != 3) {
fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
exit(1);
}
f = fopen(av[1], "r");
if (f == NULL)
perror("fopen");
readcount = 1024;
if (ac == 3)
readcount = atol(av[2]);
n = alloca(sizeof (long) * readcount);
rlen = fread(n, sizeof (*n), 1, f);
min = max = n[0];
count = 1;
while (1) {
rlen = fread(n, sizeof (*n), readcount, f);
for (i = 0; i < (long)rlen; i++) {
count++;
if (n[i] < min)
min = n[i];
if (n[i] > max)
max = n[i];
}
if (feof(f))
break;
}
printf("count: %ld\n", count);
printf("min: %ld\n", min);
printf("max: %ld\n", max);
exit(0);
}
Here is the code for the "future" version "faster":
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int ac, char *av[])
{
FILE *f;
long count, readcount, i, min, max;
size_t rlen;
long *n;
if (ac != 2 && ac != 3) {
fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
exit(1);
}
f = fopen(av[1], "r");
if (f == NULL)
perror("fopen");
readcount = 1024;
if (ac == 3)
readcount = atol(av[2]);
readcount = (readcount + 1) & (-1 << 1);
n = alloca(sizeof (long) * readcount);
rlen = fread(n, sizeof (*n), 1, f);
min = max = n[0];
count = 1;
while (1) {
rlen = fread(n, sizeof (*n), readcount, f);
for (i = 0; i < (long)rlen; i += 2) {
count += 2;
if (n[i] < n[i + 1]) {
if (n[i] < min)
min = n[i];
if (n[i + 1] > max)
max = n[i + 1];
} else {
if (n[i + 1] < min)
min = n[i + 1];
if (n[i] > max)
max = n[i];
}
}
if (feof(f))
break;
}
if (rlen % 2) {
if (n[rlen - 1] < min)
min = n[rlen - 1];
if (n[rlen - 1] > max)
max = n[rlen - 1];
count++;
}
printf("count: %ld\n", count);
printf("min: %ld\n", min);
printf("max: %ld\n", max);
exit(0);
}
Do you know what I missed?
Thank you for your help.
- Jeremy
source to share
The key is branch prediction. If the file is not sorted in the pathological worst-case order, the naive version will execute 2n branches, which are predicted correctly almost every time. Your smart version does n / 2 branches that are almost never predicted correctly, and an additional n comparisons that can be predicted correctly.
How much mispredicted branch costs are highly dependent on the processor architecture and even the particular processor model, but at least I expect a mispredicted branch to cost several times more than a correctly predicted one. As a last resort, correctly predicted branches can have an effective zero cycle cost.
As an interesting example, I recently experimented with optimizations strlen
and found that individually the extremely naive expanded strlen
- comparing and branching one byte at a time - was faster than clever vectorized approaches. This is almost certainly because it strlen
has the special property that every branch to the last will always be correctly predicted.
By the way, to test my hypothesis, try this input pattern:
999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...
This will give the worst case prediction for the naive algorithm and the best case for the outer conditional on your smart version.
source to share
as @chr said, "The I / O file will overshadow any optimization in the algorithm itself."
Also, less comparison does not mean less time consumption . These two algorithms have O (n) time complexity that ignores actual operator overhead and abstract overhead.
For example, as two rough frames of these two algorithms, the time consumption is the time of all operating costs in your program.
For example:
//max and min initlaized as 0.
//c1,... reprents the time cost of each instruction.
while(i<count) {//c1
if(a[i]>max) //c2
max = a[i]; //c3
i++; //c4
}
//search of min is like below
cost of time:
T1 = 2n * c1 + 2n * c2 + x * c3 + y * c3 + 2n * c4 = 2n * (c1 + c2 + c4) + (x + y) * c3
which x and y match the order of your data.
And, comparison (3/2) n,
while(i<count) //c1
if(a[i]<a[i+1]) {//c5
if(a[i]<min) //c2
min = a[i]; //c3
if(a[i+1>max]) //c2
max = a[i+1]; //c3
}
else
...
//same as below,that swap i and i+1
i+=2; //c6
}
cost of time:
T2 = n * c1 + n * c5 + n * 2 * c2 + (x '+ y') * c3 + n * c6 = n * (c1 + c5 + c6) + 2n * c2 + (x '+ y' ) * c3
if max and min are the first two items of your data, x = x '= 1; y = y '= 1.
T1-T2 = n * c1 + 2n * c4-n * c5 -n * c6. to differentiate the complier, T1-T2 may be different.
More complicated is that x, y, x ', y' are variables, but I will not continue to discuss this. My goal is to show what the real value of labor time is.
If you are still confused after reading this above, you should read Chapter 2.2 "Introduction to Algorithms".
source to share
First of all, excuse me for answering all questions in one answer. I know I shouldn't be doing this on stackoverflow.com, but given that different topics are more or less intertwined, this is easier.
Introduction
So here's the code I'm currently using to test the algorithm. Differences with previous versions:
- it includes both versions that you choose with the command line argument;
- @chr, @Dukeling: it mmap (2) s file to prevent system or library calls;
- @chr, @Dukeling: it has a "warm-up" option to eliminate all pages in memory before running the selected algorithm;
- @Dukeling: The program will record the time taken by the algorithm alone using gettimeofday (2).
Here is the code:
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define ROUNDUP(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
#define USCARRY(x) ((x) < 0 ? (x) + 1000000 : (x))
int
main(int ac, char *av[])
{
struct stat st;
struct timeval start, end;
long count, min, max, pgsiz;
long *n, *p, *endp;
int fd, warmup;
if (ac != 3 && ac != 4) {
fprintf(stderr, "Usage: %s <\"trivial\"|\"faster\"> "
"<file> [warmup]\n", av[0]);
exit(1);
}
fd = open(av[2], O_RDONLY);
if (fd == -1)
perror("open");
fstat(fd, &st);
pgsiz = sysconf(_SC_PAGESIZE);
n = mmap(NULL, ROUNDUP(st.st_size, pgsiz), PROT_READ,
MAP_SHARED, fd, 0);
if (n == MAP_FAILED)
perror("mmap");
warmup = 0;
if (ac == 4)
warmup = atoi(av[3]);
// warm up the filesystem cache
count = st.st_size / sizeof (*n);
endp = &n[count - 1];
//printf("%zu\n", sizeof (*p));
//printf("%zu\n", sizeof (count));
//exit(0);
if (warmup) {
for (p = n; p <= endp; p++) {
fwrite(p, sizeof (*p), 1, stdout);
min = *p;
}
}
// start algorithm
gettimeofday(&start, NULL);
if (!strcmp(av[1], "trivial")) {
min = max = n[0];
p = &n[1];
while (p <= endp) { // c1 * n
if (p[0] < min) // c2 * n
min = p[0]; // c3 * x
if (p[0] > max) // c2 * n
max = p[0]; // c3 * y
p++; // c4 * n
}
} else if (!strcmp(av[1], "faster")) {
min = max = n[0];
p = &n[1];
while (p < endp) { // c1 * n/2
if (p[0] < p[1]) { // c2 * n/2
if (p[0] < min) // c2 * n/4
min = p[0]; // c3 * x/2
if (p[1] > max) // c2 * n/4
max = p[1]; // c3 * y/2
} else {
if (p[1] < min) // c2 * n/4
min = p[1]; // c3 * x/2
if (p[0] > max) // c2 * n/4
max = p[0]; // c3 * y/2
}
p += 2; // c5 * n
}
if (p == endp) {
if (*endp < min)
min = *endp;
else if (*endp > max)
max = *endp;
}
} else {
printf("sorry\n");
exit(1);
}
gettimeofday(&end, NULL);
printf("time: %ld.%ld\n", end.tv_sec - start.tv_sec,
USCARRY(end.tv_usec - start.tv_usec));
printf("count: %ld\n", count);
printf("min: %ld\n", min);
printf("max: %ld\n", max);
exit(0);
}
Test cases
Here are the files I'm using for the test case:
$ ls -l _*
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:37 _bestcase
-rw-r--r-- 1 jlh jlh 2400000000 May 27 08:40 _random
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:38 _worstcase
$ od -N 64 -t dL _bestcase
0000000 0 300000000
0000020 1 299999999
0000040 2 299999998
0000060 3 299999997
0000100
$ od -N 64 -t dL _random
0000000 8600270969454287374 8436406000964321037
0000020 7348498334162074664 2050332610730417325
0000040 8183771519386464726 4134520779774161130
0000060 2045475555785071180 2859007060406233926
0000100
$ od -N 64 -t dL _worstcase
0000000 150000000 150000000
0000020 149999999 150000001
0000040 149999998 150000002
0000060 149999997 150000003
0000100
I / O penalty
Okay, first open the cache and make sure there is no missing page that could mess up the results:
$ ./findminmax trivial _random
time: 3.543573
count: 300000000
min: 31499144704
max: 9223372004409096866
$ ./findminmax trivial _random
time: 1.466323
count: 300000000
min: 31499144704
max: 9223372004409096866
$ perf stat -e minor-faults,major-faults ./findminmax trivial _random
time: 1.284729
count: 300000000
min: 31499144704
max: 9223372004409096866
Performance counter stats for './findminmax trivial _random':
586,066 minor-faults
0 major-faults
1.350118552 seconds time elapsed
So, you can see that there were no major page errors. From now on, we can agree that there will be no I / O impact. 2. Number of teams
Disadvantages of teams and missing branches
@ H2CO3, @vvy, you are absolutely correct that other instructions count as well (I will assume that each operation takes the same number of cpu cycles here, and will live in case of cpu cache failure). I don't know why the literary man I've read so far about algorithms only focuses on the number of comparisons (well, I admit I haven't read much though;)).
I have commented out each step in the loop with my own average case calculation (I think the worst case is not interesting here) and this is slightly different for yours.
If I'm not mistaken: - For the trivial version: n * (c1 + 2 * c2 + c4) + (x + y) * c3 - For the faster version: n / 2 * (c1 + 3 * c2 + c5) + ( x + y) * c3
Now, in my understanding, it is difficult to go further and speculate about how many CPU cycles each cN takes as it varies from CPU to CPU.
Let's check how many commands, branches, and branch skips happened on my machine, which is basically idle while each algorithm is executed in each warm cache test case (note that I checked each case 3 times to make sure that no significant deviation):
Random case
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _random
time: 1.547087
count: 300000000
min: 31499144704
max: 9223372004409096866
Performance counter stats for './findminmax_stackoverflow trivial _random':
1,083,101,126 branches
52,388 branch-miss
4,335,175,257 instructions # 0.00 insns per cycle
1.623851849 seconds time elapsed
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _random
time: 2.748967
count: 300000000
min: 31499144704
max: 9223372004409096866
Performance counter stats for './findminmax_stackoverflow faster _random':
783,120,927 branches
75,063,008 branch-miss
3,735,286,264 instructions # 0.00 insns per cycle
1.824884443 seconds time elapsed
Note that we have fewer instructions for the faster version, but nevertheless it will take a lot longer, possibly due to the number of branch skips in order or magnitude!
Best case
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _bestcase
time: 1.267697
count: 300000000
min: 0
max: 300000000
Performance counter stats for './findminmax_stackoverflow trivial _bestcase':
1,082,801,759 branches
49,802 branch-miss
4,334,200,448 instructions # 0.00 insns per cycle
1.343425753 seconds time elapsed
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _bestcase
time: 0.957440
count: 300000000
min: 0
max: 300000000
Performance counter stats for './findminmax_stackoverflow faster _bestcase':
782,844,232 branches
49,768 branch-miss
3,734,103,167 instructions # 0.00 insns per cycle
1.035189822 seconds time elapsed
Worst case
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _worstcase
time: 7.860047
count: 300000000
min: 1
max: 299999999
Performance counter stats for './findminmax_stackoverflow trivial _worstcase':
1,490,947,270 branches
2,127,876 branch-miss
7,159,600,607 instructions # 0.00 insns per cycle
6.916856158 seconds time elapsed
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _worstcase
time: 7.616476
count: 300000000
min: 1
max: 299999999
Performance counter stats for './findminmax_stackoverflow faster _worstcase':
1,196,744,455 branches
2,024,250 branch-miss
6,594,182,002 instructions # 0.00 insns per cycle
6.675068846 seconds time elapsed
So, very interestingly, the random case is actually faster than the worst case, which doesn't show much of a difference. The only difference I can see is that my worst case contains "small" numbers (which can be encoded into 32 bits), while the random case contains true 64 bit numbers.
Random case with "small" integers
So, try with a bunch of "small" random numbers (still 64-bit encoded):
$ od -N 64 -t dL _randomsmall
0000000 1418331637 2076047555
0000020 22077878 1677970822
0000040 1845481621 609558726
0000060 1668260452 335112094
0000100
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _randomsmall
time: 7.682443
count: 300000000
min: 9
max: 2147483647
Performance counter stats for './findminmax_stackoverflow trivial _randomsmall':
1,481,062,942 branches
2,564,853 branch-miss
6,223,311,378 instructions # 0.00 insns per cycle
6.739897078 seconds time elapsed
$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _randomsmall
time: 7.772994
count: 300000000
min: 9
max: 2147483647
Performance counter stats for './findminmax_stackoverflow faster _randomsmall':
1,177,042,675 branches
77,686,346 branch-miss
5,607,194,799 instructions # 0.00 insns per cycle
6.834074812 seconds time elapsed
So, as I guessed, small numbers are actually slower to process than large numbers, even if they all contain 64-bit words. There are a lot more "small" numbered branch skips, for the reason that perhaps only processor designers could say :-).
Another noteworthy thing is that often the elapsed time measured by perf (1) is sometimes less than that measured by the program itself. I think this is because the program itself uses real time, whereas perf (1) uses process time (the time that the process actually runs). I tried to use getrusage (2), but the time I am getting here does not match (for example, I get 1.6s as user time and 1.4s as system time, whereas perf (1) measures 6.8s).
Conclusion
- There is not much practical difference between the two algorithms in terms of runtime, although the trivial one has far fewer cache misses (by an order of magnitude) than the "faster" one, but this seems to be balanced out by the increased instruction sum (10-20%);
- More specifically, this difference in cache misses can only be seen in a random case: the best and worst cases seem to degenerate in this regard, since they lead to the same number of cache misses for both algorithms; therefore, the "faster" algorithm is actually weaker than the trivial one in these cases; in the "general" random case, the trivial algorithm is slightly faster;
- Small integers generate many more cache misses on my processor;
- There is no real difference between worst case and random test case with small int.
So, if you've come this far, thanks :). Sorry, although all this experimentation has only led to a vague conclusion. I hope the enlightened readers will tell about it :).
- Jeremy
source to share