Find both min and max together: the algorithm should be faster, but isn't

I am trying to implement an algorithm that finds the minimum and maximum values among a large number of longs stored in a file. My test file contains one billion longs.

The algorithm works as expected, but it doesn't run any faster than the naive version. It should be significantly faster: the naive version does roughly 2n comparisons (two per element), whereas this version does only 3n/2 (three per pair of elements: one to order the pair, then one against the current min and one against the current max).

$ time ./findminmax_naive somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m24.156s
user    0m4.956s
sys     0m3.896s

$ time ./findminmax_faster somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m25.048s
user    0m6.948s
sys     0m3.980s


Here's the "naive" version:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL) {
                perror("fopen");
                exit(1);
        }
        readcount = 1024;
        if (ac == 3)
                readcount = atol(av[2]);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i++) {
                        count++;
                        if (n[i] < min)
                            min = n[i];
                        if (n[i] > max)
                            max = n[i];
                }
                if (feof(f))
                        break;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}


And here is the code for the "faster" version:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL) {
                perror("fopen");
                exit(1);
        }
        readcount = 1024;
        if (ac == 3)
                readcount = atol(av[2]);
        readcount = (readcount + 1) & ~1L;      /* round up to an even count */
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen - 1; i += 2) {
                        count += 2;
                        if (n[i] < n[i + 1]) {
                                if (n[i] < min)
                                        min = n[i];
                                if (n[i + 1] > max)
                                        max = n[i + 1];
                        } else {
                                if (n[i + 1] < min)
                                        min = n[i + 1];
                                if (n[i] > max)
                                        max = n[i];
                        }
                }
                if (feof(f))
                        break;
        }
        if (rlen % 2) {
                if (n[rlen - 1] < min)
                        min = n[rlen - 1];
                if (n[rlen - 1] > max)
                        max = n[rlen - 1];
                count++;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}


Do you know what I missed?

Thank you for your help.

- Jeremy

+1




4 answers


The key is branch prediction. Unless the file is sorted in a pathologically bad order, the naive version performs 2n branches that are predicted correctly almost every time. Your clever version performs n/2 branches (the pairwise comparison) that are almost never predicted correctly, plus an additional n comparisons against min/max that are well predicted.

How much a mispredicted branch costs depends heavily on the processor architecture and even the particular processor model, but at the very least I'd expect it to cost several times as much as a correctly predicted one. In the extreme case, correctly predicted branches can have an effective cost of zero cycles.

As an interesting example, I recently experimented with strlen optimizations and found that an extremely naive unrolled strlen (comparing and branching one byte at a time) was faster than clever vectorized approaches. This is almost certainly because strlen has the special property that every branch except the last one will always be correctly predicted.



By the way, to test my hypothesis, try this input pattern:

999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...

This will give worst-case prediction for the naive algorithm and best-case prediction for the outer conditional in your clever version.
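
For convenience, here is a quick generator sketch for that pattern (my code, assuming native byte order and 8-byte longs, matching what your programs fread):

#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        long k, pairs = 500000000;      /* 1e9 longs total; adjust to taste */
        long base = 1000000000;
        FILE *f;

        f = fopen(ac > 1 ? av[1] : "pattern", "w");
        if (f == NULL) {
                perror("fopen");
                exit(1);
        }
        for (k = 0; k < pairs; k++) {
                long lo = base - 1 - k; /* 999999999, 999999998, ... */
                long hi = base + 1 + k; /* 1000000001, 1000000002, ... */
                fwrite(&lo, sizeof lo, 1, f);
                fwrite(&hi, sizeof hi, 1, f);
        }
        fclose(f);
        return 0;
}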

+4




As @chr said, "the file I/O will overshadow any optimization in the algorithm itself."

Also, fewer comparisons do not mean less time spent. Both algorithms have O(n) time complexity, and big-O ignores the actual cost of each operation and the constant overhead.

Consider rough skeletons of the two algorithms: the actual time consumed is the sum of the costs of all the operations your program executes.

For example:

//max and min initialized as 0.
//c1, ... represents the time cost of each instruction.
while (i < count) { //c1
    if (a[i] > max) //c2
        max = a[i]; //c3
    i++;    //c4
}
//the search for min looks the same


Time cost:

T1 = 2n*c1 + 2n*c2 + x*c3 + y*c3 + 2n*c4 = 2n*(c1 + c2 + c4) + (x + y)*c3

where x and y depend on the ordering of your data.



And for the (3/2)n-comparison version:

while (i < count) { //c1
    if (a[i] < a[i+1]) { //c5
        if (a[i] < min) //c2
            min = a[i]; //c3
        if (a[i+1] > max) //c2
            max = a[i+1]; //c3
    } else {
        //same as above, with i and i+1 swapped
    }
    i += 2; //c6
}


Time cost:

T2 = n*c1 + n*c5 + 2n*c2 + (x' + y')*c3 + n*c6 = n*(c1 + c5 + c6) + 2n*c2 + (x' + y')*c3

If max and min are the first two items of your data, then x = x' = 1 and y = y' = 1.

T1 - T2 = n*c1 + 2n*c4 - n*c5 - n*c6, so depending on what the compiler generates for each instruction, T1 - T2 can come out differently.
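
As a rough illustration (my own numbers: suppose every instruction costs one cycle, so all the cN are 1, and x = y = x' = y' = 1), T1 comes out to about 6n and T2 to about 5n. So even though the second version does 25% fewer comparisons, the total cost under these assumptions drops by only about 17%.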

What makes this more complicated is that x, y, x' and y' are really variables, but I won't pursue that further. My goal here is just to show what the real running time consists of.

If you are still confused after reading the above, you should read Chapter 2.2 of "Introduction to Algorithms".

+1




I can think of two things:

  • It's playing hell with the branch predictor.
  • The naive version is being auto-vectorized by the compiler, but the clever version is not (see below for one way to check).
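
One way to test the vectorization hypothesis, assuming GCC 4.9 or newer (this flag is my suggestion, not something from the original post), is to ask the compiler to report which loops it vectorized:

$ gcc -O3 -fopt-info-vec-optimized -o findminmax_naive findminmax_naive.c

If the naive loop is reported as vectorized and the clever one is not, that would explain part of the gap.
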
0




First of all, excuse me for responding to everything in a single answer. I know I shouldn't do this on stackoverflow.com, but given that the different topics are more or less intertwined, it's easier this way.

Introduction

So here's the code I'm currently using to test the algorithms. Differences from the previous versions:

  • it includes both versions, selected with a command-line argument;
  • @chr, @Dukeling: it mmap(2)s the file, to rule out system call and library overhead;
  • @chr, @Dukeling: it has a "warm-up" option that faults all the pages into memory before running the selected algorithm;
  • @Dukeling: it measures the time taken by the algorithm alone, using gettimeofday(2).

Here is the code:

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ROUNDUP(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))      /* round x up to a multiple of y */
#define USCARRY(x)      ((x) < 0 ? (x) + 1000000 : (x))        /* carry for negative usec deltas */

int
main(int ac, char *av[])
{
        struct stat st;
        struct timeval start, end;
        long count, min, max, pgsiz;
        long *n, *p, *endp;
        int fd, warmup;

        if (ac != 3 && ac != 4) {
                fprintf(stderr, "Usage: %s <\"trivial\"|\"faster\"> "
                    "<file> [warmup]\n", av[0]);
                exit(1);
        }

        fd = open(av[2], O_RDONLY);
        if (fd == -1) {
                perror("open");
                exit(1);
        }
        fstat(fd, &st);
        pgsiz = sysconf(_SC_PAGESIZE);
        n = mmap(NULL, ROUNDUP(st.st_size, pgsiz), PROT_READ,
            MAP_SHARED, fd, 0);
        if (n == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        warmup = 0;
        if (ac == 4)
                warmup = atoi(av[3]);
        count = st.st_size / sizeof (*n);
        endp = &n[count - 1];
        // optionally warm up the page cache by touching every page first
        if (warmup) {
                for (p = n; p <= endp; p++) {
                        // the fwrite forces each page in; redirect stdout
                        // to /dev/null when using the warm-up option
                        fwrite(p, sizeof (*p), 1, stdout);
                        min = *p;
                }
        }
        // start algorithm
        gettimeofday(&start, NULL);
        if (!strcmp(av[1], "trivial")) {
                min = max = n[0];
                p = &n[1];
                while (p <= endp) {                     // c1 * n
                        if (p[0] < min)                 // c2 * n
                                min = p[0];             // c3 * x
                        if (p[0] > max)                 // c2 * n
                                max = p[0];             // c3 * y
                        p++;                            // c4 * n
                }                       
        } else if (!strcmp(av[1], "faster")) {
                min = max = n[0];
                p = &n[1];
                while (p < endp) {                      // c1 * n/2
                        if (p[0] < p[1]) {              // c2 * n/2
                                if (p[0] < min)         // c2 * n/4
                                        min = p[0];     // c3 * x/2
                                if (p[1] > max)         // c2 * n/4
                                        max = p[1];     // c3 * y/2
                        } else {
                                if (p[1] < min)         // c2 * n/4
                                        min = p[1];     // c3 * x/2
                                if (p[0] > max)         // c2 * n/4
                                        max = p[0];     // c3 * y/2
                        }
                        p += 2;                         // c5 * n/2
                }                                       
                if (p == endp) {
                        if (*endp < min)
                                min = *endp;
                        else if (*endp > max)
                                max = *endp;
                }
        } else {
                printf("sorry\n");
                exit(1);
        }
        gettimeofday(&end, NULL);
        printf("time: %ld.%06ld\n",
            end.tv_sec - start.tv_sec - (end.tv_usec < start.tv_usec),
            USCARRY(end.tv_usec - start.tv_usec));
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}


Test cases

Here are the files I'm using for the test cases:

$ ls -l _*
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:37 _bestcase
-rw-r--r-- 1 jlh jlh 2400000000 May 27 08:40 _random
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:38 _worstcase

$ od -N 64 -t dL _bestcase 
0000000                    0            300000000
0000020                    1            299999999
0000040                    2            299999998
0000060                    3            299999997
0000100

$ od -N 64 -t dL _random
0000000  8600270969454287374  8436406000964321037
0000020  7348498334162074664  2050332610730417325
0000040  8183771519386464726  4134520779774161130
0000060  2045475555785071180  2859007060406233926
0000100

$ od -N 64 -t dL _worstcase 
0000000            150000000            150000000
0000020            149999999            150000001
0000040            149999998            150000002
0000060            149999997            150000003
0000100


I/O penalty

OK, first let's warm the cache and make sure there are no page faults left that could skew the results:

$ ./findminmax trivial _random 
time: 3.543573
count: 300000000
min: 31499144704
max: 9223372004409096866

$ ./findminmax trivial _random 
time: 1.466323
count: 300000000
min: 31499144704
max: 9223372004409096866

$ perf stat -e  minor-faults,major-faults ./findminmax trivial _random 
time: 1.284729
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax trivial _random':

           586,066 minor-faults                                                
                 0 major-faults                                                

       1.350118552 seconds time elapsed


So you can see that there were no major page faults. From here on, we can take it that I/O has no impact.

Instruction counts and branch misses

@H2CO3, @vvy, you are absolutely right that the other instructions count as well (I will assume here that each operation takes the same number of CPU cycles, and will leave CPU cache misses aside). I don't know why the literature I've read so far about algorithms focuses only on the number of comparisons (well, I admit I haven't read that much ;)).

I have annotated each step in the loops with my own average-case cost estimate (I think the worst case is not interesting here), and it differs slightly from yours.

If I'm not mistaken:

  • For the trivial version: n * (c1 + 2*c2 + c4) + (x + y) * c3
  • For the faster version: n/2 * (c1 + 3*c2 + c5) + (x + y) * c3

Now, as far as I understand, it is difficult to go further and speculate on how many CPU cycles each cN takes, since that varies from one CPU to another.

Let's check how many instructions, branches, and branch misses actually happened on my machine, which is mostly idle, while each algorithm ran on each warm-cache test case (note that I ran each case 3 times to verify there was no significant deviation):

Random case

$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow trivial _random
time: 1.547087
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow trivial _random':

     1,083,101,126 branches                                                    
            52,388 branch-misses

     4,335,175,257 instructions              #    0.00  insns per cycle        

       1.623851849 seconds time elapsed


$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow faster _random
time: 2.748967
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow faster _random':

       783,120,927 branches                                                    
        75,063,008 branch-misses

     3,735,286,264 instructions              #    0.00  insns per cycle        

       1.824884443 seconds time elapsed


Note that the faster version executes fewer instructions, but nevertheless takes much longer, most likely because of its number of branch misses, which is orders of magnitude higher!
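
As a back-of-the-envelope check (my assumptions: roughly 15 cycles per mispredicted branch and a clock in the 2-3 GHz range), the extra ~75 million branch misses would account for about 75M * 15 = 1.1 billion cycles, i.e. somewhere around half a second, which is the right order of magnitude for the gap between the two versions.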

Best case

$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow trivial _bestcase 
time: 1.267697
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow trivial _bestcase':

     1,082,801,759 branches                                                    
            49,802 branch-misses

     4,334,200,448 instructions              #    0.00  insns per cycle        

       1.343425753 seconds time elapsed


$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow faster _bestcase 
time: 0.957440
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow faster _bestcase':

       782,844,232 branches                                                    
            49,768 branch-misses

     3,734,103,167 instructions              #    0.00  insns per cycle        

       1.035189822 seconds time elapsed


Worst case

$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow trivial _worstcase 
time: 7.860047
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow trivial _worstcase':

     1,490,947,270 branches                                                    
         2,127,876 branch-misses

     7,159,600,607 instructions              #    0.00  insns per cycle        

       6.916856158 seconds time elapsed


$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow faster _worstcase 
time: 7.616476
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow faster _worstcase':

     1,196,744,455 branches                                                    
         2,024,250 branch-misses

     6,594,182,002 instructions              #    0.00  insns per cycle        

       6.675068846 seconds time elapsed


Very interestingly, the random case is actually faster than the worst case, and in the worst case the two algorithms barely differ. The only difference I can see is that my worst case contains "small" numbers (ones that fit in 32 bits), while the random case contains true 64-bit numbers.

Random case with "small" integers

So let's try a bunch of "small" random numbers (still stored as 64-bit words):

$ od -N 64 -t dL _randomsmall 
0000000           1418331637           2076047555
0000020             22077878           1677970822
0000040           1845481621            609558726
0000060           1668260452            335112094
0000100

$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow trivial _randomsmall 
time: 7.682443
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow trivial _randomsmall':

     1,481,062,942 branches                                                    
         2,564,853 branch-misses

     6,223,311,378 instructions              #    0.00  insns per cycle        

       6.739897078 seconds time elapsed


$ perf stat -e branches,branch-misses,instructions ./findminmax_stackoverflow faster _randomsmall 
time: 7.772994
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow faster _randomsmall':

     1,177,042,675 branches                                                    
        77,686,346 branch-misses

     5,607,194,799 instructions              #    0.00  insns per cycle        

       6.834074812 seconds time elapsed


So, as I guessed, small numbers are actually slower to process than large ones, even though they are all stored in 64-bit words. There are far more branch misses with the "small" numbers, for a reason that perhaps only processor designers could explain :-).

Another noteworthy thing is that the elapsed time measured by perf(1) is sometimes less than the one measured by the program itself. I think this is because the program itself measures real (wall-clock) time, whereas perf(1) measures the process's running time (the time the process is actually scheduled). I tried to use getrusage(2), but the times I get there don't add up (for example, I get 1.6s as user time and 1.4s as system time, whereas perf(1) measures 6.8s).
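
As an aside, here is a sketch of how the wall-clock measurement could be taken with clock_gettime(2) and CLOCK_MONOTONIC instead, which avoids the microsecond-carry fiddling of gettimeofday(2) (a suggestion only, not what the program above uses; on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

int
main(void)
{
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* ... run the selected algorithm here ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        // tv_nsec may go "backwards"; the double arithmetic handles the carry
        printf("time: %.6f\n", (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}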

Conclusion

  • There is not much practical difference between the two algorithms in terms of running time, although the trivial one has far fewer branch misses (by orders of magnitude) than the "faster" one; this seems to be balanced out by its higher instruction count (10-20% more);
  • More specifically, that difference in branch misses only shows up in the random case: the best and worst cases seem to degenerate in this respect, since they lead to roughly the same number of branch misses for both algorithms; so the "faster" algorithm is actually no better than the trivial one in those cases, and in the "general" random case the trivial algorithm is slightly faster;
  • Small integers generate many more branch misses on my processor;
  • There is no real difference between the worst case and the random test case with small ints.

So, if you've read this far, thanks :). I'm sorry that all this experimentation has led only to a vague conclusion. I hope enlightened readers will shed some light on it :).

- Jeremy

0








