The speed of two algorithms for rotating a sequence (from the book "Programming Pearls")

Column 2 of the book "Programming Pearls" poses the problem of developing an algorithm that rotates a string k positions to the left. For example, given the string "12345" and k = 2, the result is "34512".

The first algorithm simulates the exchange process: put x[(i + k) % n] into x[i], and repeat until done.

The second algorithm uses the observation that we only need to swap a = "12" and b = "345", that is, the first k characters and the last n - k characters. We can first reverse a to a' = "21" and b to b' = "543", and then reverse (a'b') as a whole to obtain ba, which is the desired result.

Below is my code:

Algorithm 1:

#include <stdio.h>
#include <stdlib.h>

#define NEXT(j) ((j + k) % n)
#define PREV(j) ((j + n - k) % n)

int gcd(int a, int b) {
    return (a % b == 0 ? b : gcd(b, a % b));
}

/* Rotate a[0..n-1] left by k by following gcd(n, k) disjoint cycles. */
void solve(int *a, int n, int k) {
    int len = gcd(n, k);           /* number of independent cycles */
    for (int i = 0; i < len; i++) {
        int x = a[i];              /* save the first element of the cycle */
        int j = i;
        do {
            a[j] = a[NEXT(j)];     /* pull in the element k positions ahead */
            j = NEXT(j);
        } while (j != i);
        a[PREV(j)] = x;            /* close the cycle with the saved value */
    }
}

int main(int argc, char const *argv[])
{
    int n, k;
    scanf("%d %d", &n, &k);

    int *a = malloc(sizeof(int) * n);
    for (int i = 0; i < n; i++) a[i] = i;

    solve(a, n, k);

    free(a);

    return 0;
}


Algorithm 2:

#include <stdio.h>
#include <stdlib.h>

void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Reverse a[0..n-1] in place using n/2 swaps. */
void reverse(int *a, int n) {
    int m = n / 2;
    for (int i = 0; i < m; i++) {
        swap(a + i, a + (n - 1 - i));
    }
}

/* Rotate a[0..n-1] left by k via three reversals. */
void solve(int *a, int n, int k) {
    reverse(a, k);
    reverse(a + k, n - k);
    reverse(a, n);
}

int main(int argc, char const *argv[])
{
    int n, k;
    scanf("%d %d", &n, &k);

    int *a = malloc(sizeof(int) * n);
    for (int i = 0; i < n; i++) a[i] = i;

    solve(a, n, k);

    free(a);

    return 0;
}


where n is the string length and k is the rotation length.

I am using n = 232830359 and k = 80829 to test the two algorithms. Algorithm 1 takes 6.199 s, while Algorithm 2 takes 1.970 s.

However, I think both algorithms need to perform about n element operations. (For Algorithm 1 this is obvious; Algorithm 2 performs k/2 + (n - k)/2 + n/2 = n swaps.)

My question is, why are their speeds so different?



1 answer


Both of these algorithms are memory-bound rather than CPU-bound. That is why counting basic operations (such as swaps or loop iterations) gives results very different from the real running times. We will therefore use the external memory model instead of the RAM model; that is, we will count cache misses. Suppose N is the size of the array, M is the number of blocks in the cache, and B is the size of one block. Since N is large in your test, it is safe to assume that N > M (that is, the entire array cannot fit in the cache).

1) The first algorithm accesses the array elements in the order i, (i + k) mod N, (i + 2k) mod N, and so on. If k is large, two consecutively accessed elements are not in the same block, so in the worst case each pair of accesses causes two cache misses. Those two blocks are loaded into the cache, but they may then go unused for a long time, so by the time they are accessed again they may already have been evicted by other blocks (since the cache is smaller than the array), causing misses once more. It can be shown that in the worst case this algorithm incurs O(N) cache misses.

2) The second algorithm has a very different access pattern: l, r, l + 1, r - 1, .... If accessing the l-th element causes a miss, the entire block containing it is loaded into the cache, so the subsequent accesses to l + 1, l + 2, ... up to the end of that block do not miss. The same holds for r, r - 1, and so on. (This is true only if the blocks containing l and r can be stored in the cache at the same time, but that is a safe assumption, since caches are usually not direct-mapped.) So this algorithm has O(N / B) cache misses in the worst case.



Considering that a real cache block is larger than a single integer, it becomes clear why the second algorithm is significantly faster.

P.S. This is just a model of what is really going on, but in this particular case the external memory model explains the behavior better than the RAM model (and the RAM model is just a model too).
