Improving the performance of matrix multiplication

This is my code to speed up matrix multiplication, but it is only 5% faster than the naive version. What can I do to make it as fast as possible?

* Matrices are accessed, for example, as C[sub2ind(i, j, n)] for entry C[i][j].

/* Assumed helper, not shown in the question: row-major linear index. */
#define sub2ind(i, j, n) ((i) * (n) + (j))

void matrixMultFast(float * const C,             /* output matrix */
                    float const * const A,       /* first matrix */
                    float const * const B,       /* second matrix */
                    int const n,                 /* number of rows/cols */
                    int const ib,                /* size of i block (unused below) */
                    int const jb,                /* size of j block */
                    int const kb)                /* size of k block */
{
    int i, j, jj, k, kk;
    float sum;

    /* Zero the output matrix. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[sub2ind(i, j, n)] = 0;

    /* Blocked (tiled) multiplication over the k and j dimensions. */
    for (kk = 0; kk < n; kk += kb)
    {
        for (jj = 0; jj < n; jj += jb)
        {
            for (i = 0; i < n; i++)
            {
                for (j = jj; j < jj + jb; j++)
                {
                    sum = C[sub2ind(i, j, n)];
                    for (k = kk; k < kk + kb; k++)
                        sum += A[sub2ind(i, k, n)] * B[sub2ind(k, j, n)];
                    C[sub2ind(i, j, n)] = sum;
                }
            }
        }
    }
} /* end function 'matrixMultFast' */

* It is written in C and must compile as C99.



1 answer


There are many, many things you can do to make your matrix multiplication more efficient.

To see how to improve on the basic algorithm, let's first look at the theory. The naive implementation uses three nested loops and runs in O(n^3) time. There is also Strassen's algorithm, which performs 7 recursive multiplications instead of 8, giving the recurrence T(n) = 7T(n/2) + O(n^2) and therefore a complexity of O(n^log2(7)) ≈ O(n^2.807); in practice, however, its larger constant factors and poor cache behavior make it unattractive for typical matrix sizes.

That's the theory. Now let's look at how matrices are stored in memory. Row-major order is the standard in C, but you will also encounter column-major. Depending on the layout, transposing one operand before multiplying can improve speed by causing fewer cache misses. Matrix multiplication is, at bottom, a collection of vector dot products and additions: the same row of A is reused against many columns of B, so it makes sense to keep that vector in cache for fast access.
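To make this concrete, here is a minimal sketch of the transpose trick (the function name and the scratch-buffer argument are mine, not from the question): with B transposed up front, the inner loop walks both operands sequentially instead of striding down a column of B.

void matmul_transposed(float *C, const float *A, const float *B,
                       float *Bt, int n)      /* Bt: caller-provided n*n scratch */
{
    int i, j, k;

    /* Bt = B^T, so columns of B become rows of Bt. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            float sum = 0.0f;
            for (k = 0; k < n; k++)           /* both operands read sequentially */
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
}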

Now, with the arrival of multiple cores, concurrency, and cache hierarchies, the game changes. If we look a little closer, a dot product is nothing more than a set of multiplications followed by a sum, and different entries of C do not depend on one another. Hence they can be computed in parallel, and the data can be loaded in parallel as well.
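As a sketch of that idea (OpenMP is my choice here, not something the question prescribes; compile with -fopenmp), the rows of C are independent, so each thread can compute its own share:

void matmul_parallel(float *C, const float *A, const float *B, int n)
{
    int i;
    #pragma omp parallel for                  /* each row of C is independent */
    for (i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}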

Now let's make things a little more complicated. When talking about matrix multiplication, there is a difference between single- and double-precision floating point in terms of size: the former is usually 32 bits wide and the latter 64 (this depends on the platform). Each CPU has only a fixed number of registers of fixed width, so the wider your values, the fewer of them fit in the CPU at once. The moral of the story: stick with single precision unless you really need double.
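A trivial check of the size claim on your own machine (nothing more than that):

#include <stdio.h>

int main(void)
{
    /* Typically prints 4 and 8: a 128-bit SSE register therefore holds
       4 floats but only 2 doubles. */
    printf("sizeof(float)  = %zu\n", sizeof(float));
    printf("sizeof(double) = %zu\n", sizeof(double));
    return 0;
}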

Now that we've gone over the basics of setting up matrix multiplication, worry not: you don't need to implement any of the above yourself, because highly tuned routines for it already exist. As mentioned in the comments, there are GotoBLAS, OpenBLAS, Intel MKL, and Apple's Accelerate framework. MKL and Accelerate are proprietary, but OpenBLAS is a very competitive alternative.



Here's a nice little example that multiplies two 8192 x 8192 matrices in a few seconds on my Mac:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;

typedef float point_t;

/* Helper (not used in main below): transpose via Accelerate's vDSP. */
point_t* transpose(point_t* A) {
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));    
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);

    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    int step = (SIZE * SIZE / 4);

    /* Each call computes one quarter of the rows of C = A * B:
       SIZE/4 rows of A times all of B. */
    cblas_sgemm (CblasRowMajor,
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);

    cblas_sgemm (CblasRowMajor,
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);

    cblas_sgemm (CblasRowMajor,
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);

    cblas_sgemm (CblasRowMajor,
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);

    return C;
}

void print(point_t* A) {
    int i, j;
    for(i = 0; i < SIZE; i++) {
        for(j = 0; j < SIZE; j++) {
            printf("%f  ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for(; SIZE <= 8192; SIZE *= 2) {   /* runs once here; lower the initial SIZE to sweep sizes */
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));

        srand(getpid());

        int i, j;
        for(i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;

        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time = (t2.tv_sec - t1.tv_sec) * 1000.0;      // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0;   // us to ms

        printf("Time taken for %d size matrix multiplication: %lf seconds\n", SIZE, elapsed_time/1000.0);

        free(A);
        free(B);
        free(C);

    }
    return 0;
}
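
To try it yourself, link against the Accelerate framework when building on macOS; something along these lines should work (the file name is just an example):

cc -O2 matmul.c -framework Accelerate -o matmul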

At this point I should also mention SSE (Streaming SIMD Extensions), which you probably shouldn't attempt unless you have worked close to the assembly level. With SSE you vectorize your C code to operate on small vectors instead of single scalars, which means you work with blocks of data rather than individual values. When you write intrinsics, the compiler mostly steps aside and translates your code as it stands, without applying its own vectorization. Done right, this can speed up your code like nothing else, though note that SIMD buys you a constant factor, not a better asymptotic bound; O(n^2) remains the theoretical floor, since every input entry must be read at least once. But SSE is easy to misuse, and unfortunately most people who misuse it end up with code that is slower than before.
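For a taste of what intrinsics look like, here is a minimal SSE sketch (my own illustration, not from the answer) of a single dot product; it assumes n is a multiple of 4 and leaves tail handling out:

#include <xmmintrin.h>   /* SSE intrinsics */

float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  /* 4 products per step */
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);                        /* reduce the 4 partial sums */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}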

I hope this encourages you to dig deeper. The world of matrix multiplication is big and exciting. Below are links for further reading.

  1. OpenBLAS
  2. More about SSE
  3. Intel Intrinsics
