Evaluating GPU efficiency in FLOPS (CUDA samples)

It seems I don't quite understand the concept of FLOPS. The CUDA samples include a matrix multiplication example (0_Simple/matrixMul). In this example, the number of FLOPs (floating point operations) per matrix multiplication is calculated with the formula:

 double flopsPerMatrixMul = 2.0 * (double)dimsA.x * (double)dimsA.y * (double)dimsB.x;


So this means that to multiply a matrix A (n x m) by a matrix B (m x k), we need to perform 2*n*m*k floating point operations.

However, to calculate one element of the resulting matrix C (n x k), you need to perform m multiplications and (m-1) additions. Thus, the total number of operations (to calculate all n x k elements) is m*n*k multiplications and (m-1)*n*k additions.

Of course, we could also round the number of additions up to m*n*k, so that the total number of operations becomes 2*n*m*k, half of them multiplications and half additions.
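
To make the counting concrete, here is a minimal host-side sketch (a hypothetical helper, not code from the SDK sample) that prints the exact and the rounded counts for some example dimensions:

    #include <cstdio>

    int main()
    {
        // Example dimensions (hypothetical; the SDK sample's defaults differ).
        long long n = 320, m = 320, k = 640;

        long long muls  = n * m * k;          // exact number of multiplications
        long long adds  = n * (m - 1) * k;    // exact number of additions
        long long total = 2LL * n * m * k;    // the rounded count the sample uses

        printf("multiplications: %lld\n", muls);
        printf("additions:       %lld\n", adds);
        printf("exact total:     %lld\n", muls + adds);
        printf("rounded total:   %lld\n", total);
        return 0;
    }

For large m the difference between the exact total and 2*n*m*k is negligible, which is why the sample uses the rounded formula.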

But I think multiplication is more computationally expensive than addition. Why are these two types of operations lumped together? Is this always done in computing? How can one account for two different types of operations?

Sorry for my English)



1 answer


The short answer is yes, they count both the multiplications and the additions. Even though most floating point processors have a fused multiply-add instruction, the multiply and the add are still counted as two separate floating point operations.
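
To see where those two operations per loop iteration come from, here is a minimal naive CUDA kernel sketch (not the tiled kernel the SDK sample actually uses; the name and layout are illustrative):

    // Each thread computes one element of C = A * B,
    // where A is n x m and B is m x k, both row-major.
    __global__ void matMulNaive(const float *A, const float *B, float *C,
                                int n, int m, int k)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < k) {
            float acc = 0.0f;
            for (int i = 0; i < m; ++i) {
                // One multiply and one add per iteration: 2 FLOPs.
                // The compiler usually fuses these into a single FMA
                // instruction, but the FLOP count still says 2.
                acc += A[row * m + i] * B[i * k + col];
            }
            C[row * k + col] = acc;
        }
    }

So even when the hardware retires a single FMA per iteration, the conventional count credits it with two floating point operations.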

This is part of why people have been complaining for decades that FLOPS are a nearly meaningless measurement. For the number to mean anything at all, you almost have to specify the particular body of code for which you are measuring FLOPS (e.g., "Linpack gigaflops"). Even then, you sometimes need fairly tight control over things like compiler optimizations, to ensure that what you are measuring is really the speed of the machine and not the compiler's ability to simply eliminate some of the operations.
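
For reference, this is roughly how the matrixMul sample turns a timed run into a GFLOP/s figure. This is a hedged sketch: msecPerMatrixMul is assumed to be the average kernel time in milliseconds obtained from cudaEvent timing, and the exact variable names may differ between SDK versions:

    #include <cstdio>

    // Reports performance given matrix dimensions and average kernel time.
    double reportGigaFlops(double m, double n, double k, double msecPerMatrixMul)
    {
        // The rounded 2*n*m*k operation count discussed above.
        double flopsPerMatrixMul = 2.0 * m * n * k;
        double gigaFlops =
            (flopsPerMatrixMul * 1.0e-9) / (msecPerMatrixMul / 1000.0);
        printf("Performance = %.2f GFlop/s\n", gigaFlops);
        return gigaFlops;
    }

Note that the resulting number depends entirely on the agreed-upon operation count in the numerator, which is exactly why the counting convention matters.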



Ultimately, this is among the factors that led to the formation of organizations that define benchmarks and rules for how those benchmarks must be run and their results reported (e.g., SPEC). Otherwise it can be difficult to be certain that the results you see for two different processors are actually comparable in any meaningful way. Even with such rules, comparisons can be difficult; without them, they can become meaningless.
