Why is the call to array_view::synchronize() so slow?

I started experimenting with C++ AMP. I created a simple test application to see what it can do, but the results are pretty surprising to me. Consider the following code:

#include <amp.h>
#include <cstdint>   // uint32_t
#include <cstdio>    // printf
#include <new>       // std::nothrow
#include "Timer.h"

using namespace concurrency;

int main( int argc, char* argv[] )
{
    uint32_t u32Threads = 16;
    uint32_t u32DataRank = u32Threads * 256;
    uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;
    uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];

    for ( uint32_t i = 0; i < u32DataRank * u32DataRank; i++ )
    {
        pu32Data[i] = 1;
    }

    uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];

    Timer tmr;

    tmr.Start();

    array< uint32_t, 1 > source( u32DataRank * u32DataRank, pu32Data ); 
    array_view< uint32_t, 1 > sum( u32Threads, pu32Sum );

    printf( "Array<> deep copy time: %.6f\n", tmr.Stop() );

    tmr.Start();

    parallel_for_each( 
        sum.extent,
        [=, &source](index<1> idx) restrict(amp)
        {
            uint32_t u32Sum = 0;
            uint32_t u32Start = idx[0] * u32DataSize;
            uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
            for ( uint32_t i = u32Start; i < u32End; i++ )
            {
                u32Sum += source[i];
            }
            sum[idx] = u32Sum;
        }
    );

    double dDuration = tmr.Stop();
    printf( "gpu computation time: %.6f\n", dDuration );

    tmr.Start();

    sum.synchronize();

    dDuration = tmr.Stop();
    printf( "synchronize time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    tmr.Start();

    for ( uint32_t idx = 0; idx < u32Threads; idx++ )
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = 0; i < u32DataSize; i++ )
        {
            u32Sum += pu32Data[(idx * u32DataSize) + i];
        }
        pu32Sum[idx] = u32Sum;
    }

    dDuration = tmr.Stop();
    printf( "cpu computation time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    delete [] pu32Sum;
    delete [] pu32Data;

    return 0;
}


Note that Timer is just a simple timing class that uses QueryPerformanceCounter (a rough sketch of what it might look like follows the output below). In any case, the output of the code is as follows:

Array<> deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
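
For illustration, here is a minimal sketch of what such a QueryPerformanceCounter-based timer might look like. This is only a guess; the actual Timer.h is not shown in the question, so the class layout and member names are assumptions:

#include <windows.h>

// Hypothetical reconstruction of Timer.h: a thin wrapper around
// QueryPerformanceCounter that reports elapsed time in seconds.
class Timer
{
public:
    Timer()
    {
        QueryPerformanceFrequency( &m_liFrequency );
    }

    void Start()
    {
        QueryPerformanceCounter( &m_liStart );
    }

    // Returns the number of seconds elapsed since the last Start().
    double Stop()
    {
        LARGE_INTEGER liEnd;
        QueryPerformanceCounter( &liEnd );
        return static_cast<double>( liEnd.QuadPart - m_liStart.QuadPart ) /
               static_cast<double>( m_liFrequency.QuadPart );
    }

private:
    LARGE_INTEGER m_liFrequency;
    LARGE_INTEGER m_liStart;
};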


Why is the call to synchronize() taking so long? Is there a way to get around this? The computational performance itself is amazing, but the overhead of synchronize() makes this unusable for me.

It is also possible that I am doing something terribly wrong; if so, please tell me. Thanks in advance.



2 answers


The synchronize() call is probably taking that long because it waits for the actual kernel to complete its work.

From the comments on parallel_for_each in amp.h:

Note that parallel_for_each executes as if synchronous to the calling code, but in reality it is asynchronous. That is, once the parallel_for_each call is made and the kernel has been passed to the runtime, the code after parallel_for_each continues to execute immediately on the CPU thread, while the kernel is executed in parallel by the GPU threads.

So measuring only the time spent in the parallel_for_each call itself is not meaningful.
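
For example, the timing could be restructured roughly as follows. This is only a sketch, not the question's code: the helper function name and the use of the default accelerator_view are my assumptions.

// Sketch only: time the kernel by waiting on the accelerator_view,
// then time the copy back to the host separately via synchronize().
// Assumes the same source/sum/u32DataSize setup as in the question.
void TimeKernelAndCopy( array<uint32_t, 1>& source,
                        array_view<uint32_t, 1> sum,
                        uint32_t u32DataSize )
{
    accelerator_view av = accelerator().default_view;
    Timer tmr;

    tmr.Start();
    parallel_for_each( av, sum.extent, [=, &source](index<1> idx) restrict(amp)
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = idx[0] * u32DataSize; i < (idx[0] + 1) * u32DataSize; i++ )
        {
            u32Sum += source[i];
        }
        sum[idx] = u32Sum;
    } );
    av.wait();   // block until the kernel has actually finished on the GPU
    printf( "kernel time: %.6f\n", tmr.Stop() );

    tmr.Start();
    sum.synchronize();   // now this measures only the device-to-host copy
    printf( "copy-back time: %.6f\n", tmr.Stop() );
}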



EDIT: The way the algorithm is written, it won't get much benefit from GPU acceleration. The read of source[i] is not coalesced, and will therefore be almost 16 times slower than a coalesced read. It is possible to coalesce the reads using shared memory, but this is not entirely trivial. I would recommend reading up on GPU programming.
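
As a rough illustration of what a coalesced, tiled version could look like, here is a sketch that uses tile_static memory with one 256-thread tile per output element. The tile size, indexing scheme, and variable names are my own assumptions, not code from the question; it is meant to slot in where the original parallel_for_each was, reusing source, sum, u32Threads, and u32DataSize.

static const int TILE_SIZE = 256;

parallel_for_each(
    extent<1>( u32Threads * TILE_SIZE ).tile<TILE_SIZE>(),
    [=, &source](tiled_index<TILE_SIZE> tidx) restrict(amp)
    {
        tile_static uint32_t partial[TILE_SIZE];

        uint32_t u32Row   = tidx.tile[0];
        uint32_t u32Lane  = tidx.local[0];
        uint32_t u32Start = u32Row * u32DataSize;

        // Strided loop: at each step the 256 threads of the tile read
        // 256 adjacent elements of source, so the reads are coalesced.
        uint32_t u32Sum = 0;
        for ( uint32_t i = u32Lane; i < u32DataSize; i += TILE_SIZE )
        {
            u32Sum += source[u32Start + i];
        }
        partial[u32Lane] = u32Sum;
        tidx.barrier.wait();

        // Tree reduction of the per-thread partial sums within the tile.
        for ( uint32_t stride = TILE_SIZE / 2; stride > 0; stride /= 2 )
        {
            if ( u32Lane < stride )
            {
                partial[u32Lane] += partial[u32Lane + stride];
            }
            tidx.barrier.wait();
        }

        if ( u32Lane == 0 )
        {
            sum[u32Row] = partial[0];
        }
    }
);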

If you just want a simple example to demonstrate the usefulness of C++ AMP, try matrix multiplication.

Of course, the performance you see will also depend heavily on your GPU hardware.



In addition to Igor's answer about your specific algorithm, please note that there are several things wrong with how you are measuring C++ AMP performance in general (not excluding runtime initialization, not discarding the initial JIT compilation, not warming up the data, plus the already mentioned assumption that parallel_for_each is synchronous), so please follow our guidelines here:



http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx
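
In the spirit of those guidelines, a hedged sketch of the usual measurement pattern looks roughly like this; RunKernel is a hypothetical helper wrapping the parallel_for_each from the question, and the warm-up/averaging structure is my assumption about how to apply the guidelines, not code from the linked post:

// Run the kernel once untimed to absorb runtime initialization, JIT
// compilation, and the initial data upload, then time subsequent runs
// around an explicit wait on the accelerator_view.
accelerator_view av = accelerator().default_view;

RunKernel( av );   // hypothetical helper wrapping the parallel_for_each
av.wait();         // warm-up run: not included in the measurement

Timer tmr;
tmr.Start();
for ( int run = 0; run < 10; run++ )
{
    RunKernel( av );
}
av.wait();
printf( "average kernel time: %.6f\n", tmr.Stop() / 10.0 );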
