Why is the call to array_view::synchronize() so slow?
I have started experimenting with C++ AMP. I created a simple test application to see what it can do, but the results are quite surprising to me. Consider the following code:
#include <amp.h>
#include <cstdint>   // uint32_t
#include <cstdio>    // printf
#include <new>       // std::nothrow
#include "Timer.h"

using namespace concurrency;

int main( int argc, char* argv[] )
{
    uint32_t u32Threads = 16;
    uint32_t u32DataRank = u32Threads * 256;
    uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;

    // Input data: u32DataRank * u32DataRank elements, all set to 1.
    uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];
    for ( uint32_t i = 0; i < u32DataRank * u32DataRank; i++ )
    {
        pu32Data[i] = 1;
    }
    uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];

    Timer tmr;
    tmr.Start();
    array< uint32_t, 1 > source( u32DataRank * u32DataRank, pu32Data );
    array_view< uint32_t, 1 > sum( u32Threads, pu32Sum );
    printf( "Array<> deep copy time: %.6f\n", tmr.Stop() );

    // GPU version: each of the u32Threads threads sums one contiguous chunk.
    tmr.Start();
    parallel_for_each(
        sum.extent,
        [=, &source](index<1> idx) restrict(amp)
        {
            uint32_t u32Sum = 0;
            uint32_t u32Start = idx[0] * u32DataSize;
            uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
            for ( uint32_t i = u32Start; i < u32End; i++ )
            {
                u32Sum += source[i];
            }
            sum[idx] = u32Sum;
        }
    );
    double dDuration = tmr.Stop();
    printf( "gpu computation time: %.6f\n", dDuration );

    tmr.Start();
    sum.synchronize();
    dDuration = tmr.Stop();
    printf( "synchronize time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    // CPU version of the same computation, for comparison.
    tmr.Start();
    for ( uint32_t idx = 0; idx < u32Threads; idx++ )
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = 0; i < u32DataSize; i++ )
        {
            u32Sum += pu32Data[(idx * u32DataSize) + i];
        }
        pu32Sum[idx] = u32Sum;
    }
    dDuration = tmr.Stop();
    printf( "cpu computation time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    delete [] pu32Sum;
    delete [] pu32Data;
    return 0;
}
Note that Timer is a simple timing class that uses QueryPerformanceCounter. In any case, the output of the code is as follows:
Array<> deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
Why is the call to synchronize() taking so long? Is there a way to get around this? The computation performance looks amazing, but the overhead of synchronize() makes this unusable for me.
It is also possible that I am doing something terribly wrong. If so, please tell me. Thanks in advance.
The synchronize() call is probably taking that long because it waits for the actual kernel to finish its work.
From the description of parallel_for_each in amp.h:
Note that parallel_for_each executes as if it were synchronous with the calling code, but in reality it is asynchronous. That is, once the parallel_for_each call is made and the kernel is passed to the runtime, [the code after parallel_for_each] continues to be executed immediately by the CPU thread, while the kernel is executed in parallel by the GPU threads.
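So measuring the time spent in parallel_for_each alone is not really meaningful: it covers only the launch of the kernel, while the kernel's actual run time is counted in the synchronize() call that waits for it.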
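The same effect can be reproduced without a GPU. The following is only an analogy in standard C++, not C++ AMP code: std::async stands in for parallel_for_each, future::get() stands in for synchronize(), and a sleep stands in for the kernel's run time. The names AsyncTiming and measure_async_sum are made up for this illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper type: timings of the two phases plus the computed sum.
struct AsyncTiming {
    double launch_seconds;  // time until the launch call returned
    double wait_seconds;    // time spent blocking on the result
    std::uint64_t result;   // sum computed by the background task
};

// Launches a summation on another thread and then waits for it, timing the
// launch and the wait separately. The sleep is an assumed stand-in for the
// time a real kernel would need.
inline AsyncTiming measure_async_sum(const std::vector<std::uint32_t>& data,
                                     std::chrono::milliseconds simulated_work)
{
    assert(simulated_work.count() >= 0);
    using clock = std::chrono::steady_clock;

    const auto t0 = clock::now();
    std::future<std::uint64_t> pending =
        std::async(std::launch::async, [&data, simulated_work] {
            std::this_thread::sleep_for(simulated_work);
            return std::accumulate(data.begin(), data.end(), std::uint64_t{0});
        });
    const auto t1 = clock::now();  // the launch has returned; the work is still running

    const std::uint64_t result = pending.get();  // blocks until the work is done
    const auto t2 = clock::now();

    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t1).count(),
             result };
}
```

With a simulated work time of, say, 50 ms, the launch phase would typically be reported as very short and nearly all of the time would show up in the wait phase, which mirrors the "gpu computation time" and "synchronize time" lines in the question.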
EDIT: The way the algorithm is written, it will not benefit much from GPU acceleration. The reads of source[i] are not coalesced and will therefore be almost 16 times slower than coalesced reads. It is possible to coalesce the reads by using shared memory, but this is not entirely trivial. I would recommend reading up on GPU programming.
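To make the access pattern concrete, here is a small CPU-side sketch of the index arithmetic only. It does not run on a GPU and measures nothing; the helper names are hypothetical. In the question's layout each thread owns one contiguous chunk, so at any given loop step neighbouring threads read elements a whole chunk apart (1048576 elements in the question). In an interleaved layout, neighbouring threads read neighbouring elements at each step, which is the pattern that GPUs can usually coalesce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index read by `thread` at loop step `step` when every thread owns one
// contiguous chunk (the layout used in the question).
inline std::size_t chunked_index(std::size_t thread, std::size_t step,
                                 std::size_t chunk_size)
{
    return thread * chunk_size + step;
}

// Index read by `thread` at loop step `step` in an interleaved layout:
// at a given step, thread t and thread t + 1 read adjacent elements.
inline std::size_t interleaved_index(std::size_t thread, std::size_t step,
                                     std::size_t thread_count)
{
    return step * thread_count + thread;
}

// Per-thread partial sums using the interleaved layout, simulated
// sequentially on the CPU. Assumes data.size() is a multiple of thread_count.
inline std::vector<std::uint64_t>
interleaved_partial_sums(const std::vector<std::uint32_t>& data,
                         std::size_t thread_count)
{
    assert(thread_count > 0 && data.size() % thread_count == 0);
    const std::size_t steps = data.size() / thread_count;
    std::vector<std::uint64_t> partial(thread_count, 0);
    for (std::size_t step = 0; step < steps; ++step)
        for (std::size_t thread = 0; thread < thread_count; ++thread)
            partial[thread] += data[interleaved_index(thread, step, thread_count)];
    return partial;
}
```

Note that interleaving changes which elements each thread adds up, so the per-thread partial sums only match the chunked version when the data is uniform, as it is in the question. The grand total is the same either way.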
If you just want a simple example to demonstrate the usefulness of C++ AMP, try matrix multiplication.
Of course, the performance you will see is also highly dependent on your GPU hardware model.
In addition to Igor's answer about your specific algorithm, please note that several aspects of how you measure C++ AMP performance in general are wrong (no exclusion of runtime initialization, no discarding of the initial JIT compilation, no warming up of the data, and the already mentioned assumption that p_f_e is synchronous), so please follow our guidelines on measuring C++ AMP performance.
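The guidelines themselves are not reproduced in this thread. As a rough sketch of the warm-up idea only, and not the official procedure, a timing helper can run the workload a few times untimed (so that one-time costs such as initialization and JIT compilation are paid up front) and then average over several timed runs. The name mean_seconds_after_warmup is hypothetical, and the code is plain standard C++.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `warmups` times without timing it, then `runs` times with
// timing, and returns the mean wall-clock seconds per timed run.
template <typename Work>
double mean_seconds_after_warmup(Work&& work, int warmups, int runs)
{
    assert(warmups >= 0 && runs > 0);
    using clock = std::chrono::steady_clock;

    for (int i = 0; i < warmups; ++i)
        work();  // untimed: absorbs one-time startup costs

    const auto start = clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    const auto stop = clock::now();

    return std::chrono::duration<double>(stop - start).count() / runs;
}
```

For a C++ AMP workload, the callable passed as `work` would have to include a blocking step such as sum.synchronize() after parallel_for_each; otherwise only the kernel launch is timed, which is the mistake discussed above.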