Differences in computation accuracy in single / multi-threaded modes (OpenMP)

Can anyone explain why the calculation results differ between single-threaded and multi-threaded mode?

Here is an example that approximates pi:

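#include <cstdio>   // printf
#include <cstdlib>  // system
#include <ctime>    // clock, clock_t
#include <cmath>    // std::sqrt
#include <ppl.h>    // Concurrency::parallel_for, Concurrency::combinable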

const int itera(1000000000);

int main()
{
    printf("PI calculation \nconst int itera = 1000000000\n\n");

    clock_t start, stop;

    //Single thread
    start = clock();
    double summ_single(0);
    for (int n = 1; n < itera; n++)
    {
        summ_single += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time single thread             %f\n", (double)(stop - start) / CLOCKS_PER_SEC);


    //Multithread with OMP
    //Activate OMP in Project settings, C++, Language
    start = clock();
    double summ_omp(0);
#pragma omp parallel for reduction(+:summ_omp)
    for (int n = 1; n < itera; n++)
    {
        summ_omp += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time OMP parallel              %f\n", (double)(stop - start) / CLOCKS_PER_SEC);


    //Multithread with Concurrency::parallel_for
    start = clock();
    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera, [&piParts](int n)
    {
        piParts.local() += 6.0 / (static_cast<double>(n)* static_cast<double>(n)); 
    }); 

    double summ_Conparall(0);
    piParts.combine_each([&summ_Conparall](double locali)
    {
        summ_Conparall += locali;
    });
    stop = clock();
    printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / CLOCKS_PER_SEC);

    printf("\n");
    printf("pi single = %15.12f\n", std::sqrt(summ_single));
    printf("pi omp    = %15.12f\n", std::sqrt(summ_omp));
    printf("pi comb   = %15.12f\n", std::sqrt(summ_Conparall));
    printf("\n");

    system("PAUSE");

}

      

And the results:

PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 11.103000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497


PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency::parallel_for 7.413000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841


PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319


PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489

      

Tests were carried out on AMD and Intel processors, Win 7 x64.

What is the reason for the difference between the pi values computed in single-threaded and multi-threaded mode? Why is the result with Concurrency::parallel_for not constant across builds (compiler version, 32/64-bit platform)?

PS Visual Studio Express does not support OpenMP.

+3
multithreading parallel-processing precision visual-c++ openmp




3 answers


Floating-point addition is not associative because of rounding errors, so the order of the operations matters. It is normal for a parallel program to give results that differ from the serial version. Understanding and dealing with this is part of the art of writing (portable) parallel code. The effect is compounded in 32-bit builds, because in 32-bit mode the VS compiler uses x87 instructions and the x87 FPU performs operations with an internal precision of up to 80 bits. In 64-bit mode, SSE math is used.

In the sequential case, one thread computes $s_1 + s_2 + \dots + s_N$, where $N$ is the number of terms in the expansion.

In the OpenMP case, there are $n$ partial sums, where $n$ is the number of OpenMP threads. Which terms go into each partial sum depends on how the iterations are distributed among the threads. Many OpenMP implementations use static scheduling by default, which means that thread 0 (the main thread) computes $ps_0 = s_1 + s_2 + \dots + s_{N/n}$; thread 1 computes $ps_1 = s_{N/n+1} + s_{N/n+2} + \dots + s_{2N/n}$; and so on. At the end, the reduction combines these partial sums in some unspecified order.



The parallel_for case is very similar to OpenMP. The difference is that by default the iterations are distributed dynamically (see the documentation for auto_partitioner), so each partial sum contains a more or less random selection of terms. This not only gives a slightly different result, it can also give a slightly different result on each execution, i.e. the results of two consecutive parallel_for runs with the same number of threads may differ slightly. If you replace the partitioner with an instance of simple_partitioner and set the chunk size to itera / number-of-threads, you should get the same result as in the OpenMP case, provided the reduction is performed in the same way.

You can use Kahan summation for the per-thread sums and implement your own reduction that also uses Kahan summation. The parallel versions should then give the same result as the serial one, or at least a much closer one.

+6




I would suggest that the parallel reduction OpenMP performs is generally more accurate, because the floating-point rounding error is spread out more evenly. In general, floating-point reductions are problematic because of round-off errors (see http://floating-point-gui.de/), and performing them in parallel is one way to improve accuracy by distributing the round-off error. Imagine a large reduction: at some point the accumulator becomes large compared to the values being added, which increases the rounding error of each addition, because a small value may not be exactly representable at the accumulator's magnitude. If instead several accumulators work on the same reduction in parallel, their values stay smaller and so does the error.



+5




So... In Win32 mode the x87 FPU with 80-bit registers is used. In x64 mode, SSE2 double-precision (64-bit) floating point is used; SSE2 is the default in x64 mode.

In theory, could the calculation in Win32 mode be more accurate? :) See http://en.wikipedia.org/wiki/SSE2. So is the best way to buy new processors with AVX, or to compile to 32-bit code?

-2

