PyOpenCL gets different results compared to numpy

I am trying to get started with pyOpenCL and GPGPU in general.

For the dot product code below, I get quite different results between the GPU and CPU versions. What am I doing wrong?

A ~0.5% difference seems like too much to be explained by floating-point error. The difference seems to grow with array size (the relative difference is ~9e-8 at an array size of 10000). Maybe it's a problem with how the per-block results are combined? Either way, I'm confused.

I don't know if it matters, but I am running this on a MacBook Air with an Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz and Intel HD 5000 graphics.

Thanks in advance.

import pyopencl as cl
import numpy

from pyopencl.reduction import ReductionKernel
import pyopencl.clrandom as cl_rand

ctx   = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Reduction kernel: multiply element-wise (map), then sum the products (reduce)
dot = ReductionKernel(ctx,
                      dtype_out   = numpy.float32,
                      neutral     = "0",
                      map_expr    = "x[i]*y[i]",
                      reduce_expr = "a+b",
                      arguments   = "__global const float *x, __global const float *y")

x = cl_rand.rand(queue, 100000000, dtype = numpy.float32)
y = cl_rand.rand(queue, 100000000, dtype = numpy.float32)

x_dot_y     =       dot(x,y).get()        # GPU: array(25001304.0, dtype=float32)
x_dot_y_cpu = numpy.dot(x.get(), y.get()) # CPU:       24875690.0

print(abs(x_dot_y_cpu - x_dot_y)/x_dot_y)  # 0.0050496689740063489

      



1 answer


The order in which the values are reduced is likely to be very different between the two methods. Across large data sets, tiny floating-point rounding errors can add up quickly. Other details of the underlying implementations may also affect the precision of the result.
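To illustrate how much the summation order alone can matter, here is a minimal sketch in plain Python. It is not what PyOpenCL or NumPy actually do internally: float32 arithmetic is emulated by rounding through `struct`, the helper names (`to_f32`, `sequential_sum_f32`, `pairwise_sum_f32`) are made up for this example, and the array is much smaller than the one in the question. It sums the same products once left to right and once as a balanced tree, and compares both against `math.fsum` as a reference.

```python
import math
import random
import struct


def to_f32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def sequential_sum_f32(values):
    """Add values left to right, rounding to float32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def pairwise_sum_f32(values):
    """Add values as a balanced tree, similar in spirit to a block-wise reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    left = pairwise_sum_f32(values[:mid])
    right = pairwise_sum_f32(values[mid:])
    return to_f32(left + right)


rng = random.Random(0)  # fixed seed so runs are repeatable
n = 100000              # assumption: far smaller than the 100000000 in the question

# Element-wise products of two random float32 vectors, each rounded to float32
products = [to_f32(to_f32(rng.random()) * to_f32(rng.random())) for _ in range(n)]

reference = math.fsum(products)            # accurately rounded float64 sum
sequential = sequential_sum_f32(products)  # one long chain of additions
pairwise = pairwise_sum_f32(products)      # tree-shaped reduction

print("sequential relative error:", abs(sequential - reference) / reference)
print("pairwise relative error:  ", abs(pairwise - reference) / reference)
```

Both sums see exactly the same inputs, so any gap between them comes purely from the order of the additions. The tree-shaped sum usually lands much closer to the reference, because each value passes through only about log2(n) roundings instead of up to n.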

I ran your example code on my machine and got a similar difference in the end result (~0.5%). As a data point, you could implement a very simple dot product in plain Python and see how it differs from both the OpenCL result and the NumPy result.

For example, you could add something simple to your example:

x_dot_y_naive = sum(a*b for a,b in zip(x.get(), y.get()))

      



Here are the results I get on my machine:

OPENCL: 25003466.000000
NUMPY:  24878146.000000 (0.5012%)
NAIVE:  25003465.601387 (0.0000%)

      

As you can see, the naive implementation is closer to the OpenCL result than to the NumPy one. One possible explanation is that NumPy's dot function is using fused multiply-add (FMA) operations, which change how intermediate results are rounded. Without any compiler options to say otherwise, OpenCL must be fully IEEE-754 compliant and cannot use the faster FMA operations.
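As a rough illustration of why a fused multiply-add can give a different answer, the sketch below emulates one in plain Python. This is an emulation under assumptions, not a hardware FMA: the exact value of a*b + c is computed with `fractions.Fraction` and rounded once, and the inputs are chosen by hand so that the effect is visible. It says nothing about what NumPy's dot actually does on a given machine.

```python
from fractions import Fraction

# Hand-picked float64 inputs: the exact product a*b is 1 - 2**-60,
# which is not representable in float64 and rounds to exactly 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Separate multiply and add: the product is rounded first, then c is added.
unfused = a * b + c

# Emulated fused multiply-add: evaluate a*b + c exactly, then round once.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print("unfused:", unfused)
print("fused:  ", fused)
```

The unfused version loses the small term when the product is rounded and returns 0.0, while the single-rounding version keeps it and returns -2**-60. Over a long dot product, many such small per-term differences can accumulate.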
