Nested Loop Implementation in OpenCL?
I'm new to OpenCL and am trying to implement a three-level nested loop as a kernel function. I suspect my understanding is incomplete. Below is the C code for the logic:
void scale(float *output, float *scales, int batch, int n, int size)
{
    int i, j, b;
    for (b = 0; b < batch; ++b) {
        for (i = 0; i < n; ++i) {
            for (j = 0; j < size; ++j) {
                output[(b*n + i)*size + j] *= scales[i];
            }
        }
    }
}
Here output and scales are 1D arrays. Example:
float output[18] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
float scales[9] = {1,0,1,0,1,0,1,0,1};
int n = 9;
int size = 2;
int batch = 1;
Expected output:
1.000000 2.000000 0.000000 0.000000 5.000000 6.000000
0.000000 0.000000 9.000000 1.000000 0.000000 0.000000
4.000000 5.000000 0.000000 0.000000 8.000000 9.000000
Below is my OpenCL kernel:
__kernel void scale_kernel(__global float *output, __global float *biases, int n, int size)
{
    int j = get_global_id(0);
    int i = get_group_id(1);
    int b = get_group_id(2);
    if (j < size) output[(b*n + i)*size + j] *= biases[i];
}
I hope the kernel itself is correct and that the problem is in how I enqueue it. My BLOCK size is 16, and I suspect my understanding of the work sizes is wrong:
size_t global_work_size[3] = {(size-1)/BLOCK + 1, n, batch};
size_t local_work_size[3] = {BLOCK, 1, 1};
cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, local_work_size, 0, 0, NULL);
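One possible explanation, offered as a sketch rather than a confirmed diagnosis: in OpenCL, global_work_size counts work-items per dimension, not work-groups, and in OpenCL 1.x each entry must be a multiple of the matching local_work_size entry. With size = 2 and BLOCK = 16, (size-1)/BLOCK + 1 evaluates to 1, so the launch requests 1 work-item in dimension 0 with a work-group of 16. The usual remedy is to round the global size up to a multiple of the local size and rely on the kernel's `j < size` guard for the padding. The helper names round_up and scale_work_sizes below are hypothetical and not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16  /* work-group size along dimension 0, as in the question */

/* Hypothetical helper: round value up to the next multiple of `multiple`. */
static size_t round_up(size_t value, size_t multiple)
{
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Hypothetical helper: fill the two arrays that would be passed to
   clEnqueueNDRangeKernel for the scale kernel. */
static void scale_work_sizes(size_t size, size_t n, size_t batch,
                             size_t global[3], size_t local[3])
{
    local[0] = BLOCK;
    local[1] = 1;
    local[2] = 1;

    global[0] = round_up(size, BLOCK);  /* total work-items, not group count */
    global[1] = n;                      /* already a multiple of local[1] == 1 */
    global[2] = batch;                  /* already a multiple of local[2] == 1 */
}
```

For the example data (size = 2, n = 9, batch = 1) this yields a global size of {16, 9, 1} with a local size of {16, 1, 1}; the 14 extra work-items per row do nothing because of the guard in the kernel.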
EDIT 1:
Changing global_work_size as shown below produces the expected output. In this case I set local_work_size to NULL, so the runtime picks the work-group size, which may not give the best performance.
size_t global_work_size[3] = {size, n, batch};
cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, NULL, 0, 0, NULL);
Please let me know how to choose global_work_size and local_work_size.
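To check the index mapping without a device, the kernel body can be emulated on the CPU by visiting every point of the global range. The sketch below is an assumption-laden stand-in, not OpenCL code: scale_work_item and emulate_ndrange are hypothetical names, and the three gid arguments play the role of get_global_id(0..2). It uses global IDs for all three indices, because get_group_id(1) and get_group_id(2) match the global IDs only when the local size is 1 in those dimensions, which is not guaranteed when local_work_size is NULL.

```c
#include <stddef.h>

/* CPU stand-in for one work-item of the scale kernel.
   gid0, gid1, gid2 play the role of get_global_id(0), (1), (2). */
static void scale_work_item(float *output, const float *scales,
                            int n, int size,
                            size_t gid0, size_t gid1, size_t gid2)
{
    int j = (int)gid0;
    int i = (int)gid1;
    int b = (int)gid2;

    /* Guard against padded work-items when global[0] exceeds size. */
    if (j < size)
        output[(b * n + i) * size + j] *= scales[i];
}

/* Visit every point of a 3D global range once. A real device runs these
   in an unspecified order; the result is the same here because each
   work-item touches a distinct output element. */
static void emulate_ndrange(float *output, const float *scales,
                            int n, int size, const size_t global[3])
{
    for (size_t z = 0; z < global[2]; ++z)
        for (size_t y = 0; y < global[1]; ++y)
            for (size_t x = 0; x < global[0]; ++x)
                scale_work_item(output, scales, n, size, x, y, z);
}
```

Running emulate_ndrange on the example arrays with a global range of either {2, 9, 1} or the padded {16, 9, 1} should reproduce the expected output listed in the question, which suggests the padding itself is harmless as long as the guard stays in the kernel.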