OpenCL matrix transpose with memory coalescing

I am currently trying to implement a matrix transpose in OpenCL with memory coalescing.

I've already implemented the transpose in a "simple" way, which worked fine. When I tried to do the same thing using local memory for coalescing, I was hoping to see a slight improvement in runtime, but my implementation is actually slower than the simple one (the implementation is correct, it's just inefficient). I think I haven't quite figured out how to ensure that horizontally adjacent work items write to horizontally adjacent addresses.

Here is the kernel for my coalesced implementation:

__kernel void MatrixTranspose(__global const float* Matrix,
                              __global float* MatrixTransposed,
                              uint Width, uint Height,
                              __local float* block)
{
    int2 GlobalID;
    GlobalID.x = get_global_id(0);
    GlobalID.y = get_global_id(1);

    int2 LocalID;
    LocalID.x = get_local_id(0);
    LocalID.y = get_local_id(1);

    // Coalesced read: adjacent work items read adjacent source addresses.
    block[LocalID.y * get_local_size(0) + LocalID.x] = Matrix[GlobalID.y * Width + GlobalID.x];

    barrier(CLK_LOCAL_MEM_FENCE);

    int2 groupId;
    groupId.x = get_group_id(0);
    groupId.y = get_group_id(1);

    int2 localSize;
    localSize.x = get_local_size(0);
    localSize.y = get_local_size(1);

    // Write each element to its transposed (column-reversed) position.
    MatrixTransposed[Height * (LocalID.x + groupId.x * localSize.x)
                     + Height - (LocalID.y + groupId.y * localSize.y) - 1]
        = block[LocalID.y * localSize.x + LocalID.x];
}

I hope someone can give me some advice, thanks :)


2 answers


Unfortunately, you will be bound by your device's global memory read and write speed. Usually you transpose a matrix as part of a larger calculation, and that surrounding work helps hide the latency. In your example you read into local memory, wait at a barrier, and write back to global memory, which only adds an extra step plus the complexity of using local memory.

You have to do something with the data while it is in local memory if you want to hide the global memory latency.

If all you want to do is transpose the matrix, just read it from global memory and write it straight to the target location in global memory. Take a look at async_work_group_copy if you still want to try using local memory; a minimal example follows.
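For reference, here is a minimal sketch of staging a row of the matrix into local memory with async_work_group_copy (the kernel name and arguments are illustrative, not from the original post):

    __kernel void StageRow(__global const float* Matrix,
                           uint Width,
                           __local float* row)
    {
        // All work items in the group cooperatively copy one full row
        // of the source matrix into local memory. Every work item must
        // reach this call with identical arguments.
        event_t e = async_work_group_copy(row,
                                          Matrix + get_group_id(1) * Width,
                                          Width, (event_t)0);

        // Block until the asynchronous copy has completed.
        wait_group_events(1, &e);

        // ... operate on row[0 .. Width-1] here ...
    }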

Now for my answer.



Try to make each work item responsible for more than one float. If a work item reads a 4x4 region, it can transpose it entirely in private memory. This not only bypasses local memory, it also removes the need for a barrier and reduces the number of work items you need by 16x.

Steps (a sketch follows after the list):

  • compute the source and destination addresses in global memory
  • load four float4 values from global memory
  • transpose the 4x4 floats by regrouping their x, y, z, and w components appropriately
  • store the four float4 values at the new location in global memory
  • handle the edge regions of the matrix in a separate kernel, or in the host program, for matrices whose dimensions are not multiples of four (or pad your input matrix to make it a multiple of 4)
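Here is a minimal sketch of that scheme (the kernel and argument names are mine, and it assumes Width and Height are multiples of 4):

    __kernel void Transpose4x4(__global const float4* src,
                               __global float4* dst,
                               uint Width, uint Height)
    {
        // Each work item transposes one 4x4 tile. Width and Height are
        // in floats, so the row strides in float4 units are Width/4
        // (source) and Height/4 (destination).
        uint tx = get_global_id(0);        // tile column
        uint ty = get_global_id(1);        // tile row
        uint srcStride = Width / 4;
        uint dstStride = Height / 4;

        // Load the four rows of the tile into private memory.
        float4 r0 = src[(ty * 4 + 0) * srcStride + tx];
        float4 r1 = src[(ty * 4 + 1) * srcStride + tx];
        float4 r2 = src[(ty * 4 + 2) * srcStride + tx];
        float4 r3 = src[(ty * 4 + 3) * srcStride + tx];

        // Transpose the tile by regrouping components: column i of the
        // source tile becomes row i of the destination tile.
        float4 c0 = (float4)(r0.x, r1.x, r2.x, r3.x);
        float4 c1 = (float4)(r0.y, r1.y, r2.y, r3.y);
        float4 c2 = (float4)(r0.z, r1.z, r2.z, r3.z);
        float4 c3 = (float4)(r0.w, r1.w, r2.w, r3.w);

        // Store the transposed tile: tile (tx, ty) lands at (ty, tx).
        dst[(tx * 4 + 0) * dstStride + ty] = c0;
        dst[(tx * 4 + 1) * dstStride + ty] = c1;
        dst[(tx * 4 + 2) * dstStride + ty] = c2;
        dst[(tx * 4 + 3) * dstStride + ty] = c3;
    }

Launched with a global size of (Width/4, Height/4), this needs no barrier and no local memory.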


Read the columns from the original matrix and write them as rows into local memory (to avoid memory bank conflicts), then store the rows from local memory into the transposed matrix.

This way both write operations are coalesced (adjacent work items write to adjacent memory cells); the reads, however, are not as good. A sketch of this scheme follows below.
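A minimal sketch, assuming a square work group of TILE x TILE, matrix dimensions that are multiples of TILE, and a plain transpose (without the column reversal in your kernel):

    #define TILE 16

    __kernel void TransposeCoalesced(__global const float* src,
                                     __global float* dst,
                                     uint Width, uint Height)
    {
        // The extra column avoids local memory bank conflicts when a
        // column of the tile is read back below.
        __local float tile[TILE][TILE + 1];

        int lx = get_local_id(0);
        int ly = get_local_id(1);

        // Coalesced read: adjacent work items read adjacent addresses.
        int x = get_group_id(0) * TILE + lx;
        int y = get_group_id(1) * TILE + ly;
        tile[ly][lx] = src[y * Width + x];

        barrier(CLK_LOCAL_MEM_FENCE);

        // Swap the roles of the two group indices so that adjacent
        // work items also write adjacent addresses in dst (coalesced),
        // and fetch the transposed element from local memory.
        x = get_group_id(1) * TILE + lx;   // column in dst
        y = get_group_id(0) * TILE + ly;   // row in dst
        dst[y * Height + x] = tile[lx][ly];
    }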



By the way, what device do you have? If it is good with vector operations, use the vload / vstore operations; they can significantly improve I/O performance.
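For example (a hypothetical fragment, assuming the total number of floats is a multiple of 4), each work item can move four consecutive floats at once:

    __kernel void CopyVec4(__global const float* src, __global float* dst)
    {
        size_t i = get_global_id(0);   // index in float4-sized chunks
        float4 v = vload4(i, src);     // reads src[4*i .. 4*i+3]
        vstore4(v, i, dst);            // writes dst[4*i .. 4*i+3]
    }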
