What's the most efficient way to compute the id / lane warp in a one dimensional grid?

Question

What's the most efficient way to compute the id / lane warp in a one dimensional grid?

In CUDA, each thread knows its own block index in the table and the thread index inside the block. But two important values don't seem to be explicitly available to it:

Its index as a strip within its stem (its "strip id")
Warp index, which is the band within the block (its "warp id")

Assuming that the mesh is 1-dimensional (aka linear, that is, blockDim.y

and blockDim.z

are equal to 1), we can obviously get them as follows:

enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;

and if you don't trust the compiler to optimize it, you can rewrite it as:

enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;

is it the most effective thing? It still seems like it takes a lot of waste for each stream to figure this out.

_{(inspired by this question .)}

+4

optimization cuda ptx

einpoklum 02 june 17 at 20:55

source to share

3 answers

The other answer is very dangerous ! Calculate lane ID and base ID.

#include <cuda.h>
#include <iostream>

inline __device__ unsigned get_lane_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
  return ret;
}

inline __device__ unsigned get_warp_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %warpid;" : "=r"(ret));
  return ret;
}

__global__ void kernel() {
  const int actual_warpid = get_warp_id();
  const int actual_laneid = get_lane_id();
  const int expected_warpid = threadIdx.x / 32;
  const int expected_laneid = threadIdx.x % 32;
  if (expected_laneid == 0) {
    printf("[warp:] actual: %i  expected: %i\n", actual_warpid,
           expected_warpid);
    printf("[lane:] actual: %i  expected: %i\n", actual_laneid,
           expected_laneid);
  }
}

int main(int argc, char const *argv[]) {
  dim3 grid(8, 7, 1);
  dim3 block(4 * 32, 1);

  kernel<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}

which gives something like

[warp:] actual: 4  expected: 3
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
[warp:] actual: 12  expected: 1
[warp:] actual: 4  expected: 3
[warp:] actual: 0  expected: 0
[warp:] actual: 13  expected: 2
[warp:] actual: 12  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 13  expected: 2
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
...
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0

see also PTX docs

A predefined read-only special register that returns the warp thread ID. The warp identifier provides a unique warp number within the CTA, but not through the CTA in the grid. The warp identifier will be the same for all threads within the same framework.

Note that% warpid is volatile and returns the location of the stream at the time it was read, but its value can change at runtime, for example, due to reallocation of threads after a pause.

Hence, this is the scheduler's warp-id, with no guarantee that it matches the virtual warp-id (started counting from 0).

The docs make it clear :

For this reason,% ctaid and% tid should be used to compute the virtual warp index if such a value is needed in kernel code; % warpid is mainly intended to enable the profiling and diagnostic code to display and log information such as job mapping and load balancing.

If you think it's okay, let's use CUB for that: it even affects cub::WarpId()

Returns the warp id of the calling thread. The Warp ID is guaranteed to be unique among the skews, but may not match the ranked zero in the flow block.

EDIT: Usage %laneid

seems to be safe.

0

Patwie 11 oct. '18 at 8:42

source to share

@Patwie My experience is that lane id is not secure either. The difference between mine and your code is that my code depends on the IDs of all the lines in the flow block. My code did not return correct results when using% laneid. However, there is no problem with threadIdx.x% 32. There is one thread measurement unit specified in my code.

0

palebluedot Dec 18 '18 at 9:13

source to share

einpoklum · Accepted Answer · 2017-06-02T20:55:12+0000

Naive computation is currently the most efficient one.

Note. This answer has been heavily edited.

It is very tempting to try to avoid computation - as these two values seem to be already available if you look under the hood.

You see, nVIDIA GPUs have special registers that your (compiled) code can read to access various kinds of useful information. One such register contains threadIdx.x

; the other contains blockDim.x

; the other is the hour counter; etc. C ++, since the language does not have these open source, obviously; and, in fact, no CUDA. However, the intermediate representation into which the CUDA code is compiled, called PTX , exposes these special registers (since PTX 1.3, that is, with CUDA versions> = 2.1).

Two of these special registers are %warpid

and %laneid

. CUDA now supports embedding PTX code in CUDA code with the keyword asm

- just like it can be used for host-side code to directly emit CPU build instructions. With this mechanism, you can use these special registers:

__forceinline__ __device__ unsigned lane_id()
{
    unsigned ret; 
    asm volatile ("mov.u32 %0, %laneid;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned warp_id()
{
    // this is not equal to threadIdx.x / 32
    unsigned ret; 
    asm volatile ("mov.u32 %0, %warpid;" : "=r"(ret));
    return ret;
}

... but there are two problems here.

The first problem, as @Patwie suggests, is that it %warp_id

doesn't give you what you actually want - it's not the warp index in the context of the mesh, but rather in the context of the physical SM (which can hold so many restorers at once). and the two are not the same. So don't use%warp_id

.

As far %lane_id

as it goes , it gives you the correct value, but it is misleading %lane_id

: even if it is a "register", it is not like the regular registers in your 1-clock access delay register file. It is a special register that is retrievedS2R

in real hardware using an instruction that can exhibit long latency.

Bottom line: just compute the warp id and thread id from the thread id. We can't get around this - yet.

What's the most efficient way to compute the id / lane warp in a one dimensional grid?

Naive computation is currently the most efficient one.

More articles: