Efficient all-pairs set intersection on the GPU

I have n sets, subsets of a finite universe. I want to compute an n*n matrix where the entry (I, J) contains the cardinality of the intersection of set I and set J. n is on the order of 50000.

My idea is to split the matrix into blocks small enough to have one thread per entry. Each thread would compute the intersection with bitwise AND.

Are there better approaches to solving this problem?





3 answers


I'm going to assume that you want to compute it as you described: by actually computing the intersection of each pair of sets using bitwise AND and bit counting.

With the right mathematical setup, this is literally computing the outer product of two vectors, so I'll be thinking in terms of high-performance linear algebra.

The key to performance will be reducing memory traffic, which means keeping things in registers wherever you can. By far the most important factor is that your elements are huge; it takes 6250 32-bit words to store one set! An entire compute capability 3.0 multiprocessor, for example, can only hold about 10 sets in registers.

What you probably want to do instead is spread each element across an entire thread block. With 896 threads per block and 7 registers per thread, you can store one set of 200704 elements. With compute capability 3.0, you have about 36 registers available per thread.



The simplest implementation would be for each block to handle one row of the output matrix. It loads the corresponding element of the second vector and stores it in registers, then iterates over all the elements of the first vector, computing each intersection, counting and reducing the set bits, and storing the result in the output matrix.

This optimization should reduce the total amount of data read by a factor of 2 and can therefore roughly double performance.
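A minimal sketch of that scheme, assuming each set is stored bit-packed as a row of `words` 32-bit words in row-major arrays sets_a ("first vector") and sets_b ("second vector"); all names and the launch configuration are illustrative, not something fixed by the question:

    #define WORDS_PER_THREAD 7   // 896 threads x 7 words x 32 bits = 200704 universe elements

    // Block j computes row j of the output: it caches set j of sets_b in registers,
    // then walks over every set of sets_a, popcounting the bitwise AND.
    // Assumes words <= blockDim.x * WORDS_PER_THREAD.
    __global__ void row_per_block(const unsigned *sets_a, const unsigned *sets_b,
                                  int n, int words, unsigned *out)
    {
        unsigned my_b[WORDS_PER_THREAD];              // this thread's slice of set j
        int j = blockIdx.x;
        for (int k = 0; k < WORDS_PER_THREAD; ++k) {
            int w = threadIdx.x * WORDS_PER_THREAD + k;
            my_b[k] = (w < words) ? sets_b[(size_t)j * words + w] : 0u;
        }

        __shared__ unsigned block_sum;

        for (int i = 0; i < n; ++i) {                 // every set of the first list
            if (threadIdx.x == 0) block_sum = 0;
            __syncthreads();

            unsigned partial = 0;                     // popcount over this thread's words
            for (int k = 0; k < WORDS_PER_THREAD; ++k) {
                int w = threadIdx.x * WORDS_PER_THREAD + k;
                if (w < words)
                    partial += __popc(sets_a[(size_t)i * words + w] & my_b[k]);
            }
            atomicAdd(&block_sum, partial);           // reduce across the block
            __syncthreads();

            if (threadIdx.x == 0)
                out[(size_t)j * n + i] = block_sum;   // |set_b[j] AND set_a[i]|
        }
    }

This would be launched as something like row_per_block<<<n, 896>>>(...); the shared-memory atomic is just the simplest possible reduction and could be replaced by a proper tree reduction.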

It would be better still for each block to own 3-4 rows of the output matrix at once, loading the corresponding 3-4 elements of the second vector into registers. The block then iterates over all the elements of the first vector, and for each one computes the 3-4 intersections it can, storing the results in the output matrix.

This optimization reduces memory traffic by an additional factor of 3-4.
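The same sketch extended to 4 rows per block, so that each word of the first vector is read from memory once but used for 4 intersections; again the names are mine, and it assumes n is a multiple of ROWS:

    #define ROWS 4
    #define WORDS_PER_THREAD 7

    // Block b computes rows b*ROWS .. b*ROWS+3 of the output, caching the
    // corresponding 4 sets of sets_b in registers (4 x 7 = 28 registers per thread).
    __global__ void rows_per_block(const unsigned *sets_a, const unsigned *sets_b,
                                   int n, int words, unsigned *out)
    {
        unsigned my_b[ROWS][WORDS_PER_THREAD];
        int j0 = blockIdx.x * ROWS;
        for (int r = 0; r < ROWS; ++r)
            for (int k = 0; k < WORDS_PER_THREAD; ++k) {
                int w = threadIdx.x * WORDS_PER_THREAD + k;
                my_b[r][k] = (w < words) ? sets_b[(size_t)(j0 + r) * words + w] : 0u;
            }

        __shared__ unsigned block_sum[ROWS];

        for (int i = 0; i < n; ++i) {
            if (threadIdx.x < ROWS) block_sum[threadIdx.x] = 0;
            __syncthreads();

            unsigned partial[ROWS] = {0};
            for (int k = 0; k < WORDS_PER_THREAD; ++k) {
                int w = threadIdx.x * WORDS_PER_THREAD + k;
                if (w >= words) continue;
                unsigned a = sets_a[(size_t)i * words + w];   // read once...
                for (int r = 0; r < ROWS; ++r)
                    partial[r] += __popc(a & my_b[r][k]);     // ...used ROWS times
            }
            for (int r = 0; r < ROWS; ++r)
                atomicAdd(&block_sum[r], partial[r]);
            __syncthreads();

            if (threadIdx.x < ROWS)
                out[(size_t)(j0 + threadIdx.x) * n + i] = block_sum[threadIdx.x];
        }
    }

The launch shrinks to rows_per_block<<<n / ROWS, 896>>>(...), and the reads of sets_a drop by the same factor of ROWS.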





A completely different approach would be to work with each element of the universe individually: for each element of the universe, you calculate which sets actually contain that element, and then (atomically) increment the corresponding entries in the output matrix.

Asymptotically, this should be much more efficient than calculating set intersections. Unfortunately, this is difficult to implement efficiently.
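A deliberately naive sketch of that idea, just to make the counting scheme concrete (the bit-packed layout, the scratch buffer, and all names are assumptions of mine; making this efficient is exactly the hard part):

    // One thread per universe element: find every set containing that element,
    // then atomically bump the output entry for every pair of those sets.
    // scratch must hold max_hits ints per universe element, and max_hits must be
    // at least the number of sets containing any single element.
    __global__ void per_element_count(const unsigned *sets, int n, int words,
                                      int universe_size, int *scratch, int max_hits,
                                      unsigned *A)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= universe_size) return;

        int word = e / 32, bit = e % 32;
        int *hits = scratch + (size_t)e * max_hits;   // sets that contain element e
        int count = 0;

        for (int s = 0; s < n && count < max_hits; ++s)
            if ((sets[(size_t)s * words + word] >> bit) & 1u)
                hits[count++] = s;

        for (int a = 0; a < count; ++a)               // every ordered pair of hits
            for (int b = 0; b < count; ++b)
                atomicAdd(&A[(size_t)hits[a] * n + hits[b]], 1u);
    }
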


An improvement is to work with, say, four elements of the universe at a time: you split all of your sets into 16 buckets, depending on which of those four elements each set contains. Then, for each of the 16 * 16 possible bucket pairs, you iterate over all pairs of sets drawn from the two buckets and (atomically) update the corresponding matrix entries.



It should be even faster than the version above, but it can still be difficult to implement.


To reduce the difficulty of getting all the synchronization right, you can split the input sets into k groups of n/k sets each. Then the (i,j)-th thread (or warp, or block) only performs updates for the corresponding block of the output matrix.





Another way of splitting the problem is to divide the universe into smaller sections of, say, 1024 elements each, and compute only the sizes of the intersections within that part of the universe.

I'm not sure if I described it well; basically you compute

A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)

      

where v and w are the two lists of sets, and k in S is 1 if true and 0 otherwise. The point is to rearrange the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive values of k at a time rather than one at a time.

That is, you initialize the output matrix to all zeros, and for each block of 1024 universe elements you compute the partial intersection sizes and accumulate the results into the output matrix.
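A rough sketch of one such accumulation pass, assuming each set's 1024-element slice for the current chunk is laid out as 32 consecutive 32-bit words (names and launch shape are mine):

    #define CHUNK_WORDS 32   // 1024 universe elements per chunk

    // One pass over a single chunk of the universe: thread (i, j) adds the size of
    // the intersection of v[i] and w[j] restricted to this chunk into A[i][j].
    // Launch once per chunk, e.g. <<<dim3((n+15)/16, (n+15)/16), dim3(16, 16)>>>.
    __global__ void add_chunk(const unsigned *v_chunk,   // n rows x CHUNK_WORDS words
                              const unsigned *w_chunk,   // n rows x CHUNK_WORDS words
                              int n, unsigned *A)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= n || j >= n) return;

        unsigned partial = 0;
        for (int w = 0; w < CHUNK_WORDS; ++w)
            partial += __popc(v_chunk[(size_t)i * CHUNK_WORDS + w] &
                              w_chunk[(size_t)j * CHUNK_WORDS + w]);

        A[(size_t)i * n + j] += partial;                 // accumulate across chunks
    }

This per-pair version reads far more than it needs to; the rest of the answer is about organizing those reads so the kernel stops being memory bound.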

I chose 1024 because I am guessing at your data layout; it is probably the smallest size at which you can still get full memory bandwidth when reading from device memory with all the threads in a warp working together. (Adjust this if you know your data better than I do, or if you are not using an nVidia GPU and whatever GPU you do use works better with something else.)

Now that your elements are a reasonable size, you can turn to traditional linear algebra optimizations to compute this product. I would probably do the following:

Each warp is assigned a large number of rows in the output matrix. It reads the corresponding elements from the second vector and then iterates through the first vector, calculating the products.

You could let all the warps work independently of each other, but it might be better to do the following:

  • All warps in the block work together to load some number of elements of the first vector
  • Each warp computes the intersections it can and writes the results to the output matrix

You could keep the loaded elements in shared memory, but you may get better results by keeping them in registers. Each warp can only compute intersections against the set elements it is holding, but afterwards the warps can rotate which warp holds which elements.

If you do enough optimizations along these lines, you will probably reach a point where you are no longer memory bound, which means you may not even need the most complicated optimizations (for example, the shared memory approach just mentioned may already be enough).









