How to scale Cholesky factorization across multiple GPUs

I have implemented Cholesky factorization for solving a large linear system on a GPU using the ATI Stream SDK. Now I want to harness the processing power of more GPUs and run this code across multiple GPUs.

Currently I have one computer with a single GPU installed, and Cholesky factorization runs correctly on it. I want to extend this to N machines, each with one GPU installed. Please suggest how I should proceed.





3 answers


First, you should be aware that this approach will introduce three levels of latency for any communication between nodes:

  • GPU memory on machine 1 to main memory on machine 1
  • Main memory on machine 1 to main memory on machine 2
  • Main memory on machine 2 to GPU memory on machine 2

A good first step is to do some back-of-the-envelope calculations to determine whether the speedup you gain by splitting the problem across multiple machines outweighs the communication latency you introduce.
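To make that concrete, here is a rough sketch of such an estimate in C. The matrix size, GPU throughput, and bandwidth figures are placeholder assumptions, not measurements; substitute numbers for your own hardware and network.

    /* Back-of-the-envelope comparison: time spent computing Cholesky on one
       GPU versus time spent just moving the matrix between machines.
       All figures below are assumptions; measure your own hardware. */
    #include <stdio.h>

    int main(void) {
        double n = 20000.0;            /* matrix dimension (assumed) */
        double gpu_gflops = 300.0;     /* sustained double-precision GFLOP/s (assumed) */
        double pcie_gbps  = 5.0;       /* GPU <-> host memory bandwidth, GB/s (assumed) */
        double net_gbps   = 1.0;       /* host <-> host network bandwidth, GB/s (assumed) */

        double flops = n * n * n / 3.0;        /* ~n^3/3 flops for Cholesky */
        double bytes = n * n * 8.0;            /* one full matrix of doubles */

        double t_compute = flops / (gpu_gflops * 1e9);
        double t_pcie    = 2.0 * bytes / (pcie_gbps * 1e9);  /* copy down and back */
        double t_net     = bytes / (net_gbps * 1e9);

        printf("compute: %.2f s, PCIe copies: %.2f s, network: %.2f s\n",
               t_compute, t_pcie, t_net);
        printf("splitting helps only if the compute saved exceeds the copy cost\n");
        return 0;
    }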



Once you are sure this is the approach you want to take, you still have to implement it correctly. Note that NVIDIA's CUDA or OpenCL are currently your best bets, as they let you access the GPU for computation without tying it to an X session. Once ATI's OpenCL implementation supports GPUs, that should be a viable option as well.
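For example, a minimal OpenCL host snippet along the following lines (an illustrative sketch, not part of the original post) enumerates the GPUs a platform exposes without needing any X session; link it with -lOpenCL.

    /* List the GPU devices visible through the first OpenCL platform. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint nplat = 0, ndev = 0;

        if (clGetPlatformIDs(1, &platform, &nplat) != CL_SUCCESS || nplat == 0) {
            fprintf(stderr, "no OpenCL platform found\n");
            return 1;
        }
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &ndev) != CL_SUCCESS) {
            fprintf(stderr, "no GPU devices exposed by this platform\n");
            return 1;
        }
        for (cl_uint i = 0; i < ndev; ++i) {
            char name[256];
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("GPU %u: %s\n", (unsigned)i, name);
        }
        return 0;
    }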

Since you already have a working single-GPU implementation, here are the basic steps you should follow:

  • Work out how to adapt your factorization algorithm so that the work can be split across individual nodes.
  • Set up communication between the N machines (I see MPI has been chosen for this)
  • Set up a scatter operation that divides the input problem among the compute nodes
  • Set up communication between each machine and its GPU
  • Set up a gather operation that collects the results from the nodes back onto a single node (a skeleton of this scatter/compute/gather pattern is sketched after this list)
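Below is a bare MPI skeleton in C that maps the scatter / local-GPU-compute / gather steps onto MPI primitives. The chunk size and the factor_block_on_gpu() helper are hypothetical placeholders for your existing Stream SDK kernel, and a real blocked Cholesky needs repeated panel broadcasts rather than a single scatter and gather, so treat this purely as the communication skeleton.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder for the per-node GPU work; assumed to exist elsewhere. */
    static void factor_block_on_gpu(double *block, int count) {
        (void)block; (void)count;   /* launch the existing GPU kernel here */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int block_elems = 1024 * 1024;   /* per-node chunk size (assumed) */
        double *full = NULL;
        if (rank == 0) {
            /* Root holds the whole problem and scatters one chunk per node. */
            full = malloc((size_t)block_elems * nprocs * sizeof(double));
            /* ... fill 'full' with the input matrix here ... */
        }

        double *local = malloc((size_t)block_elems * sizeof(double));

        /* Divide the input problem among the compute nodes. */
        MPI_Scatter(full, block_elems, MPI_DOUBLE,
                    local, block_elems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Each machine hands its chunk to its own GPU. */
        factor_block_on_gpu(local, block_elems);

        /* Collect the partial results back onto the root node. */
        MPI_Gather(local, block_elems, MPI_DOUBLE,
                   full, block_elems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            /* ... assemble the factorization from the gathered pieces ... */
            free(full);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }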




This is a very specialized problem. I suggest you check the Stream Developer Resources and the Stream Developer Forums.







I showed this question to a colleague who knows about these things. He suggested using ScaLAPACK.
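ScaLAPACK provides a distributed Cholesky factorization (pdpotrf) on top of MPI/BLACS. As a rough, hedged sketch of what calling it from C looks like: the grid shape, matrix size, and block size below are arbitrary assumptions, the extern declarations are hand-written because ScaLAPACK ships no standard C header, and the program must be run with exactly nprow*npcol MPI processes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* BLACS / ScaLAPACK entry points, declared by hand. */
    extern void Cblacs_pinfo(int *mypnum, int *nprocs);
    extern void Cblacs_get(int icontxt, int what, int *val);
    extern void Cblacs_gridinit(int *icontxt, const char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
    extern void Cblacs_gridexit(int icontxt);
    extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
    extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                          int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
    extern void pdpotrf_(const char *uplo, int *n, double *a, int *ia, int *ja,
                         int *desca, int *info);

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int n = 4096, nb = 64;        /* global matrix size and block size (assumed) */
        int nprow = 2, npcol = 2;     /* 2x2 process grid; adjust to your node count */
        int iam, nprocs, ctxt, myrow, mycol, info;
        int izero = 0, ione = 1;

        Cblacs_pinfo(&iam, &nprocs);
        Cblacs_get(-1, 0, &ctxt);
        Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
        Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

        /* Size of this process's local block-cyclic piece of the matrix. */
        int locr = numroc_(&n, &nb, &myrow, &izero, &nprow);
        int locc = numroc_(&n, &nb, &mycol, &izero, &npcol);
        int lld  = locr > 1 ? locr : 1;

        int desca[9];
        descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

        /* Fill the local piece with your distributed SPD matrix here. */
        double *a = calloc((size_t)locr * locc, sizeof(double));

        pdpotrf_("L", &n, a, &ione, &ione, desca, &info);   /* distributed Cholesky */
        if (iam == 0) printf("pdpotrf info = %d\n", info);

        free(a);
        Cblacs_gridexit(ctxt);
        MPI_Finalize();
        return 0;
    }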









