Should I be using CUDA here?

I need to multiply a very small matrix (10x10) by a vector somewhere between 50,000 and 100,000 times (maybe even more). This happens for 1,000 different matrices (there could be many more). Will there be any significant performance gain from performing this operation with CUDA?

+3




3 answers


Yes, this is an ideal GPU task.



+4




If you want to multiply one matrix by a vector 50K times and each multiplication depends on the result of the previous one, then don't use CUDA. That is a serial problem, best suited to the CPU. However, if each multiplication is independent, you can perform them all simultaneously with CUDA.



The only case where your program will see a huge speedup is when each matrix-vector multiplication is independent of the data from the other iterations. Then you can run 50K or more of them at the same time by launching an equal number of threads.
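For the independent case, a minimal sketch of such a kernel might look like the following. Everything here (array names, `N`, `numVecs`, the row-major layout) is illustrative rather than taken from the question: one thread handles one vector, and the same 10x10 matrix is applied to the whole batch.

```
// Minimal sketch for the independent case: one thread per vector,
// one 10x10 matrix (row-major) applied to a whole batch of vectors.
// All names and sizes are illustrative.
#include <cuda_runtime.h>

#define N 10  // matrix/vector dimension

__global__ void matvec_batched(const float* A,  // N*N matrix, row-major
                               const float* x,  // numVecs input vectors of length N
                               float* y,        // numVecs output vectors of length N
                               int numVecs)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;  // index of the vector this thread owns
    if (v >= numVecs) return;

    const float* xv = x + v * N;
    float*       yv = y + v * N;

    for (int i = 0; i < N; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < N; ++j)
            sum += A[i * N + j] * xv[j];
        yv[i] = sum;
    }
}

int main()
{
    const int numVecs = 50000;
    float *dA, *dx, *dy;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dx, numVecs * N * sizeof(float));
    cudaMalloc(&dy, numVecs * N * sizeof(float));
    // ... copy your matrix and vectors into dA and dx with cudaMemcpy ...

    int threads = 256;
    int blocks  = (numVecs + threads - 1) / threads;
    matvec_batched<<<blocks, threads>>>(dA, dx, dy, numVecs);
    cudaDeviceSynchronize();

    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Since every thread reads the same small matrix, loading it into shared or constant memory is an easy further optimization; cuBLAS's batched routines are another option if you would rather not maintain your own kernel.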

+1




Depending on what exactly you are doing, yes, it can be done very quickly on the GPU, but you may need to write your own kernel to get good performance out of it.

Without knowing more about your problem, I can't give you much more specific advice, but here is one possible approach:

If you are taking one vector and multiplying it by the same matrix several thousand times, you are much better off finding a closed form for an arbitrary power of the matrix. You can do this using the Cayley-Hamilton theorem or the Jordan canonical form.

I can't seem to find an implementation of this from a quick googling, but given that I did it in first-year linear algebra, it's not too bad. Some information about the Jordan normal form and how to raise it to a power can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers , and the transformation matrix is just the matrix of (generalized) eigenvectors together with its inverse.

Say you have a matrix A, and you find its Jordan normal form J and the transformation matrices P and P^-1. Then

A^n = P J^n P^-1
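To see why J^n is cheap to write down, here is the standard closed form for the power of a single Jordan block (this is the result on the Wikipedia page linked above), illustrated for a 3x3 block:

```
% n-th power of a k-by-k Jordan block with eigenvalue \lambda:
% the (i,j) entry is \binom{n}{j-i}\lambda^{n-(j-i)} for j >= i, and 0 below
% the diagonal.  Shown here for k = 3:
\[
J_3(\lambda)^n =
\begin{pmatrix}
\lambda & 1 & 0 \\
0 & \lambda & 1 \\
0 & 0 & \lambda
\end{pmatrix}^{\!n}
=
\begin{pmatrix}
\lambda^n & \binom{n}{1}\lambda^{n-1} & \binom{n}{2}\lambda^{n-2} \\
0 & \lambda^n & \binom{n}{1}\lambda^{n-1} \\
0 & 0 & \lambda^n
\end{pmatrix},
\qquad
A^n = P \, J^n \, P^{-1}.
\]
```

So raising the 10x10 matrix to the n-th power only requires filling in a handful of binomial-coefficient entries per block, rather than performing n multiplications.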

I can't find a good link for an implementation, but computing the closed form of a 10x10 matrix would be significantly less time consuming than 50,000 matrix-vector multiplications, and it will probably run much faster on the CPU.

If this is your problem, you should look into it.

+1

