How to speed up adding sparse arrays

For data modeling purposes, I am looking for an efficient way to compute a weighted sum of sparse matrices. Basically, I have a data cube of Nx x Ny x Nz double values, where Nx and Ny are on the order of 4000 and Nz is on the order of several million. All Nx x Ny submatrices are very sparse (each holds only a small dense block of roughly 40 data values). Now I want to collapse the data cube along the Z direction by weighting every submatrix and summing them all up. The process is illustrated in the figure (caption: "weighted sum"). For my simulations, all matrices stay fixed and only the weights change, producing different Nx x Ny result sets.
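Written out, the operation for every pixel is simply (with w_z denoting the weights and M_z the z-th Nx x Ny slice):

    image(x, y) = sum over z = 0 .. Nz-1 of  w_z * M_z(x, y)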

This is what I tried: a naive sparse-array implementation in C++ and a simple sum.

sparsearray3d.h

#ifndef SPARSEARRAY3D_H
#define SPARSEARRAY3D_H

#include <vector>

// one non-zero entry of the data cube
struct data{
    unsigned short int x;   // 0 <= x < Nx (Nx ~ 4000 fits in unsigned short)
    unsigned short int y;   // 0 <= y < Ny
    int z;                  // 0 <= z < Nz (Nz can be millions)
    double value;
};

class sparsearray3d
{
public:
    sparsearray3d();
    void createRandomData(int Nx, int Ny, int Nz, int bwidthX, int bwidthY);
    void sumData();
    int Nx,Ny,Nz;

    std::vector<data> dd;
    std::vector<std::vector<double> > image;
    std::vector<double> weights;
};

#endif // SPARSEARRAY3D_H


sparsearray3d.cpp

#include "sparsearray3d.h"
#include <stdlib.h>     /* srand, rand */
#include <stdio.h>      /* printf, scanf, puts, NULL */


sparsearray3d::sparsearray3d()
{
    this->Nx = 0;
    this->Ny = 0;
    this->Nz = 0;
}

void sparsearray3d::createRandomData(int Nx, int Ny, int Nz, int bwidthX = 5, int bwidthY = 5)
{
    // create random data
    this->weights.reserve(Nz);   // reserve only: resize() plus the push_back below would double the vector and leave the first Nz weights at zero

    this->image.resize( Nx , std::vector<double>( Ny , 0. ) );

    this->Nx = Nx;
    this->Ny = Ny;
    this->Nz = Nz;

    for(int i=0; i<Nz; ++i)
    {
        int x0 = rand() % (Nx-bwidthX);
        int y0 = rand() % (Ny-bwidthY);

        this->weights.push_back((double) rand() / (RAND_MAX));

        for(int j=0; j<bwidthX; ++j)
        {
            for(int k=0; k<bwidthY; ++k)
            {
                // casts avoid narrowing errors in the braced initializer
                this->dd.push_back({(unsigned short)(x0+j), (unsigned short)(y0+k), i, ((double) rand() / (RAND_MAX))});
            }
        }
    }
    printf("Vector size: %4.2f GB \n", this->dd.size()*sizeof(data) * 1E-9);

}

void sparsearray3d::sumData()
{
    // image is allocated as Nx rows of Ny columns, so index it as [x][y];
    // note that distinct z-slices can hit the same (x, y) pixel, so these
    // concurrent updates can race (a workaround is sketched further down)
    #pragma omp parallel for
    for(std::vector<data>::iterator it = this->dd.begin(); it < this->dd.end(); ++it)
    {
        this->image[it->x][it->y] += it->value * this->weights[it->z];
    }
}


main.cpp

#include <iostream>
#include "sparsearray3d.h"
#include <sys/time.h>

using namespace std;

int main()
{
    struct timeval start, end;

    sparsearray3d sa;
    gettimeofday(&start, NULL);
    sa.createRandomData(4096, 4096, 2000000, 4, 16);
    gettimeofday(&end, NULL);
    double delta = ((end.tv_sec - start.tv_sec) * 1000000u +
                    end.tv_usec - start.tv_usec) / 1.e6;

    cout << "random array generation: " << delta << endl;
    gettimeofday(&start, NULL);
    sa.sumData();
    gettimeofday(&end, NULL);
    delta = ((end.tv_sec - start.tv_sec) * 1000000u +
             end.tv_usec - start.tv_usec) / 1.e6;
    cout << "array addition: " << delta << endl;
    return 0;
}


This already works quite well: the example above runs in ~0.6 s here. The first thing I am wondering is why the #pragma omp parallel for loop only gives a 2x speedup even though it uses 4 cores.
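One thing I suspect is that different z-slices can write the same (x, y) pixel at the same time. Below is a sketch of a race-free variant I am experimenting with (sumDataThreadLocal is a made-up free function on top of my class, not part of the code above):

#include "sparsearray3d.h"

// sketch: per-thread accumulation buffers avoid concurrent writes to the
// shared image; only the final merge is serialized
void sumDataThreadLocal(sparsearray3d& sa)
{
    #pragma omp parallel
    {
        std::vector<std::vector<double> > local(sa.Nx, std::vector<double>(sa.Ny, 0.));

        #pragma omp for nowait
        for(size_t i = 0; i < sa.dd.size(); ++i)
            local[sa.dd[i].x][sa.dd[i].y] += sa.dd[i].value * sa.weights[sa.dd[i].z];

        // serialize only the merge of the per-thread buffers
        #pragma omp critical
        for(int x = 0; x < sa.Nx; ++x)
            for(int y = 0; y < sa.Ny; ++y)
                sa.image[x][y] += local[x][y];
    }
}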

The problem seems to lend itself very well to massive parallelization. Can CUDA/OpenCL help here? However, I read somewhere that matrix addition is not very efficient with CUDA/OpenCL. (I don't have an NVIDIA card, though.) Alternatively, I read a little about graphs and their relationship to matrices. Could this problem be handled with some graph algorithms?

EDIT: I gave Eigen a shot; however, I was unable to create a large number of matrices. The following code requires much more memory than mine (and fails for N ~ 20,000,000 because I run out of memory). Not sure whether I am doing this correctly, but this is how I understood it from the Eigen documentation.

#include <vector>
#include <eigen3/Eigen/Sparse>

int main()
{
    int N = 100000;
    std::vector<Eigen::SparseMatrix<double> > data;

    data.resize(N);

    for(int i = 0; i < N; ++i)
    {
        data[i].resize(4096, 4096);
        data[i].reserve(4 * 16);
    }

    return 0;
}


Also, summing sparse matrices like this was much slower than my code:

Eigen::SparseMatrix<double> sum(4096, 4096);
sum.reserve(4096 * 4096);
for(int i = 0; i < N; ++i)
    sum += data[i];
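One thing I figured out since: each Eigen::SparseMatrix apparently keeps an index array with one entry per outer dimension (4097 ints for a 4096-column matrix, i.e. roughly 16 KB even when empty), which would explain why millions of tiny matrices exhaust memory. If I read the documentation correctly - this is just a sketch, not verified at full scale - a better fit would be one big sparse matrix with one row per (x, y) pixel and one column per z slice, so the weighted sum becomes a single sparse matrix times dense vector product:

#include <vector>
#include <eigen3/Eigen/Sparse>

int main()
{
    const int Nx = 4096, Ny = 4096;
    const int Nz = 1000;                              // small size for illustration

    // one row per (x, y) pixel, one column per z slice
    Eigen::SparseMatrix<double> A(Nx * Ny, Nz);

    std::vector<Eigen::Triplet<double> > triplets;
    // fill with the cube entries, e.g.:
    // triplets.push_back(Eigen::Triplet<double>(x * Ny + y, z, value));
    A.setFromTriplets(triplets.begin(), triplets.end());

    Eigen::VectorXd w = Eigen::VectorXd::Random(Nz);  // the weights
    Eigen::VectorXd image = A * w;                    // weighted sum of all slices
    return 0;
}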

1 answer


What you are dealing with is a fairly common linear algebra kernel, essentially a matrix-vector product (relevant abbreviations to search for are "DGEMM" or "SpMV"). So your first option is to try optimized parallel linear algebra libraries such as Intel MKL; see also: Parallel linear algebra for multicore systems.
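For example, with MKL's inspector-executor sparse BLAS the whole reduction collapses to a single SpMV call. The sketch below is written from memory and stores the cube as one CSR matrix of shape (Nx*Ny) x Nz (row index = x*Ny + y, column index = z); check the MKL documentation for the exact API before relying on it:

#include <mkl.h>   // Intel MKL inspector-executor sparse BLAS

int main()
{
    // toy CSR matrix: 2 pixels (rows) x 2 slices (columns), one non-zero
    // per row; the real code would hold the whole flattened cube instead
    MKL_INT rowPtr[] = {0, 1, 2};
    MKL_INT cols[]   = {0, 1};
    double  vals[]   = {1.0, 2.0};
    double  w[]      = {0.5, 0.25};   // the per-slice weights
    double  y[2];                     // the resulting image, flattened

    sparse_matrix_t A;
    mkl_sparse_d_create_csr(&A, SPARSE_INDEX_BASE_ZERO, 2, 2,
                            rowPtr, rowPtr + 1, cols, vals);

    matrix_descr descr;
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;

    // y = 1.0 * A * w + 0.0 * y, i.e. the whole weighted sum in one call
    mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, A, descr,
                    w, 0.0, y);
    mkl_sparse_destroy(A);
    return 0;
}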

Secondly, if you want to optimize and parallelize the algorithm yourself, you could study some relevant articles first, for example:
- http://cscads.rice.edu/publications/pdfs/Williams-OptMultiCore-SC07.pdf
- http://pcl.intel-research.net/publications/ics26-liuPS.pdf

Third, a few things to keep in mind if you don't want to, or don't have time to, go through books, articles, or libraries, but want to experiment with it all yourself:

  • What you have here is actually fine-grained parallelism (each iteration is too small and too fast), so the per-iteration threading and scheduling overhead is relatively hard to amortize.
  • For fine-grained parallelism it is better to try SIMD parallelism (especially since you have potential for FMA == "fused multiply-add" here). OpenMP 4.x (supported by ICC and GCC 4.9+) provides #pragma omp simd, which works very well for straightforward loop vectorization.
  • But the better choice would be to simplify/rearrange your example into an explicit loop nest, with an outer loop over x (#pragma omp for) followed by an inner loop over y (#pragma omp simd). Then you will have two levels of parallelism, which will put you in much better shape; see the sketch after this list.
  • Once you've successfully done all of the above, you will quickly find yourself limited by memory latency, memory bandwidth, or irregular access patterns - the three sides of the "memory wall". In your current implementation I would expect you to be bandwidth-bound, but truly optimized sparse matrix products are usually limited by various combinations of all three. This is where reading some of the articles becomes necessary, and topics like "loop blocking", "array of structures vs. structure of arrays", and "prefetching and streaming" become very relevant.
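To make the loop-nest point concrete: since every z-slice only contributes one small dense bwidthX x bwidthY tile, one possible restructuring looks like the sketch below (block, sumBlocks and the field names are made up for illustration, and the data is assumed to be regrouped into such tiles beforehand):

#include <vector>

// hypothetical regrouped storage: one dense tile per z-slice
struct block {
    int x0, y0;              // origin of the tile in the image
    double w;                // weight of this slice
    std::vector<double> v;   // bwidthX * bwidthY values, row-major
};

void sumBlocks(const std::vector<block>& blocks,
               std::vector<std::vector<double> >& image,
               int bwidthX, int bwidthY)
{
    #pragma omp parallel for            // coarse grain: one tile per iteration
    for(size_t b = 0; b < blocks.size(); ++b)
        for(int j = 0; j < bwidthX; ++j)
        {
            double* row = &image[blocks[b].x0 + j][blocks[b].y0];
            const double* src = &blocks[b].v[j * bwidthY];
            #pragma omp simd            // fine grain: contiguous, FMA-friendly
            for(int k = 0; k < bwidthY; ++k)
                row[k] += blocks[b].w * src[k];
        }
}

Note that tiles from different slices can still overlap in the image, so different threads can still write the same rows; per-thread buffers (as in the question's sketch) or atomic updates are still needed for full correctness. The point here is only the two-level #pragma omp for + #pragma omp simd structure.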

Finally (and this should probably be the first item), I don't think you actually need a "sparse matrix" data structure here, because you are not really trying to avoid "zeros" (they still end up in your matrix/vector). I don't know whether you are tied to your current data structures (perhaps your example just simplifies the real code).
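To illustrate the data-layout point with the "structure of arrays" idea from above: splitting the std::vector<data> into one contiguous array per field (hypothetical soa type and sumSoA function below) lets all input streams flow through the hardware prefetcher, even though the scattered store into image remains the hard part:

#include <vector>

// structure-of-arrays variant of the 'data' records: one contiguous
// array per field instead of one array of 16-byte structs
struct soa {
    std::vector<unsigned short> x, y;
    std::vector<int> z;
    std::vector<double> value;
};

void sumSoA(const soa& s, const std::vector<double>& weights,
            std::vector<std::vector<double> >& image)
{
    // all input arrays are now contiguous in memory; only the image
    // update is an indirect (scatter) store
    for(size_t i = 0; i < s.value.size(); ++i)
        image[s.x[i]][s.y[i]] += s.value[i] * weights[s.z[i]];
}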

Parallel linear algebra is a huge topic in computer science; every year specialists produce new research, and every new generation of processors (and sometimes coprocessors) is better optimized for fast parallel linear algebra kernels. So there is a lot to learn.
