No performance difference Eigen AVX vs SSE for single precision matrix operations?
In my project I am using the Eigen 3.3 library to do calculations with 6x6 matrices. I decided to investigate whether the AVX instructions actually give me a speedup over SSE. My processor supports both sets:
model name : Intel(R) Xeon(R) CPU E5-1607 v2 @ 3.00GHz
flags : ... sse sse2 ... ssse3 ... sse4_1 sse4_2 ... avx ...
So I am compiling the small test below with gcc 4.8, using two different sets of flags:
$ g++ test-eigen.cxx -o test-eigen -march=native -O2 -mavx
$ g++ test-eigen.cxx -o test-eigen -march=native -O2 -mno-avx
I confirmed that the second case, -mno-avx, did not produce any instructions using ymm registers. However, the two cases give me very similar results of around 520 ms, measured with perf.
Here is the test-eigen.cxx program (it inverts the sum of two matrices, to stay close to the actual problem I am working on):
#define NDEBUG
#include <iostream>
#include "Eigen/Dense"

using namespace Eigen;

int main()
{
    typedef Matrix<float, 6, 6> MyMatrix_t;

    MyMatrix_t A = MyMatrix_t::Random();
    MyMatrix_t B = MyMatrix_t::Random();
    MyMatrix_t C = MyMatrix_t::Zero();
    MyMatrix_t D = MyMatrix_t::Zero();
    MyMatrix_t E = MyMatrix_t::Constant(0.001);

    // Make A and B symmetric positive definite matrices
    A.diagonal() = A.diagonal().cwiseAbs();
    A.noalias() = MyMatrix_t(A.triangularView<Lower>()) * MyMatrix_t(A.triangularView<Lower>()).transpose();
    B.diagonal() = B.diagonal().cwiseAbs();
    B.noalias() = MyMatrix_t(B.triangularView<Lower>()) * MyMatrix_t(B.triangularView<Lower>()).transpose();

    for (int i = 0; i < 1000000; i++)
    {
        // Calculate C = (A + B)^-1
        C = (A + B).llt().solve(MyMatrix_t::Identity());
        D += C;

        // Somehow modify A and B so they remain symmetric
        A += B;
        B += E;
    }

    std::cout << D << "\n";
    return 0;
}
Should I really expect performance improvements with AVX in Eigen? Or am I missing something in the compiler flags or in my own config? Perhaps my test is not appropriate to demonstrate the difference, but I cannot see what could be wrong with it.
You are using matrices that are too small to benefit from AVX: in single precision, AVX works on packets of 8 scalars at once. With 6x6 matrices, AVX can only be used for purely coefficient-wise operations such as A = B + C, because these can be treated as operations on 1D vectors of size 36, which is greater than 8. In your case, these kinds of operations are negligible compared to the cost of the Cholesky factorization and solve.
To see a difference, switch to MatrixXf matrices of size 100x100 or larger.