Vecnising code for calculating (squared) Mahalanobi Deviation
EDIT 2: This post seems to have been moved from CrossValidated to StackOverflow due to being mostly programming related, but that means MathJax no longer works. Hopefully this is still readable.
Let's say I want to calculate the squared Mahalanobis distance between two vectors x
and y
with a covariance matrix S
. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With the python package numpy
I can do it like
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, however, I am given
-
m
on then
matrixx
-
n
-dimensional vectormu
-
n
byn
covariance matrixS
and we want to find a m
-dimensional vector d
such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i
is the i
th row x
.
It's not hard to accomplish with a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
diff = x[i,:] - mu
d[i] = diff.T.dot(s_inv).dot(diff)
Of course, given that the outer loop happens in python and not in the native code in the library numpy
, this means it's not as fast as it could be. $ n $ and $ m $ are about 3-4 and several hundred thousand respectively, and I do this several times in an interactive program, so speeding up will be very helpful.
Mathematically, the only way I have been able to formulate this using base matrix operations is
d = diag( X' * S^-1 * X'^T )
Where
x'_i = x_i - mu
which is easy to write a vectorized version, but this is unfortunately outweighed by the inefficiency of calculating a matrix of 10 billion plus elements and only with a diagonal ... I believe this operation should be easily expressed using Einstein's notation, and hence hopefully quickly evaluate with the function numpy
einsum
, but I haven't even begun to figure out how this black magic works.
So, I would like to know: is there a better way to formulate this operation mathematically (in terms of simple matrix operations) or can anyone suggest some good vector code (python or R) that does this efficiently?
BONUS QUESTION, for the brave
I really don't want to do it once, I want to do it k
~ 100 times. Given:
-
m
vian
matrixx
-
k
onn
matrixU
-
Set
n
on then
covariance matrices, each of which is denoted byS_j
(j = 1..k
)
Find a matrix m
by k
d
such that
D_i,j = M(x_i, u_j; S_j)
Where i = 1..m
, j = 1..k
, x_i
it is i
th row x
, and u_j
is j
th row U
.
Ie, vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
for i in range(m):
diff = x[i, :] - u[j, :]
d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
source to share
First, it looks like you are getting S and then inverting it. You don't have to do this; it is slow and numerically imprecise. Instead, you should get the Cholesky coefficient L from S, so S = LL ^ T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L ^ -1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular
will happily make a bunch of them all at once if you change it correctly:
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking it down a bit, y[i, j]
is the i-th component of L ^ -1 (X_j - \ mu). Then the call einsum
does
d_j = \sum_i y_{ij} y_{ij} = \sum_i y_{ij}^2 = || y_j ||^2,
as we need.
Unfortunately solve_triangular
won't vectorize on its first argument, so you should probably just loop. If k is only around 100, this won't be a major problem.
If you actually got S ^ -1 and not S, then you can actually do it einsum
more directly. Since S is quite small in your case, it is also possible that actually inverting the matrix and then executing will be faster. However, once n is a non-trivial size, you are throwing away a lot of numerical precision by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus by writing S_j ^ -1 = T_j for notation convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we create arrays of X
shape (m, n)
, U
shape (k, n)
and T
shape (k, n, n)
, then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X_[i, k] - U[j, k]
.
source to share
Dougal posted this with a great and detailed answer, but thought I'd share a small change that I've found improves efficiency if anyone tries to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
L = np.linalg.cholesky(sigma)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
return np.einsum('ij,ij->j', y, y)
The mathematically equivalent option I tried is
def mahalanobis2_2(X, mu, sigma):
# Cholesky decomposition of inverse of covariance matrix
# (Doing this in either order should be equivalent)
linv = np.linalg.cholesky(np.linalg.inv(sigma))
# Just do regular matrix multiplication with this matrix
y = (X - mu[np.newaxis,:]).dot(linv)
# Same as above, but note different index at end because the matrix
# y is transposed here compared to above
return np.einsum('ij,ij->i', y, y)
Compare both versions head-to-head 20 times using the same random inputs and record the time (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3), I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
This is about 30% speedup for the 2nd version. I'm basically going to run this in 3 or 4 dimensions, but to see how it scales up I tried X as 1,000,000 x 100 and got
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal, but adding some extra visibility here:
The first pair of methods above takes one center point mu
and covariance matrix sigma
and calculates the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets mu
and sigma
and output a two dimensional matrix. The set of methods described above can be used to achieve this with a simple loop, but Dougal also posted a smarter example using einsum
.
I decided to compare these methods with each other, using them to solve the following problem: Given the k
d
-dimensional normal distributions (with centers stored in rows k
on d
matrix U
and covariance matrices in the last two dimensions k
on an d
array d
S
), find the density at n
points stored in rows n
on d
matrix X
.
The density of the multivariate normal distribution is a function of the square of the Mahalanobis distance from point to mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf
a reference. I ran all three methods against each other 10 times, using the same random parameters each time with d=3, k=96, n=5e5
. Here are the results, in points / sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1
is the best of the two above methods and Fancy2
is Dougal's second solution. Since the Fancy 2
inverse of all covariance matrices needs to be computed, I also tried the "cheat" version where it was passed as a parameter, but it doesn't seem to make a difference. I had planned to include a non-vectorized implementation, but it was so slow that it would have taken all day.
What we can take away from this is that using Dougal's first method is about 20% faster than Scipy does. Unfortunately, despite his skill, the second method is about 60% of the first. There may be some other optimizations that can be done, but this is already fast enough for me.
I've also tested how this scales with higher dimensionality. Using d=100, k=96, n=1e4
:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1
seems to have an even bigger advantage this time around. It's also worth noting that Scipy threw LinAlgError
8/10 times, probably because some of my randomly generated 100x100 covariance matrices were close to singular (which could mean the other two methods aren't that numerically stable, I haven't actually tested the results).
source to share