Vecnising code for calculating (squared) Mahalanobi Deviation

EDIT 2: This post seems to have been moved from CrossValidated to StackOverflow due to being mostly programming related, but that means MathJax no longer works. Hopefully this is still readable.

Let's say I want to calculate the squared Mahalanobis distance between two vectors x

and y

with a covariance matrix S

. This is a fairly simple function defined by

M2(x, y; S) = (x - y)^T * S^-1 * (x - y)


With the python package numpy

I can do it like

# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 =


or in R as

diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff


In my case, however, I am given

  • m

    on the n


  • n

    -dimensional vector mu

  • n

    by n

    covariance matrixS

and we want to find a m

-dimensional vector d

such that

d_i = M2(x_i, mu; S)  ( i = 1 .. m )


where x_i

is the i

th row x


It's not hard to accomplish with a simple loop in python:

d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] =


Of course, given that the outer loop happens in python and not in the native code in the library numpy

, this means it's not as fast as it could be. $ n $ and $ m $ are about 3-4 and several hundred thousand respectively, and I do this several times in an interactive program, so speeding up will be very helpful.

Mathematically, the only way I have been able to formulate this using base matrix operations is

d = diag( X' * S^-1 * X'^T )



 x'_i = x_i - mu


which is easy to write a vectorized version, but this is unfortunately outweighed by the inefficiency of calculating a matrix of 10 billion plus elements and only with a diagonal ... I believe this operation should be easily expressed using Einstein's notation, and hence hopefully quickly evaluate with the function numpy


, but I haven't even begun to figure out how this black magic works.

So, I would like to know: is there a better way to formulate this operation mathematically (in terms of simple matrix operations) or can anyone suggest some good vector code (python or R) that does this efficiently?

BONUS QUESTION, for the brave

I really don't want to do it once, I want to do it k

~ 100 times. Given:

  • m

    via n


  • k

    on n


  • Set n

    on the n

    covariance matrices, each of which is denoted by S_j

    ( j = 1..k


Find a matrix m

by k


such that

D_i,j = M(x_i, u_j; S_j)


Where i = 1..m

, j = 1..k

, x_i

it is i

th row x

, and u_j

is j

th row U


Ie, vectorize the following code:

# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] =[j, :, :]).dot(diff)



source to share

2 answers

First, it looks like you are getting S and then inverting it. You don't have to do this; it is slow and numerically imprecise. Instead, you should get the Cholesky coefficient L from S, so S = LL ^ T; then

M^2(x, y; L L^T)
  = (x - y)^T (L L^T)^-1 (x - y)
  = (x - y)^T L^-T L^-1 (x - y)
  = || L^-1 (x - y) ||^2,


and since L is triangular L ^ -1 (x - y) can be computed efficiently.

As it turns out, scipy.linalg.solve_triangular

will happily make a bunch of them all at once if you change it correctly:

L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)


Breaking it down a bit, y[i, j]

is the i-th component of L ^ -1 (X_j - \ mu). Then the call einsum


d_j = \sum_i y_{ij} y_{ij}
    = \sum_i y_{ij}^2
    = || y_j ||^2,


as we need.

Unfortunately solve_triangular

won't vectorize on its first argument, so you should probably just loop. If k is only around 100, this won't be a major problem.

If you actually got S ^ -1 and not S, then you can actually do it einsum

more directly. Since S is quite small in your case, it is also possible that actually inverting the matrix and then executing will be faster. However, once n is a non-trivial size, you are throwing away a lot of numerical precision by doing this.

To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus by writing S_j ^ -1 = T_j for notation convenience:

D_{ij} = M^2(x_i, u_j; S_j)
  = (x_i - u_j)^T T_j (x_i - u_j)
  = \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
  = \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
  = \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})


So, if we create arrays of X

shape (m, n)

, U

shape (k, n)

and T

shape (k, n, n)

, then we can write this as

diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)


where diff[j, i, k] = X_[i, k] - U[j, k]




Dougal posted this with a great and detailed answer, but thought I'd share a small change that I've found improves efficiency if anyone tries to implement this. Straight to the point:

Dougal's method was as follows:

def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)


The mathematically equivalent option I tried is

def mahalanobis2_2(X, mu, sigma):

    # Cholesky decomposition of inverse of covariance matrix
    # (Doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))

    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)

    # Same as above, but note different index at end because the matrix
    # y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)


Compare both versions head-to-head 20 times using the same random inputs and record the time (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3), I get:

Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37


This is about 30% speedup for the 2nd version. I'm basically going to run this in 3 or 4 dimensions, but to see how it scales up I tried X as 1,000,000 x 100 and got

Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837


which is about the same improvement.

I mentioned this in a comment on Dougal, but adding some extra visibility here:

The first pair of methods above takes one center point mu

and covariance matrix sigma

and calculates the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets mu

and sigma

and output a two dimensional matrix. The set of methods described above can be used to achieve this with a simple loop, but Dougal also posted a smarter example using einsum


I decided to compare these methods with each other, using them to solve the following problem: Given the k


-dimensional normal distributions (with centers stored in rows k

on d

matrix U

and covariance matrices in the last two dimensions k

on an d

array d


), find the density at n

points stored in rows n

on d

matrix X


The density of the multivariate normal distribution is a function of the square of the Mahalanobis distance from point to mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf

a reference. I ran all three methods against each other 10 times, using the same random parameters each time with d=3, k=96, n=5e5

. Here are the results, in points / sec:

[Method]: (min/max/avg)
Scipy:                      1.18e5/1.29e5/1.22e5
Fancy 1:                    1.41e5/1.53e5/1.48e5
Fancy 2:                    8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4


where Fancy 1

is the best of the two above methods and Fancy2

is Dougal's second solution. Since the Fancy 2

inverse of all covariance matrices needs to be computed, I also tried the "cheat" version where it was passed as a parameter, but it doesn't seem to make a difference. I had planned to include a non-vectorized implementation, but it was so slow that it would have taken all day.

What we can take away from this is that using Dougal's first method is about 20% faster than Scipy does. Unfortunately, despite his skill, the second method is about 60% of the first. There may be some other optimizations that can be done, but this is already fast enough for me.

I've also tested how this scales with higher dimensionality. Using d=100, k=96, n=1e4


Scipy:                      7.81e3/7.91e3/7.86e3
Fancy 1:                    1.03e4/1.15e4/1.08e4
Fancy 2:                    3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3


Fancy 1

seems to have an even bigger advantage this time around. It's also worth noting that Scipy threw LinAlgError

8/10 times, probably because some of my randomly generated 100x100 covariance matrices were close to singular (which could mean the other two methods aren't that numerically stable, I haven't actually tested the results).



All Articles