How to increase the performance of Sklearn GMM ()?
I am using Sklearn to estimate a Gaussian Mixture Model (GMM) for some data.
After fitting, I have many query points. I would like to get their probabilities of belonging to each of the estimated Gaussian components.
Below is the code. However, the gmm_sk.predict_proba(query_points) part is very slow, as I need to run it multiple times on 100000 sample sets, where each set contains 1000 points.
I assume this is because it runs sequentially. Is there a way to make it parallel? Or any other way to make it faster? Maybe on a GPU using TensorFlow?
I saw that TensorFlow has its own GMM algorithm, but it was very difficult to implement.
Here is the code I wrote:
import time
import numpy as np
from sklearn.mixture import GaussianMixture
n_gaussians = 1000
covariance_type = 'diag'
# Random 3-D training data and one set of 1000 query points
points = np.array(np.random.rand(10000, 3), dtype=np.float32)
query_points = np.array(np.random.rand(1000, 3), dtype=np.float32)
start = time.time()
# GMM with sklearn
gmm_sk = GaussianMixture(n_components=n_gaussians, covariance_type=covariance_type)
gmm_sk.fit(points)
mid_t = time.time()
elapsed = mid_t - start
print("learning took " + str(elapsed))
# Repeated prediction: this loop is the slow part
temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))
end_t = time.time() - mid_t
print("predictions took " + str(end_t))
I solved it using multiprocessing. I just replaced
temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))
with
import multiprocessing as mp
query_points = query_points.tolist()
parallel = mp.Pool()
fv = parallel.map(par_gmm, query_points)
parallel.close()
parallel.join()
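The definition of par_gmm is not shown above. As a rough sketch of the same idea (split the work and evaluate the pieces in parallel), the following uses joblib threads rather than mp.Pool, so the fitted model does not have to be sent to worker processes. The helper name predict_proba_parallel, the small model, and the array sizes are assumptions made for illustration only. Whether threads give a real speed-up depends on how much of the computation runs outside the Python GIL, so it is worth timing on your own data.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.mixture import GaussianMixture


def predict_proba_parallel(gmm, query_sets, n_jobs=2):
    """Hypothetical helper: call gmm.predict_proba on each query set in a thread pool.

    query_sets has shape (n_sets, n_points, n_features).
    Returns an array of shape (n_sets, n_points, n_components).
    """
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(gmm.predict_proba)(q) for q in query_sets
    )
    return np.stack(results)


# Small synthetic problem (sizes are assumptions, much smaller than in the question)
rng = np.random.default_rng(0)
points = rng.random((2000, 3)).astype(np.float32)
query_sets = rng.random((8, 100, 3)).astype(np.float32)

gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
gmm.fit(points)

parallel_out = predict_proba_parallel(gmm, query_sets)
# Serial reference, to confirm the parallel version returns the same values
serial_out = np.stack([gmm.predict_proba(q) for q in query_sets])
```

The serial loop is kept only as a correctness check; in real use you would call the helper alone and tune n_jobs to the number of cores.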
You can speed up the process if you fit with a diagonal or spherical covariance matrix instead of the full one.
Set covariance_type='diag' or covariance_type='spherical' when constructing GaussianMixture.
Also, try reducing the number of Gaussian components.
Keep in mind that both changes may affect the results, but I see no other way to speed up the process.
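To see how these two settings affect prediction time on your machine, a minimal timing sketch could look like this. The data sizes and the component counts (20 and 5) are arbitrary assumptions chosen so the script runs quickly; they are not recommendations, and the measured times will vary between runs.

```python
import time

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
points = rng.random((2000, 3))
query_points = rng.random((1000, 3))

# (n_components, covariance_type) combinations to compare
settings = [(20, "full"), (20, "diag"), (20, "spherical"), (5, "diag")]

timings = {}
for n_components, covariance_type in settings:
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        random_state=0,
    )
    gmm.fit(points)

    # Time only the prediction step, which is the bottleneck in the question
    start = time.perf_counter()
    proba = gmm.predict_proba(query_points)
    timings[(n_components, covariance_type)] = time.perf_counter() - start

for (n_components, covariance_type), seconds in timings.items():
    print(f"{n_components:3d} components, {covariance_type:9s}: {seconds:.5f} s")
```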
I see that your GMM has 1000 Gaussian components, which I think is very large considering that the data dimension is relatively low (3). This is probably why it is slow: every query point has to be evaluated against 1000 individual Gaussians. If your sample count is low, such a model is also very prone to overfitting. You can try fewer components, which will naturally be faster and will most likely generalize better.
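One common way to choose a smaller number of components is to compare candidate models by BIC, which GaussianMixture exposes through its bic method. The sketch below assumes synthetic data made of three well-separated blobs and an arbitrary candidate list; both are illustrative assumptions, not taken from the question.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 3-D data: three well-separated blobs (an assumption for illustration)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(300, 3)) for center in centers]
)

candidates = [1, 2, 3, 5, 10, 20]
bic_scores = {}
for n_components in candidates:
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="diag", random_state=0
    )
    gmm.fit(points)
    # Lower BIC is better: it rewards fit quality and penalizes extra parameters
    bic_scores[n_components] = gmm.bic(points)

best_n = min(bic_scores, key=bic_scores.get)
print("BIC per candidate:", bic_scores)
print("selected number of components:", best_n)
```

BIC is only one possible criterion; a held-out log-likelihood via gmm.score on validation data is another option if overfitting is the main concern.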