How to compare the performance of a KMeans model with that of GaussianMixture and LDA models in pyspark?

I am working on the iris dataset using the pyspark.ml.clustering library, to understand the basics of pyspark and build a clustering model.

My Spark version is 2.1.1 and I have Hadoop 2.7.

I know that KMeans and BisectingKMeans have a computeCost() method, which evaluates the model based on the sum of the squared distances between the input points and their respective cluster centers.
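For example, in Spark 2.x this works roughly as follows (a minimal sketch with made-up toy data, assuming an existing SparkSession named spark):

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# toy data, invented for illustration; `spark` is an existing SparkSession
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans_model = KMeans(k=2, seed=1).fit(df)
print(kmeans_model.computeCost(df))  # sum of squared distances to the nearest center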

Is there a way to compare the performance of a KMeans model with that of a GaussianMixture and an LDA model on the iris dataset, in order to select the best model type (KMeans, GaussianMixture, or LDA)?

1 answer


Short answer: no

Long answer:

You are trying to compare apples to oranges here: Gaussian mixture and LDA models lack the concept of a cluster center; hence, it is not strange that a function similar to computeCost() does not exist for them.

This is easy to see if you look at the actual output of a Gaussian mixture model; adapting an example from the documentation:

from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors

# assumes an existing SparkSession named `spark`
data = [(Vectors.dense([-0.1, -0.05]),),
        (Vectors.dense([-0.01, -0.1]),),
        (Vectors.dense([0.9, 0.8]),),
        (Vectors.dense([0.75, 0.935]),),
        (Vectors.dense([-0.83, -0.68]),),
        (Vectors.dense([-0.91, -0.76]),)]

df = spark.createDataFrame(data, ["features"])
gm = GaussianMixture(k=3, tol=0.0001, maxIter=10, seed=10)  # here we ask for k=3 gaussians
model = gm.fit(df)

transformed_df = model.transform(df)  # assign data to gaussian components ("clusters")
transformed_df.collect()

# Here is the output:

[Row(features=DenseVector([-0.1, -0.05]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])), 
 Row(features=DenseVector([-0.01, -0.1]), prediction=2, probability=DenseVector([0.0, 0.0007, 0.9993])),
 Row(features=DenseVector([0.9, 0.8]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])), 
 Row(features=DenseVector([0.75, 0.935]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])), 
 Row(features=DenseVector([-0.83, -0.68]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])), 
 Row(features=DenseVector([-0.91, -0.76]), prediction=2, probability=DenseVector([0.0, 0.0006, 0.9994]))]


The actual output of Gaussian mixture "clustering" is the third field above, i.e. the probability column: this is a 3-dimensional vector (because we asked for k=3) showing the "degree" to which a particular data point belongs to each of the 3 clusters. In general, the vector components will be less than 1.0, which is why Gaussian mixtures are a classic example of "soft clustering" (data points belong to more than one cluster, to some degree to each). Now, some implementations (including Spark's here) go a step further and also assign a "hard" cluster membership (the prediction field above), simply by taking the index of the maximum component in probability; but this is just an add-on.
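To verify this on the output above, here is a quick check (my addition; assumes numpy is available):

import numpy as np

# the "hard" prediction is simply the index of the largest component
# of the soft `probability` vector
for row in transformed_df.collect():
    assert row.prediction == int(np.argmax(row.probability.toArray()))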

What about the fitted model itself?



model.gaussiansDF.show()

+--------------------+--------------------+ 
|                mean|                 cov| 
+--------------------+--------------------+ 
|[0.82500000000150...|0.005625000000006...|  
|[-0.4649980711427...|0.133224999996279...|
|[-0.4600024262536...|0.202493122264028...| 
+--------------------+--------------------+


Again, it is easy to see that there are no cluster centers, only the parameters (mean and covariance) of our k=3 gaussians.
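If you need these parameters programmatically, they are exposed on the fitted model roughly like this (a sketch; weights and gaussiansDF are the actual attributes, the printing is just for illustration):

print(model.weights)          # mixing weights of the k=3 gaussian components

first = model.gaussiansDF.first()
print(first["mean"])          # mean vector of the first component
print(first["cov"])           # covariance matrix of the first component

Notice that, unlike a fitted KMeansModel, there is no clusterCenters() method here.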

Similar arguments hold for the LDA case, as sketched below.
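For completeness, a minimal LDA sketch (my own toy example; LDA expects non-negative term-count vectors, so the data above would not do):

from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

# a tiny made-up "corpus" of term-count vectors
docs = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 0.0]),),
     (Vectors.dense([0.0, 3.0, 1.0]),),
     (Vectors.dense([2.0, 0.0, 4.0]),)],
    ["features"])

lda_model = LDA(k=2, maxIter=10, seed=10).fit(docs)
lda_model.describeTopics().show()     # per-topic term weights, not cluster centers
print(lda_model.logPerplexity(docs))  # LDA has its own metrics, not comparable with computeCost()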

Admittedly, the Spark MLlib Clustering Guide claims that the prediction column contains the "predicted cluster center", but this terminology is rather unfortunate, to say the least (to put it bluntly, it is just plain wrong).

Needless to say, the above discussion comes directly from the underlying concepts and theory behind Gaussian mixture models, and it is not particular to the Spark implementation...

Functions such as computeCost() are merely useful for evaluating different realizations of K-Means (i.e. different initializations and/or random seeds), since the algorithm may converge to a non-optimal local minimum.
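For example (a sketch, reusing the df defined above):

from pyspark.ml.clustering import KMeans

# fit several K-Means realizations that differ only in the random seed,
# then keep the one with the lowest cost
costs = {seed: KMeans(k=3, seed=seed).fit(df).computeCost(df) for seed in (1, 10, 42)}
best_seed = min(costs, key=costs.get)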
