DPGMM clusters all values into a single cluster

So, I turned my corpus into a nice word2vec matrix. This matrix is a floating-point matrix with both negative and positive values.

I can't seem to get the infinite Dirichlet process mixture (DPGMM) to give me any coherent answer.

An example output (for 2 iterations) looks like this:

original word2vec matrix:
[[-0.09597077 -0.1617426  -0.01935256 ...,  0.03843787 -0.11019679
   0.02837373]
 [-0.20119116  0.09759717  0.1382935  ..., -0.08172804 -0.14392921
  -0.08032629]
 [-0.04258473  0.03070175  0.11503845 ..., -0.10350088 -0.18130976
  -0.02993774]
 ..., 
 [-0.08478324 -0.01961064  0.02305113 ..., -0.01231162 -0.10988192
   0.00473828]
 [ 0.13998444  0.05631495  0.00559074 ...,  0.05252389 -0.14202785
  -0.03951728]
 [-0.02888418 -0.0327519  -0.09636743 ...,  0.10880557 -0.08889513
  -0.08584201]]
Running DGPMM for 20 clusters of shape (4480, 100)
Bound after updating        z: -1935576384.727921
Bound after updating    gamma: -1935354454.981427
Bound after updating       mu: -1935354033.389434
Bound after updating  a and b: -inf
Cluster proportions: [  4.48098985e+03   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00]
covariance_type: full
Bound after updating        z: -inf
Bound after updating    gamma: -inf
Bound after updating       mu: -inf
Bound after updating  a and b: -inf
Cluster proportions: [  4.48098985e+03   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00]


As you can see, z, gamma and mu all blow up, and as a result the model collapses to a single cluster, which is not useful. I tried tinkering with the alpha parameter of the DPGMM, but it doesn't really change anything.

What I am trying to do is automatically cluster words that are close in meaning, using an offline clustering method. K-means requires a "K", which I don't want to have to provide.
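For reference, a minimal sketch of how this kind of K-free clustering can be set up with scikit-learn's `BayesianGaussianMixture` (the current replacement for the old `DPGMM` class). The matrix here is random stand-in data, not a real embedding; the parameter values are illustrative assumptions:

```python
# Minimal sketch: Dirichlet-process mixture where n_components is only
# an upper bound, so the effective number of clusters is inferred.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # stand-in for a (n_words, n_dims) word2vec matrix

model = BayesianGaussianMixture(
    n_components=20,                                   # truncation level, not a fixed K
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                    # the "alpha" knob
    covariance_type="diag",                            # full covariances are fragile in high dims
    max_iter=200,
    random_state=0,
)
labels = model.fit_predict(X)
print("effective clusters:", len(np.unique(labels)))
```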



1 answer


There may be some hidden numerical problems here. The culprit is the high dimensionality of your dataset: in 100 dimensions, the per-point Gaussian densities in the mixture become infinitesimally small, which makes every point look extremely unlikely under the model. At some point a value underflows to -inf, and from then on the optimization fails.
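A quick illustration of this underflow (my own example, not the asker's code): in 100 dimensions the raw Gaussian density of a moderately distant point is exactly 0.0 in float64, while its log-density is still a perfectly finite number.

```python
# Raw densities underflow in high dimensions; log-densities stay representable.
import numpy as np
from scipy.stats import multivariate_normal

d = 100
dist = multivariate_normal(mean=np.zeros(d))  # identity covariance
x = np.full(d, 4.0)                           # a point 4 sigma out in every dimension

pdf = dist.pdf(x)        # exp(-891.89...) is below float64's smallest positive value -> 0.0
logpdf = dist.logpdf(x)  # about -891.89: finite and usable
print(pdf, logpdf)
```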

Overall, the clustering simply fails. If you look at the cluster sizes, you can see both the numerical problems and the degenerate result.



One cluster has a size of 4480.98985, and each of the remaining 19 clusters has a size of 1.00053406. The sizes should sum to roughly 4480, the number of points ... but they don't quite. Also, 19 of the 20 clusters are singletons, so you may have an outlier problem as well.

K-means would not work any better here either.
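One common workaround, offered here as my own suggestion rather than something this answer prescribes, is to reduce the dimensionality before fitting the mixture, so the per-point densities stay in a representable range. A sketch with PCA (the sizes and component counts are assumptions, and the data is a random stand-in):

```python
# Project the 100-dimensional embeddings down before mixture modelling.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))  # stand-in for the (4480, 100) word2vec matrix

X_low = PCA(n_components=10, random_state=0).fit_transform(X)
model = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
)
labels = model.fit_predict(X_low)
print(X_low.shape, len(np.unique(labels)))
```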
