sklearn: top terms per cluster after decomposition

Is there a way to get the top features/terms for each cluster after data decomposition?

In this example from the sklearn documentation, the top terms are extracted by sorting the centroid weights and matching them against the vectorizer's feature names, since both have the same number of features.

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

I would like to know how to implement get_top_terms_per_cluster():

X = vectorizer.fit_transform(dataset)  # with m features
X = lsa.fit_transform(X)  # reduce number of features to m'
k_means.fit(X)
get_top_terms_per_cluster()  # out of m features






1 answer


Assuming lsa = TruncatedSVD(n_components=k) for some k, an obvious way to get the term weights is to use the fact that LSA/SVD is a linear transformation: each row of lsa.components_ is a weighted sum of the input terms, so you can multiply the k-means cluster centroids with that matrix.
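To see that back-projection in isolation, here is a toy numpy sketch (made-up numbers, not the 20newsgroups data):

```python
import numpy as np

# Toy stand-in for lsa.components_: 2 latent components over 4 terms;
# each row is a weighted sum of the input terms.
components = np.array([[0.9, 0.1, 0.0, 0.0],
                       [0.0, 0.0, 0.8, 0.2]])

# One k-means centroid, expressed in the 2-dimensional LSA space.
centroid = np.array([1.0, 0.5])

# Multiplying the centroid with the components projects it back onto
# the original 4 terms: 1.0 * row 0 + 0.5 * row 1.
term_weights = np.dot(centroid, components)
print(term_weights)  # weights 0.9, 0.1, 0.4, 0.1
```

The real example below does exactly this, just with all centroids at once (a matrix product instead of a vector-matrix product).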

Let's fetch some data and fit the models:

>>> import numpy as np
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> data = fetch_20newsgroups()
>>> vectorizer = TfidfVectorizer(min_df=3, max_df=.95, stop_words='english')
>>> lsa = TruncatedSVD(n_components=10)
>>> km = KMeans(n_clusters=3)
>>> X = vectorizer.fit_transform(data.data)
>>> X_lsa = lsa.fit_transform(X)
>>> km.fit(X_lsa)


Now multiply the LSA components and the k-means centroids:



>>> X.shape
(11314, 38865)
>>> lsa.components_.shape
(10, 38865)
>>> km.cluster_centers_.shape
(3, 10)
>>> weights = np.dot(km.cluster_centers_, lsa.components_)
>>> weights.shape
(3, 38865)


Then print them; we need the absolute values of the weights because of the sign indeterminacy in LSA:

>>> features = vectorizer.get_feature_names()
>>> weights = np.abs(weights)
>>> for i in range(km.n_clusters):
...     top5 = np.argsort(weights[i])[-5:]
...     print(zip([features[j] for j in top5], weights[i, top5]))
...     
[(u'escrow', 0.042965734662740895), (u'chip', 0.07227072329320372), (u'encryption', 0.074855609122467345), (u'clipper', 0.075661844826553887), (u'key', 0.095064798549230306)]
[(u'posting', 0.012893125486957332), (u'article', 0.013105911161236845), (u'university', 0.0131617377000081), (u'com', 0.023016036009601809), (u'edu', 0.034532489348082958)]
[(u'don', 0.02087448155525683), (u'com', 0.024327099321009758), (u'people', 0.033365757270264217), (u'edu', 0.036318114826463417), (u'god', 0.042203130080860719)]


Note that you really do need a stop word filter for this to work; otherwise stop words tend to end up in every single component and get a large weight in every cluster centroid.
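Wrapped up, the get_top_terms_per_cluster() from the question could look roughly like this (a sketch assuming fitted vectorizer, lsa and km objects as above; the n_terms parameter is my own addition):

```python
import numpy as np

def get_top_terms_per_cluster(vectorizer, lsa, km, n_terms=5):
    # Project the centroids from LSA space back onto the original m terms;
    # abs() because of the sign indeterminacy in LSA.
    weights = np.abs(np.dot(km.cluster_centers_, lsa.components_))
    try:
        features = vectorizer.get_feature_names_out()  # newer sklearn
    except AttributeError:
        features = vectorizer.get_feature_names()      # older sklearn
    top_terms = []
    for row in weights:
        top = np.argsort(row)[::-1][:n_terms]  # heaviest terms first
        top_terms.append([(features[j], row[j]) for j in top])
    return top_terms
```

Each element of the returned list is a list of (term, weight) pairs for one cluster, heaviest term first.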
