Using AIC and BIC to estimate the number of clusters in k-means document clustering
I am trying to find the optimal value of "k" for clustering a set of documents with the k-means algorithm. I want to use the AIC and BIC criteria to pick the best model, following this resource: "sherrytowers.com/2013/10/24/k-means-clustering/".
But when I ran the code, I got the AIC and BIC graphs linked below, and I cannot interpret anything from them. My questions:
- Is my approach wrong, i.e. can these measures (AIC, BIC) not be used to choose "k" when clustering documents with k-means?
- Or are AIC and BIC a correct way to find the number of clusters "k", and the errors are in my programming logic?
Here's my code:
library(tm)
library(SnowballC)
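corp <- Corpus(DirSource("/home/dataset/"), readerControl = list(blank.lines.skip = TRUE))  # build the corpus from the document set
corp <- tm_map(corp, stemDocument, language = "english")
# build the document-term matrix; wordLengths = c(1, Inf) keeps terms of any length
# (the original call passed "minwordlength = 1", which is not the name tm uses for this control option)
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))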
dtm_tfidf <- weightTfIdf(dtm)
m <- as.matrix(dtm_tfidf)
# scale every row (document) to unit Euclidean length; an all-zero row would give NaN here
norm_eucl <- function(m) m / apply(m, MARGIN = 1, FUN = function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)
kmax = 50
totwss = rep(0, kmax)  # will hold the total within-cluster sum of squares for each k
kmfit = list()         # create an empty list to hold the fitted models
for (i in 1:kmax) {
  kclus = kmeans(m_norm, centers = i, iter.max = 20)
  totwss[i] = kclus$tot.withinss
  kmfit[[i]] = kclus
}
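kmeansAIC = function(fit) {
  m = ncol(fit$centers)     # number of dimensions (terms)
  n = length(fit$cluster)   # number of documents (not used in AIC)
  k = nrow(fit$centers)     # number of clusters
  D = fit$tot.withinss      # total within-cluster sum of squares
  return(D + 2 * m * k)
}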
aic=sapply(kmfit,kmeansAIC)
plot(seq(1,kmax),aic,xlab="Number of clusters",ylab="AIC",pch=20,cex=2)
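kmeansBIC = function(fit) {
  m = ncol(fit$centers)     # number of dimensions (terms)
  n = length(fit$cluster)   # number of documents
  k = nrow(fit$centers)     # number of clusters
  D = fit$tot.withinss      # total within-cluster sum of squares
  return(D + log(n) * m * k)
}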
bic=sapply(kmfit,kmeansBIC)
plot(seq(1,kmax),bic,xlab="Number of clusters",ylab="BIC",pch=20,cex=2)
These are the graphs generated by the code: http://snag.gy/oAfhk.jpg http://snag.gy/vT8fZ.jpg
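One thing that may help when reading such graphs is to look at the two terms of each criterion separately. The following is only a minimal sketch on synthetic data, not the document set from the question: the toy matrix `x`, the helper name `km_ic`, the seed, and the `nstart = 5` setting are all assumptions made for illustration. It uses the same formulas as `kmeansAIC` and `kmeansBIC` above and reports the distortion `D` next to the penalty terms for a few values of k.

```r
# Toy stand-in for the normalized tf-idf matrix (hypothetical data, not the real corpus)
set.seed(1)
n_docs  <- 60
n_terms <- 200
x <- matrix(abs(rnorm(n_docs * n_terms)), nrow = n_docs)
x <- x / sqrt(rowSums(x^2))   # unit-length rows, same effect as norm_eucl

# Same ingredients as kmeansAIC / kmeansBIC, returned term by term
km_ic <- function(fit) {
  m <- ncol(fit$centers)      # number of dimensions (terms)
  n <- length(fit$cluster)    # number of documents
  k <- nrow(fit$centers)      # number of clusters
  D <- fit$tot.withinss       # total within-cluster sum of squares
  c(k = k, D = D, aic_penalty = 2 * m * k, bic_penalty = log(n) * m * k)
}

ks  <- 1:8
res <- t(sapply(ks, function(k) {
  km_ic(kmeans(x, centers = k, iter.max = 20, nstart = 5))
}))
print(round(res, 2))
```

With unit-length rows the total sum of squares cannot exceed the number of documents, so `D` is bounded by `n_docs`, while the penalty grows as `m * k`. If the vocabulary is large, the penalty column may dominate the sum, which is worth checking on the real matrix before drawing conclusions from the AIC and BIC plots.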