Use affinity function for scikit-learn clustering
I am using a function to calculate the similarity between a pair of documents and want to perform agglomerative clustering using this similarity measure.
The code so far:
    Sim=np.zeros((n, n)) # create a numpy array
    i=0
    j=0
    for i in range(0,n):
        for j in range(i,n):
            if i==j:
                Sim[i][j]=1
            else:
                Sim[i][j]=simfunction(list_doc[i],list_doc[j]) # calculate similarity between documents i and j using simfunction
    Sim=Sim+ Sim.T - np.diag(Sim.diagonal()) # complete the symmetric matrix
    AggClusterDistObj=AgglomerativeClustering(n_clusters=num_cluster,linkage='average',affinity="precomputed")
    Res_Labels=AggClusterDistObj.fit_predict(Sim)
I am worried that I passed a similarity matrix here, while according to the docs it should be a distance matrix. How can I convert it to a dissimilarity matrix? Also, is there a more efficient way to do this?
- Please format your code correctly, as indentation matters in Python.
- If possible, post the code in full (you left out import numpy as np).
- Since range starts at zero by default, you can omit the 0 and write range(n).
- Indexing in numpy works like [i, j, k, ...]. So instead of Sim[i][j] you really want to write Sim[i, j], because otherwise you do two things: first take the entire row as a slice and then index the column. Here is another way to copy the elements of the upper triangle to the lower one:

    Sim = np.identity(n)  # diagonal with ones (100 percent similarity)
    for i in range(n):
        for j in range(i+1, n):  # +1 skips the diagonal
            Sim[i, j] = simfunction(list_doc[i], list_doc[j])
    # Expand the matrix (copy triangle)
    tril = np.tril_indices_from(Sim, -1)  # lower triangle indices (without diagonal)
    Sim[tril] = Sim.T[tril]  # mirror the upper triangle into the lower one
- Assuming your similarities lie between 0 and 1, you can convert the similarity matrix to a distance matrix simply with

    dm = 1 - Sim

  numpy applies this operation elementwise (vectorized), so no explicit loop is needed.
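The points above can be put together in a small self-contained sketch. Note that jaccard_similarity is only a hypothetical stand-in for the asker's simfunction (which is not shown in the question), and the four toy documents are made up for illustration. The sketch assumes the similarity is symmetric and lies between 0 and 1.

```python
import numpy as np

def jaccard_similarity(doc_a, doc_b):
    """Hypothetical stand-in for simfunction: token-set overlap in [0, 1]."""
    a, b = set(doc_a.split()), set(doc_b.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity_to_distance(docs, simfunction):
    """Build a symmetric similarity matrix and return (sim, 1 - sim)."""
    n = len(docs)
    sim = np.identity(n)                   # self-similarity is 1
    rows, cols = np.triu_indices(n, k=1)   # all index pairs with i < j
    for i, j in zip(rows, cols):
        sim[i, j] = simfunction(docs[i], docs[j])
    sim[cols, rows] = sim[rows, cols]      # mirror upper triangle into lower
    dist = 1.0 - sim                       # elementwise; diagonal becomes 0
    return sim, dist

# Toy documents, invented for this example
list_doc = ["the cat sat", "the cat ran", "dogs bark loudly", "dogs bark"]
sim, dist = similarity_to_distance(list_doc, jaccard_similarity)
```

The resulting dist can then be passed to fit_predict in place of Sim in the question's code. Be aware that recent scikit-learn releases renamed the affinity parameter of AgglomerativeClustering to metric, so the keyword to use depends on the installed version.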