Use affinity function for scikit-learn clustering

Question

I am using a function to calculate the similarity between a pair of documents, and I want to perform clustering using this similarity measure.
The code so far:

Sim=np.zeros((n, n)) # create a numpy array
i=0
j=0
for i in range(0,n):
    for j in range(i,n):
        if i==j:
            Sim[i][j]=1
        else:
            Sim[i][j]=simfunction(list_doc[i],list_doc[j]) # calculate similarity between documents i and j using simfunction
Sim=Sim+ Sim.T - np.diag(Sim.diagonal()) # complete the symmetric matrix

AggClusterDistObj=AgglomerativeClustering(n_clusters=num_cluster,linkage='average',affinity="precomputed") 
Res_Labels=AggClusterDistObj.fit_predict(Sim)

I am worried that I used a similarity matrix here, while according to the docs it should be a dissimilarity (distance) matrix. How can I convert it to a dissimilarity matrix? Also, is there a more efficient way to do this?

python scikit-learn hierarchical-clustering

AMisra 03 Sep 14 at 17:13

1 answer

embert · Accepted Answer · 2014-10-02T07:02:14+0000

Please format your code correctly, as indentation matters in Python.
If possible, post the code in full (you left out import numpy as np).
Since range starts at zero by default, you can omit the start value and write range(n).

Indexing in numpy works like [i, j, k, ...].
So instead of Sim[i][j] you really want to write Sim[i, j], because otherwise you perform two operations: first you take the entire row, and then you index the column within it.

Here is another way to copy the elements of the upper triangle to the lower one.

Sim = np.identity(n) # diagonal with ones (100 percent similarity)

for i in range(n):      
    for j in range(i+1, n):    # +1 skips the diagonal 
        Sim[i, j]= simfunction(list_doc[i], list_doc[j])

# Expand the matrix (copy the upper triangle into the lower one)
tril = np.tril_indices_from(Sim, -1)  # lower-triangle indices (without diagonal)
Sim[tril] = Sim.T[tril]               # Sim[i, j] = Sim[j, i] for every i > j
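
To check the loop and the mirroring step end to end, here is a minimal runnable sketch. The simfunction below is a hypothetical stand-in (Jaccard overlap of word sets) and list_doc is made-up sample data; neither comes from the question. The lower triangle is filled by reading from the transpose, so each entry (i, j) below the diagonal receives the value stored at (j, i).

```python
import numpy as np

def simfunction(doc_a, doc_b):
    # Hypothetical stand-in: Jaccard similarity of the two word sets.
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

# Made-up sample documents, for illustration only.
list_doc = ["red apple pie", "green apple pie", "fast red car", "fast blue car"]
n = len(list_doc)

Sim = np.identity(n)  # ones on the diagonal
for i in range(n):
    for j in range(i + 1, n):  # upper triangle only
        Sim[i, j] = simfunction(list_doc[i], list_doc[j])

# Mirror the upper triangle into the lower one via the transpose.
tril = np.tril_indices_from(Sim, -1)
Sim[tril] = Sim.T[tril]

# The result should be symmetric.
assert np.allclose(Sim, Sim.T)
```

Because only the upper triangle is computed, simfunction is called n * (n - 1) / 2 times rather than n * n times.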

Assuming your similarities lie between 0 and 1, you can convert the similarity matrix to a distance matrix simply with

dm = 1 - Sim

NumPy vectorizes this operation, so the subtraction is applied element-wise without an explicit loop.
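
Putting the pieces together, the resulting distance matrix can be passed to AgglomerativeClustering as precomputed input. The following is a sketch under assumptions: the similarity values are made up for illustration, and the keyword that selects precomputed input depends on the scikit-learn version (affinity in older releases such as the one used in the question, metric in newer ones), so the code tries both.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up symmetric similarity matrix for four documents (values between 0 and 1).
Sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

dm = 1 - Sim  # element-wise conversion; the diagonal becomes 0

# Newer scikit-learn releases use `metric`, older ones use `affinity`.
try:
    model = AgglomerativeClustering(
        n_clusters=2, linkage="average", metric="precomputed"
    )
except TypeError:
    model = AgglomerativeClustering(
        n_clusters=2, linkage="average", affinity="precomputed"
    )

labels = model.fit_predict(dm)
print(labels)
```

With these values, documents 0 and 1 should land in one cluster and documents 2 and 3 in the other; which cluster gets which label number is arbitrary.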
