How to calculate the cophenetic correlation coefficient from the linkage matrix produced by fastcluster's memory-saving hierarchical clustering method?

I am using the fastcluster Python package to compute the linkage matrix for a hierarchical clustering procedure over a large set of observations.

So far, fastcluster's linkage_vector() method has been able to cluster a much larger set of observations than scipy's linkage() could handle with the same amount of memory.

With that done, I now want to check the clustering results and calculate the cophenetic correlation coefficient relative to the original data. The common procedure is to first compute the matrix of cophenetic distances and then correlate it with the original distances. Using scipy's cophenet() method, it looks something like this:


import fastcluster as fc
import numpy as np
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

X = np.random.random((1000, 10))  # Original data (1000 observations)
Z = fc.linkage_vector(X)          # Clustering

orign_dists = pdist(X)     # Condensed matrix of original distances between observations
cophe_dists = cophenet(Z)  # Condensed matrix of cophenetic distances between observations

# What I really want at the end of the day:
corr_coef = np.corrcoef(orign_dists, cophe_dists)[0, 1]
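
As an aside, scipy can also return the coefficient directly when the original condensed distances are passed as a second argument, though it still builds the full cophenetic distance matrix internally, so it does not help with the memory problem described below:

# Equivalent shortcut: with a second argument, cophenet() returns a
# (correlation coefficient, condensed cophenetic distances) pair.
corr_coef, cophe_dists = cophenet(Z, orign_dists)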

However, this does not work when the set of observations is very large (just replace 1000 with 100000 or so and you will see). fastcluster has no problem with the clustering itself, but scipy's cophenet() runs into memory problems with the resulting linkage matrix: for n = 100000 observations, each condensed distance matrix has n*(n-1)/2, i.e. roughly 5 billion, entries, which is about 40 GB as 64-bit floats.

In cases where the set of observations is too large for the standard scipy function, I am not aware of an alternative way to calculate the cophenetic correlation, whether offered by fastcluster or by any other package. Do you know of one? If so, how? If not, can you think of a smart and memory-efficient iterative way to achieve this with a custom function? I am collecting ideas here, maybe even a solution.
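
One direction I have been sketching myself (a minimal, untested sketch; the function name cophenetic_correlation_streaming and the pair_budget parameter are my own, and it assumes the clustering used the default Euclidean metric): every pair of observations gets its cophenetic distance at exactly one merge, namely the first merge that puts the two observations into the same cluster. So one can stream over the rows of Z, recompute the original distances only for the cross pairs of each merge, and accumulate the running sums needed for the Pearson correlation, without ever holding a full condensed matrix in memory:

import numpy as np
from scipy.spatial.distance import cdist

def cophenetic_correlation_streaming(X, Z, pair_budget=10_000_000):
    # Pearson correlation between original (Euclidean) distances and
    # cophenetic distances, streamed over the merges in the linkage
    # matrix Z so that memory stays around O(n + pair_budget).
    n = X.shape[0]
    # members[c] holds the leaf indices of active cluster c; the leaves
    # are clusters 0..n-1, and row i of Z creates cluster n+i.
    members = {i: np.array([i]) for i in range(n)}

    # Running sums over all n*(n-1)/2 pairs for the Pearson formula:
    # x = cophenetic distance, y = original distance.
    cnt = 0
    sx = sxx = sy = syy = sxy = 0.0

    for i, (a, b, h, _) in enumerate(Z):
        A = members.pop(int(a))
        B = members.pop(int(b))
        # Every cross pair (p in A, q in B) first joins at this merge,
        # so its cophenetic distance is exactly the merge height h.
        m = len(A) * len(B)
        cnt += m
        sx += h * m
        sxx += h * h * m
        # Original distances for the cross pairs, chunked so that no
        # more than about pair_budget distances are in memory at once.
        step = max(1, pair_budget // len(B))
        for j in range(0, len(A), step):
            D = cdist(X[A[j:j + step]], X[B])  # Euclidean by default
            s = D.sum()
            sy += s
            syy += (D * D).sum()
            sxy += h * s
        members[n + i] = np.concatenate([A, B])

    num = cnt * sxy - sx * sy
    den = np.sqrt((cnt * sxx - sx * sx) * (cnt * syy - sy * sy))
    return num / den

This trades time for memory: every original distance is recomputed exactly once, so it is O(n^2 * d) time but only O(n + pair_budget) memory, and for n = 100000 it would be called as corr_coef = cophenetic_correlation_streaming(X, Z). One caveat: the single-pass correlation formula can lose precision when the running sums get huge, so a two-pass or Welford-style accumulation might be safer.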
