Distance matrix for rows in pandas dataframe

I have a pandas dataframe that looks like this:

In [23]: dataframe.head()
Out[23]: 
column_id   1  10  11  12  13  14  15  16  17  18 ...  46  47  48  49   5  50  \
row_id                                            ...                           
1         NaN NaN   1   1   1   1   1   1   1   1 ...   1   1 NaN   1 NaN NaN   
10          1   1   1   1   1   1   1   1   1 NaN ...   1   1   1 NaN   1 NaN   
100         1   1 NaN   1   1   1   1   1 NaN   1 ... NaN NaN   1   1   1 NaN   
11        NaN   1   1   1   1   1   1   1   1 NaN ... NaN   1   1   1   1   1   
12          1   1   1 NaN   1   1   1   1 NaN   1 ...   1 NaN   1   1 NaN   1   

      

I am currently using Pearson correlation to calculate the similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:

In [24]: dataframe.transpose().corr().head()
Out[24]: 
row_id   1  10  100  11  12  13  14  15  16  17 ...  90  91  92  93  94  95  \
row_id                                          ...                           
1      NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
10     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
100    NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
11     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
12     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN    

      

Is there another way to calculate correlations that avoids this? Or perhaps a simple way to calculate the pairwise Euclidean distance between rows, in the same one-shot way as the Pearson correlation?

Thanks!




3 answers


The key question here is which distance metric to use.

Let's say this is your data.

>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
   0   1   2   3   4   5   6   7   8   9  ...  40  41  42  43  44  45  46  47  \
0   1   1   1 NaN   1 NaN NaN   1   1   1 ...   1   1 NaN   1 NaN   1   1   1
1   1   1   1 NaN   1   1   1   1   1   1 ... NaN   1   1 NaN NaN   1   1   1
2   1   1   1   1   1   1   1   1   1   1 ...   1 NaN   1   1   1   1   1 NaN
3   1 NaN   1 NaN   1 NaN   1 NaN   1   1 ...   1   1   1   1 NaN   1   1   1
4   1   1   1   1   1   1   1   1 NaN   1 ... NaN   1   1   1   1   1   1   1

      

What is the % difference?

You can calculate the distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any two columns.

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
     0     1     2     3     4     5     6     7     8     9   ...     40  \
0  0.00  0.36  0.33  0.37  0.32  0.41  0.35  0.33  0.39  0.33  ...   0.37
1  0.36  0.00  0.37  0.29  0.30  0.37  0.33  0.37  0.33  0.31  ...   0.35
2  0.33  0.37  0.00  0.36  0.29  0.38  0.40  0.34  0.30  0.28  ...   0.28
3  0.37  0.29  0.36  0.00  0.29  0.30  0.34  0.26  0.32  0.36  ...   0.36
4  0.32  0.30  0.29  0.29  0.00  0.31  0.35  0.29  0.29  0.25  ...   0.27
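
As an aside (my addition, not part of the original answer), the nested apply makes O(n²) Python-level calls and gets slow for wide frames; scipy's pdist produces the same matrix vectorized. For 0/1 data, the 'hamming' metric is exactly this fraction of differing entries:

>>> from scipy.spatial.distance import pdist, squareform
>>> # pdist compares rows, so transpose to compare columns
>>> result = pd.DataFrame(squareform(pdist(zero_data.T, metric='hamming')),
...                       index=zero_data.columns, columns=zero_data.columns)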

      

What is the correlation coefficient?

Here we use Pearson's correlation coefficient. This is a perfectly valid metric; in particular, for binary data it reduces to the phi coefficient.



>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  1.000000  0.013158  0.026262 -0.059786 -0.024293 -0.078056  0.054074
1  0.013158  1.000000 -0.093109  0.170159  0.043187  0.027425  0.108148
2  0.026262 -0.093109  1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786  0.170159 -0.124540  1.000000  0.004245  0.184153  0.042524
4 -0.024293  0.043187 -0.048485  0.004245  1.000000  0.079196 -0.099834

      

Incidentally, this gives the same result as the Spearman R coefficient: with only two distinct values, ranking is a monotone relabeling, so Spearman and Pearson coincide.
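
As another aside (not in the original answer), pandas can produce this same matrix in one call, since DataFrame.corr computes pairwise Pearson by default:

>>> zero_data.corr().head()   # same values as the result above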

What is Euclidean distance?

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  0.000000  6.000000  5.744563  6.082763  5.656854  6.403124  5.916080
1  6.000000  0.000000  6.082763  5.385165  5.477226  6.082763  5.744563
2  5.744563  6.082763  0.000000  6.000000  5.385165  6.164414  6.324555
3  6.082763  5.385165  6.000000  0.000000  5.385165  5.477226  5.830952
4  5.656854  5.477226  5.385165  5.385165  0.000000  5.567764  5.916080
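
The pdist approach from the aside above works here too (again my addition, not part of the original answer); just switch the metric:

>>> result = pd.DataFrame(squareform(pdist(zero_data.T, metric='euclidean')),
...                       index=zero_data.columns, columns=zero_data.columns)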

      

You should get the picture by now. Define a distance method, then apply it pairwise to every column using:

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on zeros instead of NaNs, convert them with .fillna(0).


A suggestion to improve on @s-anand's excellent answer, for Euclidean distance: instead of

import numpy as np

zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

      

we can apply fillna after taking the difference, so that only the missing entries are zeroed out:



distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

      

This way, missing dimensions do not contribute to the distance.
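
A tiny check of the difference (my example, not from the original answer): two columns that agree wherever both are present come out at distance zero with the second form, but not with the first:

>>> import numpy as np
>>> import pandas as pd
>>> a = pd.Series([1.0, np.nan, 1.0])
>>> b = pd.Series([1.0, 1.0, np.nan])
>>> float(np.linalg.norm(a.fillna(0) - b.fillna(0)))  # missing dims counted as mismatches
1.4142135623730951
>>> float(np.linalg.norm((a - b).fillna(0)))          # missing dims dropped
0.0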



This is my numpy-only version of @S Anand's fantastic answer, which I put together to better understand his explanation.

Happy to share it as a short, reproducible example:

# Preliminaries
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

# Get the iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

      

Let's first try scipy.stats.pearsonr.

Running:

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

      

returns: [image in the original post: the pairwise correlation matrix as a labelled DataFrame, with the same values as the numpy array below]

and:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0],
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind': float_formatter})
rslt_np

      

returns:

array([[1.00, -0.12, 0.87, 0.82, 0.78],
       [-0.12, 1.00, -0.43, -0.37, -0.43],
       [0.87, -0.43, 1.00, 0.96, 0.95],
       [0.82, -0.37, 0.96, 1.00, 0.96],
       [0.78, -0.43, 0.95, 0.96, 1.00]])
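
For completeness (my addition, not part of the original answer), numpy can also produce this matrix in a single vectorized call:

np.corrcoef(iris_df.to_numpy(), rowvar=False)  # treat columns as variables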

      

As a second example, let's try the distance correlation from the dcor library.

Running:

import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt 

      

returns: [image in the original post: the distance-correlation matrix as a labelled DataFrame, with the same values as the numpy array below]

while:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2),
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind': float_formatter})
rslt_np

      

returns:

array([[1.00, 0.31, 0.86, 0.83, 0.78],
       [0.31, 1.00, 0.54, 0.51, 0.51],
       [0.86, 0.54, 1.00, 0.97, 0.95],
       [0.83, 0.51, 0.97, 1.00, 0.95],
       [0.78, 0.51, 0.95, 0.95, 1.00]])

      







