Distance matrix for rows in pandas dataframe

Question


I have a pandas DataFrame that looks like this:

In [23]: dataframe.head()
Out[23]: 
column_id   1  10  11  12  13  14  15  16  17  18 ...  46  47  48  49   5  50  \
row_id                                            ...                           
1         NaN NaN   1   1   1   1   1   1   1   1 ...   1   1 NaN   1 NaN NaN   
10          1   1   1   1   1   1   1   1   1 NaN ...   1   1   1 NaN   1 NaN   
100         1   1 NaN   1   1   1   1   1 NaN   1 ... NaN NaN   1   1   1 NaN   
11        NaN   1   1   1   1   1   1   1   1 NaN ... NaN   1   1   1   1   1   
12          1   1   1 NaN   1   1   1   1 NaN   1 ...   1 NaN   1   1 NaN   1

I am currently using Pearson correlation to calculate the similarity between rows. Given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:

In [24]: dataframe.transpose().corr().head()
Out[24]: 
row_id   1  10  100  11  12  13  14  15  16  17 ...  90  91  92  93  94  95  \
row_id                                          ...                           
1      NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
10     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
100    NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
11     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
12     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN

Is there another way to compute correlations that avoids this? Perhaps an easy way to calculate the Euclidean distance between rows with a single method, just as Pearson correlation has?

Thanks!

+4

python numpy pandas

misterte Apr 18 15 at 22:20


3 answers

A suggestion to improve on @S Anand's excellent answer for the Euclidean distance: instead of

import numpy as np
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna after the subtraction, so that only the missing differences are filled:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, a dimension where either value is missing contributes nothing to the distance.
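To make the difference between the two variants concrete, here is a minimal sketch on a made-up two-column frame. The column names `a` and `b` and the values are illustrative assumptions, not data from the question.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 1 has a value in "a" but is missing in "b".
toy = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [1.0, np.nan, np.nan]})

# Variant 1: fill first, so a missing value counts as 0 and differs from 1.
filled = toy.fillna(0)
d_fill_first = np.linalg.norm(filled["a"] - filled["b"])

# Variant 2: subtract first. Any difference involving NaN is NaN, which is
# then replaced by 0, so that dimension adds nothing to the distance.
d_fill_after = np.linalg.norm((toy["a"] - toy["b"]).fillna(0))

print(d_fill_first, d_fill_after)
```

Which variant is appropriate probably depends on whether NaN means "unknown" (ignore the dimension) or "absent" (treat it as a real difference).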

+2

maparent 10 jan. 17 at 21:12


This is my numpy-only version of @S Anand's fantastic answer, which I put together to better understand his explanation.

Happy to share it with a short, reproducible example:

# Preliminaries
import pandas as pd
import numpy as np
from scipy.stats import pearsonr  # used by the correlation examples below

# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

Let's first try scipy.stats.pearsonr.

Running:

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns the 5×5 correlation matrix as a DataFrame (output table not preserved here), and:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0], 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, -0.12, 0.87, 0.82, 0.78],
       [-0.12, 1.00, -0.43, -0.37, -0.43],
       [0.87, -0.43, 1.00, 0.96, 0.95],
       [0.82, -0.37, 0.96, 1.00, 0.96],
       [0.78, -0.43, 0.95, 0.96, 1.00]])
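As a cross-check on the nested `np.apply_along_axis` pattern, the sketch below compares it with `np.corrcoef`, which computes the same Pearson matrix in one call. It uses random stand-in data rather than iris (an assumption made to keep the example free of scikit-learn), so the numbers will not match the array above.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
arr = rng.normal(size=(150, 5))  # stand-in data, not the iris values

# Nested pattern from the answer: every column against every column.
nested = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: pearsonr(col1, col2)[0], axis=0, arr=arr),
    axis=0, arr=arr)

# Vectorized equivalent: rowvar=False treats columns as the variables.
vectorized = np.corrcoef(arr, rowvar=False)

print(np.allclose(nested, vectorized))
```

The nested form is slower but accepts any two-argument metric, which is why it generalizes to the next example.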

As a second example, let's try the distance correlation from the dcor library.

Running:

import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns the 5×5 distance-correlation matrix as a DataFrame (output table not preserved here), while:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2), 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, 0.31, 0.86, 0.83, 0.78],
       [0.31, 1.00, 0.54, 0.51, 0.51],
       [0.86, 0.54, 1.00, 0.97, 0.95],
       [0.83, 0.51, 0.97, 1.00, 0.95],
       [0.78, 0.51, 0.95, 0.95, 1.00]])

0

MyCarta 24 Sep 19 at 16:47


S Anand · Accepted Answer · 2015-04-19T15:33:35+0000

The key question here is which distance metric to use.

Let's say this is your data.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
   0   1   2   3   4   5   6   7   8   9  ...  40  41  42  43  44  45  46  47  \
0   1   1   1 NaN   1 NaN NaN   1   1   1 ...   1   1 NaN   1 NaN   1   1   1
1   1   1   1 NaN   1   1   1   1   1   1 ... NaN   1   1 NaN NaN   1   1   1
2   1   1   1   1   1   1   1   1   1   1 ...   1 NaN   1   1   1   1   1 NaN
3   1 NaN   1 NaN   1 NaN   1 NaN   1   1 ...   1   1   1   1 NaN   1   1   1
4   1   1   1   1   1   1   1   1 NaN   1 ... NaN   1   1   1   1   1   1   1

What is the % difference?

You can compute the distance metric as the proportion of values that differ between each pair of columns. The result shows the % difference between any two columns.

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
     0     1     2     3     4     5     6     7     8     9   ...     40  \
0  0.00  0.36  0.33  0.37  0.32  0.41  0.35  0.33  0.39  0.33  ...   0.37
1  0.36  0.00  0.37  0.29  0.30  0.37  0.33  0.37  0.33  0.31  ...   0.35
2  0.33  0.37  0.00  0.36  0.29  0.38  0.40  0.34  0.30  0.28  ...   0.28
3  0.37  0.29  0.36  0.00  0.29  0.30  0.34  0.26  0.32  0.36  ...   0.36
4  0.32  0.30  0.29  0.29  0.00  0.31  0.35  0.29  0.29  0.25  ...   0.27
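For 0/1 data this proportion is the Hamming distance, so the result can be cross-checked against `scipy.spatial.distance.pdist`. The sketch below is one way to do that, assuming freshly generated random data of the same shape (so the values will differ from the table above).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Same kind of data as above, already zero-filled: 1 if the draw > 0.2, else 0.
zero_data = pd.DataFrame((rng.random((100, 50)) > 0.2).astype(float))

# Nested-apply version from the answer.
distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))

# pdist compares rows, so transpose to compare columns instead.
hamming = pd.DataFrame(
    squareform(pdist(zero_data.T.values, metric="hamming")),
    index=zero_data.columns, columns=zero_data.columns)

print(np.allclose(result.values, hamming.values))
```

To compare rows rather than columns, as the question asks, drop the transpose and pass `zero_data.values` directly.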

What is the correlation coefficient?

Here we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it reduces to the phi coefficient in the case of binary data.

>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  1.000000  0.013158  0.026262 -0.059786 -0.024293 -0.078056  0.054074
1  0.013158  1.000000 -0.093109  0.170159  0.043187  0.027425  0.108148
2  0.026262 -0.093109  1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786  0.170159 -0.124540  1.000000  0.004245  0.184153  0.042524
4 -0.024293  0.043187 -0.048485  0.004245  1.000000  0.079196 -0.099834

Incidentally, for binary data like this, the Spearman rank correlation coefficient gives the same result.
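Both claims about binary data (Pearson equals the phi coefficient, and Spearman equals Pearson) can be checked numerically. This is a minimal sketch on two randomly generated 0/1 vectors, which are assumptions for illustration; phi is computed from the 2×2 contingency counts.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = (rng.random(200) > 0.2).astype(float)
y = (rng.random(200) > 0.2).astype(float)

# Counts of the 2x2 contingency table.
n11 = np.sum((x == 1) & (y == 1))
n10 = np.sum((x == 1) & (y == 0))
n01 = np.sum((x == 0) & (y == 1))
n00 = np.sum((x == 0) & (y == 0))

# Phi coefficient from the contingency counts.
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

r = pearsonr(x, y)[0]
rho = spearmanr(x, y)[0]

print(np.isclose(phi, r), np.isclose(rho, r))
```

Spearman agrees here because ranking a two-valued variable is just a linear rescaling of it, which leaves the Pearson correlation unchanged.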

What is the Euclidean distance?

>>> import numpy as np
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  0.000000  6.000000  5.744563  6.082763  5.656854  6.403124  5.916080
1  6.000000  0.000000  6.082763  5.385165  5.477226  6.082763  5.744563
2  5.744563  6.082763  0.000000  6.000000  5.385165  6.164414  6.324555
3  6.082763  5.385165  6.000000  0.000000  5.385165  5.477226  5.830952
4  5.656854  5.477226  5.385165  5.385165  0.000000  5.567764  5.916080

By now you should have a sense of the pattern. Create a distance method, then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on zeros instead of NaNs, convert them to zeros with .fillna(0).
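Since the question asks about rows rather than columns, the same pattern can be applied to the transposed frame. The following is a sketch under stated assumptions: `pairwise_rows` is a hypothetical helper name, and the toy frame only imitates the shape of the question's data.

```python
import numpy as np
import pandas as pd

def pairwise_rows(df, metric):
    """Apply metric(row_a, row_b) to every pair of rows of df."""
    cols = df.T  # rows become columns, so the column-wise pattern applies
    return cols.apply(lambda r1: cols.apply(lambda r2: metric(r1, r2)))

# Hypothetical data shaped like the question's: every value is 1 or NaN.
toy = pd.DataFrame(
    [[1, np.nan, 1], [1, 1, np.nan], [np.nan, np.nan, 1]],
    index=["r1", "r2", "r3"], columns=["c1", "c2", "c3"])

euclid = lambda a, b: np.linalg.norm(a - b)
result = pairwise_rows(toy.fillna(0), euclid)
print(result)
```

The result is indexed by row label on both axes, so `result.loc["r1", "r2"]` is the distance between rows `r1` and `r2`.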
