Select values from any column / row based on criteria
I have a correlation matrix as a dataframe. Something like:
xyz abc def
xyz 1 0.1 -0.2
abc 0.1 1 0.3
def -0.2 0.3 1
I need to be able to select all values above or below a certain threshold, but of course they can be in any row or column.
For example, select all values that are greater than 0.2. There are two results:
(def, abc) and (abc, def)
I'm not sure how to do this as it involves looking for values based on criteria in each row / column. Ideally, the output should be in a format that easily identifies pairs (ex: a list of tuples or something like that)
edit: oh, and of course all the same columns / rows will also be in the results of the above example (ex: xyz / xyz, abc / abc, def / def)
Here is one way: use np.triu_indices_from to mask out the upper triangle (and the diagonal), then reshape the correlation matrix with stack so that each value is labelled by its (row, column) pair.
import pandas as pd
import numpy as np
# simulate some data to generate corr_mat
# ==============================================
np.random.seed(0)
data = np.random.multivariate_normal([0,0,0], [[1,0.1,-0.2],[0.1,1,0.3],[-0.2,0.3,1]], 10000)
df = pd.DataFrame(data, columns='xyz abc def'.split())
corr_mat = df.corr()
corr_mat
xyz abc def
xyz 1.0000 0.1216 -0.1901
abc 0.1216 1.0000 0.3014
def -0.1901 0.3014 1.0000
# processing
# =======================================
# mask on lower-triangle only
mask = np.ones_like(corr_mat, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = False
mask
array([[False, False, False],
[ True, False, False],
[ True, True, False]], dtype=bool)
# reshape the correlation matrix, and select corr > 0.2
corr_stacked = corr_mat.stack()
corr_stacked[(corr_stacked > 0.2) & (mask.ravel())]
def abc 0.3014
dtype: float64
# call reset_index() on the result to turn its multi-level index into ordinary columns
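Since the question asks for something like a list of tuples, here is a minimal sketch of that last step. It assumes the correlation matrix is typed in by hand from the question rather than simulated, and the column names "row", "col" and "corr" are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Correlation matrix from the question, entered by hand (an assumption
# made so the example is deterministic).
labels = ["xyz", "abc", "def"]
corr_mat = pd.DataFrame(
    [[1.0, 0.1, -0.2],
     [0.1, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)

# Strict lower triangle: each pair appears once and the diagonal is dropped.
mask = np.tril(np.ones(corr_mat.shape, dtype=bool), k=-1)

# stack() walks the matrix row by row, matching the order of mask.ravel().
corr_stacked = corr_mat.stack()

# Values above a threshold, and values below one.
above = corr_stacked[(corr_stacked > 0.2) & mask.ravel()]
below = corr_stacked[(corr_stacked < 0) & mask.ravel()]

# The MultiIndex already holds (row, column) label pairs.
pairs_above = above.index.tolist()
pairs_below = below.index.tolist()

# The same selection as a flat table with one row per pair.
table = above.rename_axis(["row", "col"]).reset_index(name="corr")
print(pairs_above, pairs_below)
print(table)
```

If the matrix can contain NaN (for example from a constant column), older versions of stack drop those entries and the flattened mask no longer lines up, so it may be safer to apply the mask with corr_mat.where(mask) before stacking.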