Select values from any column / row based on criteria

I have a correlation matrix as a dataframe. Something like:

       xyz   abc  def
xyz    1     0.1  -0.2
abc    0.1   1    0.3
def    -0.2  0.3  1

      

I need to be able to select all values above or below a certain threshold, but of course they can be in any row or column.

For example, select all values that are greater than 0.2. There are two results:

(def, abc) and (abc, def)

I'm not sure how to do this, as it involves checking the values in every row and column against the criteria. Ideally, the output should be in a format that makes the pairs easy to identify (e.g. a list of tuples or something like that).

edit: oh, and of course the diagonal entries (e.g. xyz / xyz, abc / abc, def / def) will also show up in the results of the example above.



3 answers


Here is one way: use np.triu_indices_from to mask out the upper triangle (including the diagonal), then reshape the correlation matrix with stack.



import pandas as pd
import numpy as np

# simulate some data to generate corr_mat
# ==============================================
np.random.seed(0)
data = np.random.multivariate_normal([0,0,0], [[1,0.1,-0.2],[0.1,1,0.3],[-0.2,0.3,1]], 10000)
df = pd.DataFrame(data, columns='xyz abc def'.split())
corr_mat = df.corr()
corr_mat

        xyz     abc     def
xyz  1.0000  0.1216 -0.1901
abc  0.1216  1.0000  0.3014
def -0.1901  0.3014  1.0000

# processing
# =======================================
# mask on lower-triangle only
mask = np.ones(corr_mat.shape, dtype=bool)  # builtin bool; the np.bool alias is not available in every NumPy version
mask[np.triu_indices_from(mask)] = False
mask

array([[False, False, False],
       [ True, False, False],
       [ True,  True, False]], dtype=bool)

# reshape the correlation matrix, and select corr > 0.2
corr_stacked = corr_mat.stack()
corr_stacked[(corr_stacked > 0.2) & (mask.ravel())]

def  abc    0.3014
dtype: float64

# you can call reset_index() to move the multi-level index into columns
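
As a rough sketch of that last step, the filtered Series can be turned into a flat table or into the list of tuples the question asks for. The matrix here is the one from the question, typed in by hand rather than simulated; the names `selected`, `table`, `pairs` and the column labels `row`, `col`, `corr` are made up for illustration.

```python
import numpy as np
import pandas as pd

# the correlation matrix from the question, entered by hand
labels = ["xyz", "abc", "def"]
corr_mat = pd.DataFrame(
    [[1.0, 0.1, -0.2],
     [0.1, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)

# True strictly below the diagonal, so each pair is kept only once
mask = np.tril(np.ones(corr_mat.shape, dtype=bool), k=-1)

corr_stacked = corr_mat.stack()
selected = corr_stacked[(corr_stacked > 0.2) & mask.ravel()]

# move the two index levels into ordinary columns
table = selected.rename_axis(["row", "col"]).reset_index(name="corr")

# or keep just the (row, col) labels as a list of tuples
pairs = selected.index.tolist()
print(table)
print(pairs)
```

For values below a threshold, the same pattern should work with the comparison flipped (for example `corr_stacked < -0.1`).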

      



Flatten the 2-D matrix into a 1-D list of ((row, col), val) tuples. Sort by val. Then retrieve the (row, col) tuples where val > 0.2.
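
One possible reading of those steps in plain Python, using the matrix from the question entered by hand. The names `flat` and `pairs` are illustrative, and skipping the diagonal is an extra choice made here, not something this answer specifies.

```python
import pandas as pd

# the correlation matrix from the question, entered by hand
labels = ["xyz", "abc", "def"]
corr_mat = pd.DataFrame(
    [[1.0, 0.1, -0.2],
     [0.1, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)

# flatten the 2-D matrix into a 1-D list of ((row, col), val) tuples
flat = [((r, c), corr_mat.loc[r, c])
        for r in corr_mat.index
        for c in corr_mat.columns]

# sort by val, largest first
flat.sort(key=lambda item: item[1], reverse=True)

# keep the (row, col) labels above the threshold,
# skipping the diagonal (assumption: self-correlations are not wanted)
pairs = [rc for rc, val in flat if val > 0.2 and rc[0] != rc[1]]
print(pairs)
```

On this matrix that leaves the two pairs named in the question, (abc, def) and (def, abc). Since the list is sorted, the scan could also stop at the first value that fails the test.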





Assuming your dataframe is "df" and your threshold is "value", you can do something like:

df[df>value]

or

df[df>value].dropna(axis=1, how="all")

if you want to drop the columns without any matches.
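
A small sketch of what this gives on the matrix from the question. Blanking the diagonal first and stacking into tuples afterwards are additions made here for illustration, not part of this answer; `off_diag`, `masked`, `trimmed` and `pairs` are made-up names.

```python
import numpy as np
import pandas as pd

# the correlation matrix from the question, entered by hand
labels = ["xyz", "abc", "def"]
df = pd.DataFrame(
    [[1.0, 0.1, -0.2],
     [0.1, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)
value = 0.2

# assumption: blank out the diagonal so the 1.0 self-correlations do not match
off_diag = df.mask(np.eye(len(df), dtype=bool))

# same shape as df, with NaN wherever the condition fails
masked = off_diag[off_diag > value]

# drop the columns that have no match at all (here: xyz)
trimmed = masked.dropna(axis=1, how="all")

# to get (row, col) pairs, stack and discard the remaining NaN cells
pairs = masked.stack().dropna().index.tolist()
print(trimmed)
print(pairs)
```

Without the diagonal step, every column contains a 1.0 on the diagonal, so `dropna(axis=1, how="all")` would not remove anything from a correlation matrix.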
