Pairwise correlation

I have a dataframe that looks something like this:

In [45]: df 
Out[45]: 
   Item_Id  Location_Id  date  price
0        A         5372     1    0.5
1        A         5372     2    NaN
2        A         5372     3    1.0
3        A         6065     1    1.0
4        A         6065     2    1.0
5        A         6065     3    3.0
6        A         7000     1    NaN
7        A         7000     2    NaN
8        A         7000     3    NaN
9        B         5372     1    3.0
10       B         5372     2    NaN
11       B         5372     3    1.0
12       B         6065     1    2.0
13       B         6065     2    1.0
14       B         6065     3    3.0
15       B         7000     1    8.0
16       B         7000     2    NaN
17       B         7000     3    9.0

For every Location_Id group, I want to calculate the pairwise price correlation between each pair of Item_Id values. Note that although the sample data above contains only two unique Item_Id values, Item_Id takes dozens of different values in my real data. I tried using groupby.corr(), but that doesn't seem to give me what I want.

Ultimately, I want N dataframes, where N is the number of unique Location_Id values in df. Each of the N dataframes will be a square price correlation matrix between all pairs of Item_Id values present in a particular Location_Id group. Thus, each of the N dataframes will contain J rows and J columns, where J is the number of unique Item_Id values in that particular Location_Id group.

1 answer


You can group by Location_Id, then pivot each group on date and Item_Id and compute the correlations:

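>>> # keyword arguments: recent pandas versions no longer accept pivot() arguments positionally
>>> corr = lambda obj: obj.pivot(index='date', columns='Item_Id', values='price').corr()
>>> df.groupby('Location_Id').apply(corr)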
Item_Id                  A      B
Location_Id Item_Id              
5372        A        1.000 -1.000
            B       -1.000  1.000
6065        A        1.000  0.866
            B        0.866  1.000
7000        A          NaN    NaN
            B          NaN  1.000

      



This gives a 2 x 2 matrix for each Location_Id.
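
Since the question asks for N separate dataframes rather than one stacked result, here is a minimal sketch of one way to get them: a dict comprehension over the groups. The name corr_by_location is hypothetical, the sample frame is rebuilt from the question, and pivot is called with keyword arguments on the assumption that a recent pandas version is in use.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Item_Id": ["A"] * 9 + ["B"] * 9,
    "Location_Id": ([5372] * 3 + [6065] * 3 + [7000] * 3) * 2,
    "date": [1, 2, 3] * 6,
    "price": [0.5, nan, 1.0, 1.0, 1.0, 3.0, nan, nan, nan,
              3.0, nan, 1.0, 2.0, 1.0, 3.0, 8.0, nan, 9.0],
})

# One square Item_Id x Item_Id correlation matrix per Location_Id.
# corr_by_location is a hypothetical name, not part of pandas.
corr_by_location = {
    loc: grp.pivot(index="date", columns="Item_Id", values="price").corr()
    for loc, grp in df.groupby("Location_Id")
}

print(corr_by_location[6065].round(3))
```

Each value in the dict is a J x J frame, where J is the number of distinct Item_Id values at that location. Pairs with too few overlapping non-missing prices come out as NaN, as for location 7000 in the output above.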
