Sorting bins from pandas cut

Using pandas cut I can define cells by providing edges, and pandas creates type bins (a, b]

.

My question is, how can I sort the bins (lowest to highest)?

import numpy as np
import pandas as pd

y = pd.Series(np.random.randn(100))

x1 = pd.Series(np.sign(np.random.randn(100)))
x2 = pd.cut(pd.Series(np.random.randn(100)), bins = [-3, -0.5, 0, 0.5, 3])

model = pd.concat([y, x1, x2], axis = 1, keys = ['Y', 'X1', 'X2'])

      

I have an intermediate result where the order of the bins is preserved

int_output = model.groupby(['X1', 'X2']).mean().unstack()
int_output.columns = int_output.columns.get_level_values(1)

X2    (-3, -0.5]  (-0.5, 0]  (0, 0.5]  (0.5, 3]
X1                                             
-1.0    0.101475  -0.344419 -0.482992 -0.015179
 1.0    0.249961   0.484757 -0.066383 -0.249414

      

But then I do other operations that arbitrarily change the bin order:

output = pd.concat(int_output.to_dict('series'), axis = 1)

      (-0.5, 0]  (-3, -0.5]  (0, 0.5]  (0.5, 3]
X1                                             
-1.0  -0.344419    0.101475 -0.482992 -0.015179
 1.0   0.484757    0.249961 -0.066383 -0.249414

      

Now, I would like to plot the data on a histogram, but I want the cells to sort from smallest (-3, -0.5) to highest (0.5, 3).

I think I can achieve this by manipulating the string using separating by "," and then clearing the brackets, but I would like to know if there is a better way.

+3


source to share


2 answers


The main problem is to lose ordered

CategoricalIndex

.

np.random.seed(12456)
y = pd.Series(np.random.randn(100))
x1 = pd.Series(np.sign(np.random.randn(100)))
x2 = pd.cut(pd.Series(np.random.randn(100)), bins = [-3, -0.5, 0, 0.5, 3])

model = pd.concat([y, x1, x2], axis = 1, keys = ['Y', 'X1', 'X2'])
int_output = model.groupby(['X1', 'X2']).mean().unstack()
int_output.columns = int_output.columns.get_level_values(1)

print (int_output)
X2    (-3, -0.5]  (-0.5, 0]  (0, 0.5]  (0.5, 3]
X1                                             
-1.0    0.230060  -0.079266 -0.079834 -0.064455
 1.0   -0.451351   0.268688  0.020091 -0.280218

print (int_output.columns)
CategoricalIndex(['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'], 
                 categories=['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'], 
                 ordered=True, name='X2', dtype='category')

output = pd.concat(int_output.to_dict('series'), axis = 1)
print (output)
      (-0.5, 0]  (-3, -0.5]  (0, 0.5]  (0.5, 3]
X1                                             
-1.0  -0.079266    0.230060 -0.079834 -0.064455
 1.0   0.268688   -0.451351  0.020091 -0.280218

print (output.columns)
Index(['(-0.5, 0]', '(-3, -0.5]', '(0, 0.5]', '(0.5, 3]'], dtype='object')

      



One possible solution is the extract

first number from output.columns

, create a sub-series and sort it. Last reindex

source columns:

cat = output.columns.str.extract('\((.*),', expand=False).astype(float)
a = pd.Series(cat, index=output.columns).sort_values()
print (a)
(-3, -0.5]   -3.0
(-0.5, 0]    -0.5
(0, 0.5]      0.0
(0.5, 3]      0.5
dtype: float64

output = output.reindex(columns=a.index)
print (output)
      (-3, -0.5]  (-0.5, 0]  (0, 0.5]  (0.5, 3]
X1                                             
-1.0    0.230060  -0.079266 -0.079834 -0.064455
 1.0   -0.451351   0.268688  0.020091 -0.280218

      

+2


source


An easy fix for the problem you highlighted above is to simply reorder the columns:



output[sorted(output.columns)]

      

0


source







All Articles