Splitting histograms in Pandas
I am reading a CSV file with pandas and plotting simple histograms like this:
import sys
import pandas as pd
df = pd.read_csv(sys.argv[1], header=0)
hFare = df['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="All")
hSurFare = df[df.Survived==1]['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="Survivors")
What I would like is the bin-by-bin ratio of the two histograms. Is there an easy way to do this?
1 answer
First, we'll create some sample data. In the future, if you ask a question about pandas, your best bet is to include example data that people can easily copy-paste into their Python console:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Fare': np.random.uniform(0, 600, 400),
                   'Survived': np.random.randint(0, 2, 400)})
Then use pd.cut to bin the data with the same edges you used in your histogram:
df['fare_bin'] = pd.cut(df['Fare'], bins=[0,10,20,30,45,60,75,100,600])
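To make the binning step concrete, here is a minimal sketch of what pd.cut returns for a handful of fares. The values in `fares` are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical fares, chosen so each one lands in a known bin
fares = pd.Series([5.0, 10.0, 10.5, 80.0, 600.0, 0.0])
bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]

# By default each bin is open on the left and closed on the right: (0, 10], (10, 20], ...
binned = pd.cut(fares, bins=bins)
print(binned)
```

Note that with the defaults a fare of exactly 0 falls outside the first bin and becomes NaN. Passing include_lowest=True, or right=False for left-closed bins, changes how the edges are treated, so the counts can differ slightly from a histogram drawn with the same edges.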
Look at the total and the number of survivors in each bin (you could probably do this as separate columns, but this is a quick way):
df.groupby('fare_bin').apply(lambda g: (g.shape[0], g.loc[g['Survived'] == 1, :].shape[0]))
Out[34]:
fare_bin
(0, 10] (7, 4)
(10, 20] (9, 6)
(100, 600] (326, 156)
(20, 30] (5, 4)
(30, 45] (12, 6)
(45, 60] (15, 11)
(60, 75] (13, 7)
(75, 100] (13, 6)
dtype: object
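If you would rather have the two counts as separate columns, one possible sketch uses named aggregation. It assumes Survived only holds 0 or 1, so summing it counts the survivors; the seed and the column names `total` and `survivors` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})
df['fare_bin'] = pd.cut(df['Fare'], bins=[0, 10, 20, 30, 45, 60, 75, 100, 600])

# 'size' counts the rows in each bin; 'sum' of a 0/1 column counts the survivors
counts = (df.groupby('fare_bin', observed=False)['Survived']
            .agg(total='size', survivors='sum'))
print(counts)
```

With the counts in columns, a ratio is just a division of one column by the other, for example counts['total'] / counts['survivors'].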
Then write a quick function to get the ratio:
def get_ratio(g):
    # Ratio of all passengers to survivors in this bin; NaN if the bin has no survivors
    try:
        return float(g.shape[0]) / g.loc[g['Survived'] == 1, :].shape[0]
    except ZeroDivisionError:
        return np.nan
df.groupby('fare_bin').apply(get_ratio)
Out[30]:
fare_bin
(0, 10] 1.750000
(10, 20] 1.500000
(100, 600] 2.089744
(20, 30] 1.250000
(30, 45] 2.000000
(45, 60] 1.363636
(60, 75] 1.857143
(75, 100] 2.166667
dtype: float64
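An alternative that stays closer to the original histograms is to count both samples with np.histogram on the same edges and divide the arrays. This is only a sketch on the same kind of random sample data (the seed is arbitrary), and it computes survivors divided by all passengers, which is the inverse of the ratio above.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})

fares = df['Fare'].dropna()
surv_fares = df.loc[df['Survived'] == 1, 'Fare'].dropna()

# Count both samples against the same bin edges
all_counts, edges = np.histogram(fares, bins=bins)
surv_counts, _ = np.histogram(surv_fares, bins=bins)

# Divide bin by bin; bins with no passengers stay NaN instead of raising or giving inf
ratio = np.divide(surv_counts, all_counts,
                  out=np.full(len(all_counts), np.nan),
                  where=all_counts > 0)
print(ratio)
```

The resulting array has one entry per bin, so it can be passed to a bar plot together with the edges if you want to draw the ratio.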