How do I plot stacked and normalized histograms?

I have a dataset that maps continuous values to discrete categories. I want to display a bar chart with the continuous values on the x-axis and the categories on the y-axis, where the bars are stacked and normalized. Example:

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

df = pd.DataFrame({ 
        'score' : np.random.rand(1000), 
        'category' : np.random.choice(list('ABCD'), 1000) 
    },
    columns=['score', 'category'])

print(df.head(10))

      

Output:

      score category
0  0.649371        B
1  0.042309        B
2  0.689487        A
3  0.433064        B
4  0.978859        A
5  0.789140        C
6  0.215758        D
7  0.922389        B
8  0.105364        D
9  0.010274        C

      

If I try to plot this using df.hist(by='category'), I get 4 separate plots, one per category:

[Image: hist_by_category]

I managed to get the graph I wanted, but I had to manipulate the data a lot.

# One column per category, 1 if maps to category, 0 otherwise
df2 = pd.DataFrame({
        'score' : df.score,
        'A' : (df.category == 'A').astype(float),
        'B' : (df.category == 'B').astype(float),
        'C' : (df.category == 'C').astype(float),
        'D' : (df.category == 'D').astype(float)
    },
    columns=['score', 'A', 'B', 'C', 'D'])

# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])

# Sum over series for weights
df4 = df3.sum(1)

bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))

bars.plot.bar(stacked=True)

      

[Image: stacked & normalized]

I expect there is a simpler way to do this: easier to read and understand, and with fewer intermediate steps. Any suggestions?


2 answers


I don't know if this is actually much more compact or readable than what you already have, but here is a suggestion (a late one, at that :)).



import numpy as np
import pandas as pd

df = pd.DataFrame({ 
        'score' : np.random.rand(1000), 
        'category' : np.random.choice(list('ABCD'), 1000) 
    }, columns=['score', 'category'])

# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)

# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size() 
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0,level=0)

# Plot
df_a.unstack().plot.bar(stacked=True)
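To see what the normalization step produces, it may help to run it on a tiny hand-made dataset. The sketch below is only an illustration: the scores, the two bins, and the names `bins`, `counts`, and `shares` are made up for the example, and it divides the unstacked counts by their row sums rather than using `div(..., level=0)`, which should give the same shares.

```python
import pandas as pd

# Hypothetical data: six scores falling into two bins
df = pd.DataFrame({
    'score': [0.05, 0.07, 0.12, 0.15, 0.18, 0.19],
    'category': ['A', 'B', 'A', 'A', 'A', 'B'],
})

# Two bins only: (0.0, 0.1] and (0.1, 0.2]
bins = pd.cut(df['score'], [0.0, 0.1, 0.2]).rename('bin')

# Count rows per (bin, category) and spread the categories into columns
counts = df.groupby([bins, 'category'], observed=False).size().unstack(fill_value=0)

# Divide each row by its total so that every bin sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Here the first bin holds one A and one B, and the second holds three A and one B, so the rows come out as 0.5/0.5 and 0.75/0.25.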

      



Try assigning bins with cut, calculating each category's share of its bin with a pair of groupby().transform calls, and then aggregating and reshaping with pivot_table:

# CREATE BIN INDICATORS
# (round the labels themselves so they read 0.0, 0.1, ..., 0.9)
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0,1.1,0.1), 
                         labels=np.arange(0,1,0.1).round(1))

# CALCULATE PCT OF CATEGORY OUT OF BINs
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
               .div(df.groupby(['plot_bins'])['score'].transform('count')))

# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
            .reset_index(drop=True))
# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
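For comparison, the counting and normalizing can probably be collapsed into a single call with pd.crosstab and its normalize='index' option. This is a different technique from the transform/pivot_table approach above, shown only as a sketch: the fixed seed and the names `rng`, `edges`, `bins`, and `table` are assumptions made for the example, and the left-closed bins (right=False) are chosen to mirror the manual binning in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is reproducible
df = pd.DataFrame({
    'score': rng.random(1000),
    'category': rng.choice(list('ABCD'), 1000),
})

# Ten left-closed bins [0.0, 0.1), ..., [0.9, 1.0), as in the question
edges = np.linspace(0, 1, 11)
bins = pd.cut(df['score'], edges, right=False)

# Count (bin, category) pairs and normalize each row (bin) to sum to 1
table = pd.crosstab(bins, df['category'], normalize='index')
print(table.round(3))
```

Calling table.plot.bar(stacked=True) on the result should then draw the stacked, normalized bars.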

      



[Image: Stacked Bar % Graph]
