Count occurrences of specific values in a data frame, where all possible values are specified by a list

I have two categories, A and B, that can take 5 different states (values, names or categories) defined by the list abcde. Counting the occurrences of each state and storing them in a data frame is fairly straightforward. However, I would also like the resulting data frame to include zeros for the possible values that do not occur in category A or B.

First, here's a data frame that matches the description:

In [1]:

import pandas as pd
possibleValues = list('abcde')
df = pd.DataFrame({'Category A':list('abbc'), 'Category B':list('abcc')})
print(df)

      

Out [1]:

        Category A      Category B
0       a               a
1       b               b
2       b               c
3       c               c

      

I have tried different approaches with df.groupby(...).size() and .count(), combined with the list of possible values and the category names in a list, with no success.

Here's the desired output:

        Category A      Category B
a       1               1
b       2               1
c       1               2
d       0               0
e       0               0

      

To take it one step further, I would also like to include a column with totals for each possible state across all categories:

        Category A      Category B      Total
a       1               1               2
b       2               1               3
c       1               2               3
d       0               0               0
e       0               0               0

      

SO has many related questions and answers, but as far as I know, none of them offer a solution to this specific problem. Thanks for any suggestions!

P.S. I would like the solution to adapt to the number of categories, the possible values, and the number of rows.



1 answer


You need apply + value_counts + reindex + sum:

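cols = ['Category A','Category B']
# Count the values in each column; a value absent from one column comes out as NaN
counts = df[cols].apply(pd.Series.value_counts)
# Add rows for values that never occur, then turn every NaN into an integer 0
df1 = counts.reindex(possibleValues).fillna(0).astype(int)
df1['total'] = df1.sum(axis=1)
print(df1)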
   Category A  Category B  total
a           1           1      2
b           2           1      3
c           1           2      3
d           0           0      0
e           0           0      0
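
One thing to watch when the data changes: reindex(..., fill_value=0) only fills rows that reindex itself creates. A value that occurs in one column but not in another already has a row after apply, with NaN in the column where it is absent, so a fillna(0) is needed to turn those cells into zeros. A minimal sketch with a hypothetical frame df2 (not from the question):

```python
import pandas as pd

possibleValues = list('abcde')
# Hypothetical data: 'a' occurs only in Category A, 'c' only in Category B.
df2 = pd.DataFrame({'Category A': list('aab'), 'Category B': list('bcc')})

# Per-column counts; 'a' and 'c' each get NaN in the column where they are absent.
counts = df2.apply(pd.Series.value_counts)

# reindex adds the rows for 'd' and 'e'; fillna(0) covers every NaN cell,
# and astype(int) restores integer counts after the float NaN detour.
df1 = counts.reindex(possibleValues).fillna(0).astype(int)
df1['total'] = df1.sum(axis=1)
print(df1)
```

With the frame from the question this makes no difference, because a, b and c all appear in both columns.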

      



Another solution is to convert the columns to categorical; the 0 values are then added without reindex:

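cols = ['Category A','Category B']
# Build a categorical dtype that lists every possible value
cat_type = pd.CategoricalDtype(categories=possibleValues)
# sort=False keeps the rows in the order of possibleValues
df1 = df[cols].apply(lambda x: x.astype(cat_type).value_counts(sort=False))
df1['total'] = df1.sum(axis=1)
print(df1)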
   Category A  Category B  total
a           1           1      2
b           2           1      3
c           1           2      3
d           0           0      0
e           0           0      0
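
To address the P.S. in the question, the categorical approach can be wrapped in a small helper that works for any number of columns, possible values and rows. This is only a sketch: the function name count_states and the third column 'Category C' are made up for illustration, and it assumes every column should be checked against the same list of possible values.

```python
import pandas as pd

def count_states(df, possible_values, cols=None):
    """Count each possible value per column, keeping zeros, and add a total."""
    cols = list(df.columns) if cols is None else list(cols)
    cat_type = pd.CategoricalDtype(categories=possible_values)
    # sort=False keeps the rows in the order of possible_values
    counts = df[cols].apply(lambda s: s.astype(cat_type).value_counts(sort=False))
    counts['total'] = counts.sum(axis=1)
    return counts

df = pd.DataFrame({'Category A': list('abbc'),
                   'Category B': list('abcc'),
                   'Category C': list('eeee')})  # hypothetical third category
result = count_states(df, list('abcde'))
print(result)
```

Values that are not in possible_values become NaN when cast to the categorical dtype, so they are left out of the counts rather than raising an error.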

      
