Groupby and count the number of unique values (Pandas)
I have a dataframe with two columns, ID and outcome. First I tried to group by ID and count the number of unique outcome values within each ID.
df
ID outcome
1 yes
1 yes
1 yes
2 no
2 yes
2 no
Expected Result:
ID yes no
1 3 0
2 1 2
My code df[['ID', 'outcome']].groupby('ID')['outcome'].nunique()
returns the number of unique outcome values per ID, for example:
ID
1 2
2 2
But I need separate counts for yes and no. How can I achieve this? Thanks!
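For reference, the sample frame above can be reproduced like this (a minimal sketch; the column names are taken from the question):

import pandas as pd

# Reconstruct the example dataframe shown in the question
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no'],
})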
How about pd.crosstab?
In [1217]: pd.crosstab(df.ID, df.outcome)
Out[1217]:
outcome no yes
ID
1 0 3
2 2 1
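If the columns should come out in the yes/no order shown in the expected result, you can reorder the crosstab afterwards (a small follow-up sketch):

counts = pd.crosstab(df.ID, df.outcome)
counts = counts[['yes', 'no']]  # put yes before no, as in the expected output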
Option 2: pd.factorize + np.bincount
It's confusing and painful... but very fast.
import numpy as np
import pandas as pd

fi, ui = pd.factorize(df.ID.values)
fo, uo = pd.factorize(df.outcome.values)
n, m = ui.size, uo.size
pd.DataFrame(
    np.bincount(fi * m + fo, minlength=n * m).reshape(n, m),
    pd.Index(ui, name='ID'), pd.Index(uo, name='outcome')
)
outcome yes no
ID
1 3 0
2 1 2
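The recipe can be packaged as a reusable helper; fast_crosstab here is a hypothetical name, not part of pandas (sketch only):

import numpy as np
import pandas as pd

def fast_crosstab(ids, outcomes):
    # Encode both columns as integer codes, then count each (id, outcome)
    # pair with a single bincount over the combined code.
    fi, ui = pd.factorize(ids)
    fo, uo = pd.factorize(outcomes)
    n, m = ui.size, uo.size
    counts = np.bincount(fi * m + fo, minlength=n * m).reshape(n, m)
    return pd.DataFrame(counts,
                        index=pd.Index(ui, name='ID'),
                        columns=pd.Index(uo, name='outcome'))

fast_crosstab(df.ID.values, df.outcome.values)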
Option 3: pd.get_dummies
pd.get_dummies(df.ID).T.dot(pd.get_dummies(df.outcome))
no yes
1 0 3
2 2 1
Option 4
df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
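To match the exact layout asked for in the question (ID back as a column, yes before no), the result can be reshuffled; this is a sketch, not the only way:

result = df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
result = result[['yes', 'no']].reset_index()  # reorder columns and restore ID as a column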
Group by the ID column, then aggregate the outcome column with value_counts. This returns a Series, so convert it back to a DataFrame with .to_frame() so that you can unstack yes/no (i.e. have them as columns). Then fill the missing values with zero.
df_total = df.groupby('ID')['outcome'].value_counts().to_frame().unstack(fill_value=0)
df_total.columns = df_total.columns.droplevel()
>>> df_total
outcome no yes
ID
1 0 3
2 2 1
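As a possible simplification, value_counts() already returns a Series with an (ID, outcome) MultiIndex, so it can be unstacked directly, skipping the to_frame()/droplevel() round trip (sketch):

df_total = df.groupby('ID')['outcome'].value_counts().unstack(fill_value=0)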