How to group by column and return values of other columns as lists in pandas?

I am facing a problem concatenating the values of a column and keeping the corresponding values of other columns. I would like to do something similar to this: grouping strings in a list in pandas groupby

But instead, I want the list / dictionary (preferably the latter) to contain the values of multiple columns. An example with this data:

DF:

Col1   Col2   Col3
A      xyz     1
A      pqr     2
B      xyz     2
B      pqr     3
B      lmn     1
C      pqr     2

I want something like -

A {'xyz': 1, 'pqr': 2}
B {'xyz': 2, 'pqr': 3, 'lmn': 1}
C {'pqr': 2}

I tried doing

df.groupby('Col1')[['Col2', 'Col3']].apply(list)

which is a variation of the solution mentioned in the linked post, but doesn't give me the result I want.

From there, I would also like to convert it to a dataframe of the form:

   xyz  pqr  lmn
A    1    2  NaN
B    2    3    1
C  NaN    2  NaN

+3




3 answers


Use pivot or unstack:

df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN


Or with unstack:

df = df.set_index(['Col1','Col2'])['Col3'].unstack()
print (df)

Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicate Col1/Col2 pairs, and you need pivot_table or groupby aggregating with mean (can be changed to sum, median, ...) followed by unstack:

print (df)
  Col1 Col2  Col3
0    A  xyz     1 <-same A, xyz
1    A  xyz     5 <-same A, xyz
2    A  pqr     2
3    B  xyz     2
4    B  pqr     3
5    B  lmn     1
6    C  pqr     2

df = df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  3.0 (1+5)/2 = 3
B     1.0  3.0  2.0
C     NaN  2.0  NaN
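
For completeness, pivot_table reaches the same table in one call, since it aggregates duplicates itself (a minimal sketch; aggfunc defaults to 'mean' and also accepts 'sum', 'median', 'first', ...):

df = df.pivot_table(index='Col1',columns='Col2',values='Col3',aggfunc='mean')
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  3.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN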

      



EDIT:

To check all rows duplicated in Col1 and Col2:

print (df[df.duplicated(subset=['Col1','Col2'], keep=False)])
  Col1 Col2  Col3
0    A  xyz     1
1    A  xyz     5

      

EDIT1:

If only the first row is required when there are duplicates:

df = df.groupby(['Col1','Col2'])['Col3'].first().unstack()
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN
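
The same can be written as a pivot_table one-liner (a sketch, assuming taking the first value per pair is acceptable):

df = df.pivot_table(index='Col1',columns='Col2',values='Col3',aggfunc='first')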

      

Or better, remove the duplicates first with drop_duplicates and then use the first or second solution:

df = df.drop_duplicates(subset=['Col1','Col2'])
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

      

+1




At the end you will get a pivot table; see the pandas documentation on pivot_table for details:

df.pivot_table(index='Col1',columns='Col2',values='Col3')
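
On the sample data this returns the same table as pivot, because pivot_table defaults to aggfunc='mean' and the sample has no duplicate Col1/Col2 pairs:

Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN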

+1




Neither of these is a pandas-only solution. I've provided them because I find exploring alternatives fun. The bincount solution is very fast, but less transparent.

Creative solution 1
collections.defaultdict and a comprehension over itertuples

from collections import defaultdict
import pandas as pd

# build {Col2: {Col1: Col3}} row by row; setdefault keeps the
# first value if a (Col1, Col2) pair repeats
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)

   lmn  pqr  xyz
A  NaN    2  1.0
B  1.0    3  2.0
C  NaN    2  NaN
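
The defaultdict above is keyed by Col2. If you also want the exact dict-of-dicts from the question, keyed by Col1, a plain comprehension over groupby does it (a minimal sketch):

{k: dict(zip(g.Col2, g.Col3)) for k, g in df.groupby('Col1')}

{'A': {'xyz': 1, 'pqr': 2}, 'B': {'xyz': 2, 'pqr': 3, 'lmn': 1}, 'C': {'pqr': 2}}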

      


Creative solution 2
pd.factorize and np.bincount

import numpy as np

# integer-encode the labels; u1/u2 hold the unique values in order
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values

n, m = u1.size, u2.size

# sum Col3 into a flat n*m grid; cell (i, j) of a row-major (n, m)
# array has flat index i * m + j
v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
# mask empty cells (this would also hide legitimate zeros in Col3)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)

   lmn  pqr  xyz
A  NaN    2  1.0
B  1.0    3  2.0
C  NaN    2  NaN

      


Timing

%timeit df.pivot(index='Col1',columns='Col2',values='Col3')
%timeit df.set_index(['Col1','Col2'])['Col3'].unstack()
%timeit df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
%timeit df.pivot_table(index='Col1',columns='Col2',values='Col3')

%%timeit
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)

%%timeit
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values

n, m = u1.size, u2.size

v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)

      

small data
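
Presumably these numbers were measured on the six-row frame from the question, i.e. a setup like:

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'Col2': ['xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
                   'Col3': [1, 2, 2, 3, 1, 2]})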

1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.67 ms per loop
1000 loops, best of 3: 1.51 ms per loop
100 loops, best of 3: 4.17 ms per loop

1000 loops, best of 3: 1.18 ms per loop

1000 loops, best of 3: 420 µs per loop

      

medium data

from string import ascii_letters
l = list(ascii_letters)
# drop duplicate (Col1, Col2) pairs so that pivot does not raise
df = pd.DataFrame(dict(
        Col1=np.random.choice(l, 10000),
        Col2=np.random.choice(l, 10000),
        Col3=np.random.randint(10, size=10000)
    )).drop_duplicates(['Col1', 'Col2'])

1000 loops, best of 3: 1.75 ms per loop
100 loops, best of 3: 2.17 ms per loop
100 loops, best of 3: 2.2 ms per loop
100 loops, best of 3: 4.89 ms per loop

100 loops, best of 3: 5.6 ms per loop

1000 loops, best of 3: 549 µs per loop

      

+1








