How to group by a column and return the values of other columns as lists in pandas?
I am facing a problem concatenating the values of one column while keeping the corresponding values of other columns. I would like to do something similar to this: grouping strings in a list in pandas groupby
But instead, I want the list / dictionary (preferably the latter) to contain the values of multiple columns. An example for this data:
DF:
Col1 Col2 Col3
A xyz 1
A pqr 2
B xyz 2
B pqr 3
B lmn 1
C pqr 2
I want something like -
A {'xyz':1, 'pqr': 2}
B {'xyz':2, 'pqr': 3, 'lmn': 1}
C {'pqr':2}
I tried doing
df.groupby('Col1')[['Col2', 'Col3']].apply(list)
which is a variation of the solution mentioned in the linked post, but doesn't give me the result I want.
Beyond that, I would also like to convert the result to a dataframe of the form:
xyz pqr lmn
A 1 2 NaN
B 2 3 1
C NaN 2 NaN
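For reference, one way to get exactly those per-group dictionaries is to zip Col2 and Col3 inside each group. This is only a sketch using the sample data above, not taken from the answers below:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 2, 2, 3, 1, 2],
})

# Build one {Col2: Col3} dict per Col1 group
dicts = {k: dict(zip(g['Col2'], g['Col3'])) for k, g in df.groupby('Col1')}
print(dicts['B'])  # {'xyz': 2, 'pqr': 3, 'lmn': 1}
```

Iterating over the groupby object directly avoids any `apply` overhead for this small reshaping task.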
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
df = df.set_index(['Col1','Col2'])['Col3'].unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
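Both reshapes can be checked against the sample frame; assuming there are no duplicate (Col1, Col2) pairs, they should produce identical results:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 2, 2, 3, 1, 2],
})

p1 = df.pivot(index='Col1', columns='Col2', values='Col3')
p2 = df.set_index(['Col1', 'Col2'])['Col3'].unstack()

# Both give the same wide frame, with NaN for missing (Col1, Col2) pairs
assert p1.equals(p2)
print(p1.loc['B', 'lmn'])  # 1.0
```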
but if you get:
ValueError: Index contains duplicate entries, cannot reshape
it means there are duplicate (Col1, Col2) pairs, so you need pivot_table, or groupby with an aggregation such as mean (which can be changed to sum or median), followed by unstack:
print (df)
Col1 Col2 Col3
0 A xyz 1 <-same A, xyz
1 A xyz 5 <-same A, xyz
2 A pqr 2
3 B xyz 2
4 B pqr 3
5 B lmn 1
6 C pqr 2
df = df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 3.0 (1+5)/2 = 3
B 1.0 3.0 2.0
C NaN 2.0 NaN
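Since pivot_table defaults to mean aggregation, on the duplicated data it should match the groupby route. A sketch, assuming the duplicated sample above:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 5, 2, 2, 3, 1, 2],
})

g = df.groupby(['Col1', 'Col2'])['Col3'].mean().unstack()
pt = df.pivot_table(index='Col1', columns='Col2', values='Col3', aggfunc='mean')

# Same aggregated wide frame either way
assert g.equals(pt)
print(g.loc['A', 'xyz'])  # (1 + 5) / 2 = 3.0
```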
EDIT:
To check all duplicated (Col1, Col2) pairs:
print (df[df.duplicated(subset=['Col1','Col2'], keep=False)])
Col1 Col2 Col3
0 A xyz 1
1 A xyz 5
EDIT1:
If only the first row per pair is needed when there are duplicates:
df = df.groupby(['Col1','Col2'])['Col3'].first().unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
Or, better, remove the duplicates first with drop_duplicates and then use the first or second solution:
df = df.drop_duplicates(subset=['Col1','Col2'])
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
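As a quick check, dropping duplicates and pivoting should agree with the groupby(...).first().unstack() route, since drop_duplicates keeps the first row per pair by default. A sketch on the duplicated sample:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 5, 2, 2, 3, 1, 2],
})

a = (df.drop_duplicates(subset=['Col1', 'Col2'])
       .pivot(index='Col1', columns='Col2', values='Col3'))
b = df.groupby(['Col1', 'Col2'])['Col3'].first().unstack()

# Both keep the first Col3 per (Col1, Col2) pair
assert a.equals(b)
print(a.loc['A', 'xyz'])  # 1.0
```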
Neither of these is a pure-pandas solution. I've included them because I find exploring alternatives fun. The bincount-based solution is very fast, but less transparent.
Creative solution 1: collections.defaultdict with setdefault
from collections import defaultdict
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)
lmn pqr xyz
A NaN 2 1.0
B 1.0 3 2.0
C NaN 2 NaN
Creative solution 2: pd.factorize and np.bincount
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values
n, m = u1.size, u2.size
v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)
lmn pqr xyz
A NaN 2 1.0
B 1.0 3 2.0
C NaN 2 NaN
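One caveat worth noting: the v == 0 mask cannot distinguish a missing (Col1, Col2) pair from a genuine Col3 value of 0. A small sketch with hypothetical data (note the flat index is f1 * m + f2, row-major over the n x m grid, so it also holds when n != m):

```python
import numpy as np
import pandas as pd

# 'A' has a genuine Col3 value of 0
df = pd.DataFrame({'Col1': ['A', 'B'], 'Col2': ['x', 'x'], 'Col3': [0, 5]})

f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
n, m = u1.size, u2.size
v = np.bincount(f1 * m + f2, df.Col3.values, n * m).reshape(n, m)
out = pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)

# The genuine 0 for ('A', 'x') is masked and surfaces as NaN
print(out)
```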
Timing
%timeit df.pivot(index='Col1',columns='Col2',values='Col3')
%timeit df.set_index(['Col1','Col2'])['Col3'].unstack()
%timeit df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
%timeit df.pivot_table(index='Col1',columns='Col2',values='Col3')
%%timeit
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)
%%timeit
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values
n, m = u1.size, u2.size
v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)
small data
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.67 ms per loop
1000 loops, best of 3: 1.51 ms per loop
100 loops, best of 3: 4.17 ms per loop
1000 loops, best of 3: 1.18 ms per loop
1000 loops, best of 3: 420 µs per loop
medium-sized data
from string import ascii_letters
l = list(ascii_letters)
df = pd.DataFrame(dict(
Col1=np.random.choice(l, 10000),
Col2=np.random.choice(l, 10000),
Col3=np.random.randint(10, size=10000)
)).drop_duplicates(['Col1', 'Col2'])
1000 loops, best of 3: 1.75 ms per loop
100 loops, best of 3: 2.17 ms per loop
100 loops, best of 3: 2.2 ms per loop
100 loops, best of 3: 4.89 ms per loop
100 loops, best of 3: 5.6 ms per loop
1000 loops, best of 3: 549 µs per loop