Grouping in pandas while keeping missing groups as empty (length-0) groups

Any help with a more accurate title for this question is appreciated.

I have a pandas DataFrame with client-level observations that record the date and the items consumed by the client on that date. It looks like this:

df
store    day   items
 a        1     4
 a        1     3
 a        2     1
 a        3     5
 a        4     2
 a        5     9
 b        1     1
 b        2     3

Each observation in this dataset records items consumed at a given store on a given day, but an observation only appears when a positive number of items was consumed, i.e. df['items'] > 0 for every store-day row actually present in the data.

So I don't have rows like, for example:

b         3      0
b         4      0
b         5      0

and so on.

I need to group this DataFrame by store and day, and then run some operations over the full set of observations within each store-day group.

But I want the missing groups to exist too, with length 0 (empty groups), and I'm not sure of the best way to achieve that. This is a very basic toy dataset; the real one is very big.

I really don't want to add the missing observations BEFORE calling df.groupby(['store', 'day']), because I run OTHER calculations on each store-day group that use the length of each group as a measure of the number of customers recorded at a given store on a given day. So if I add observations for b/3 and b/4, it will look as though two customers visited store b on days 3 and 4, when in fact nobody bought anything from store b on those days.
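A minimal sketch of one way to get genuinely empty groups without adding placeholder rows (an editorial aside, not from the original post, assuming stores 'a'/'b' and days 1-5): make the grouping keys categorical, so that groupby(..., observed=False) emits every (store, day) combination, including the empty ones.

import pandas as pd

df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})

# Declare the full universe of keys as categories; unobserved combinations
# then survive as empty groups instead of disappearing.
df['store'] = pd.Categorical(df['store'], categories=['a', 'b'])
df['day'] = pd.Categorical(df['day'], categories=[1, 2, 3, 4, 5])

sizes = df.groupby(['store', 'day'], observed=False).size()
print(sizes[('b', 3)])  # 0 -- an empty group, not a phantom customer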



3 answers


You may already have found an answer to your question, but in case someone else, like me, comes looking for one, try:

pd.crosstab(df.store, df.day, margins=False)
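For the toy frame above, this yields the row count per store-day pair, with zeros where a pair never occurs (note that store b picks up a zero for day 5 as well, since the columns are the union of all observed days):

day    1  2  3  4  5
store
a      2  1  1  1  1
b      1  1  0  0  0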

This gives you a df with store as the index and day as the columns. You can then do something like:

df.reset_index(level=0, inplace=True)

to convert the index into a column, and if you end up with a MultiIndex on the columns, something like:



df.columns = [''.join(col).strip() for col in df.columns.values]

to get a "flat" df.

You can also cross-tabulate combinations of columns on either axis:

pd.crosstab([df.store, df.day.....], [df.store, df.day.....], margins=False)



The pandas way of representing this would probably be to encode it as missing data, e.g.:

In [562]: df
Out[562]: 
  store  day  items
0     a    1      4
1     a    1      3
2     a    2      1
3     a    3      5
4     a    4      2
5     a    5      9
6     b    1      1
7     b    2      3
8     b    3    NaN
9     b    4    NaN

      

Then, in your aggregate for counting customers, you can use count

that excludes missing values, for example:

In [565]: df.groupby('store')['items'].count()
Out[565]: 
store
a        6
b        2
Name: items, dtype: int64

EDIT:

As for adding the missing values, here are a couple of thoughts. Say you have a DataFrame containing only the missing pairs, for example:



In [571]: df_missing
Out[571]: 
  store  day
8     b    3
9     b    4
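If you'd rather derive df_missing than type it out, one hedged sketch (assuming a global day range of 1-5; note that a global range would also flag ('b', 5) as missing, so trim per store if that's not wanted):

# df here is the original frame, without the placeholder rows.
# MultiIndex.from_frame needs a reasonably recent pandas.
full = pd.MultiIndex.from_product(
    [df['store'].unique(), range(1, 6)], names=['store', 'day']
)
observed = pd.MultiIndex.from_frame(df[['store', 'day']].drop_duplicates())
df_missing = full.difference(observed).to_frame(index=False)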


Then you can simply concatenate them onto the existing DataFrame to fill in the missing rows, for example:

In [574]: pd.concat([df, df_missing], ignore_index=True)
Out[574]: 
   day  items store
0    1      4     a
1    1      3     a
2    2      1     a
3    3      5     a
4    4      2     a
5    5      9     a
6    1      1     b
7    2      3     b
8    3    NaN     b
9    4    NaN     b

Alternatively, if you have a DataFrame holding every pair you should have (a 1-5, b 1-4), you can left-merge it with the data to fill in the missing rows. For example:

In [577]: df_pairs
Out[577]: 
  store  day
0     a    1
1     a    2
2     a    3
3     a    4
4     a    5
5     b    1
6     b    2
7     b    3
8     b    4

In [578]: df_pairs.merge(df, how='left')
Out[578]: 
  store  day  items
0     a    1      4
1     a    1      3
2     a    2      1
3     a    3      5
4     a    4      2
5     a    5      9
6     b    1      1
7     b    2      3
8     b    3    NaN
9     b    4    NaN
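A note on why this helps with the customer-count concern in the question: count() skips NaN, so grouping the merged result gives 0 for the filled-in store-days rather than inflating them to 1 (merged here is just a name for the result above):

merged = df_pairs.merge(df, how='left')
merged.groupby(['store', 'day'])['items'].count()
# ('b', 3) and ('b', 4) come out as 0: placeholder rows aren't customers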



I don't know the best way to store the empty groups, but you can create the zero counts when aggregating:

df.pivot_table(values='items', index='store', columns='day', fill_value=0, aggfunc='count')

or

df.groupby(['store', 'day']).count().unstack().fillna(0)
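Both give a wide store-by-day table of zero-filled counts; if you want one row per (store, day) pair again, a small sketch:

counts = df.groupby(['store', 'day'])['items'].count().unstack(fill_value=0)
counts.stack()  # back to long form: one zero-filled count per pair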
