Group in pandas by filling in missing groups with []
Any help on a more accurate title of this question is appreciated.
I have a pandas DataFrame
with customer-level observations that record the date and the items consumed by the customer on that date. It looks like this:
df
store day items
a 1 4
a 1 3
a 2 1
a 3 5
a 4 2
a 5 9
b 1 1
b 2 3
Each observation in this dataset records a visit to a store on a given day, and an observation only exists when a positive number of items was consumed, i.e. df['items'] > 0
for every store-day pair that appears in the data.
So I don't have, for example,
b 3 0
b 4 0
b 5 0
and so on.
I need to group this DataFrame by store
and day
and then run some operations on the observations in each store-day group.
But I want the missing groups to exist as well, with length 0 (empty groups), and I'm not sure of the best way to get them. This is a very basic toy dataset; the real one is very big.
I really don't want to add filler observations BEFORE calling df.groupby(['store', 'day'])
because I am running OTHER calculations on each store-day group that use the length of each group as the number of customers recorded at a specific store on a specific day. So if I add observations for (b, 3)
and (b, 4),
it looks like there were 2 customers who visited store b on days 3 and 4, when in fact nobody bought anything from store b on those days.
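To make the problem concrete, here is a minimal sketch (rebuilding the toy data above): after grouping, the empty store-day combinations simply do not exist in the result.

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})

# Grouping by store and day only yields the pairs that actually occur;
# (b, 3), (b, 4) and (b, 5) are simply absent from the result.
sizes = df.groupby(['store', 'day']).size()
print(('b', 3) in sizes.index)   # False
```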
In case someone else, like me, comes looking for an answer to this question, try:
pd.crosstab(df.store, df.day, margins=False)
This gives you a df with store as the index and day as the columns. You can then do something like:
df.reset_index(level=0, inplace=True)
to convert the index back into a column, and if the columns are a MultiIndex, something like:
df.columns = [''.join(col).strip() for col in df.columns.values]
to get a "flat" df.
You can also do:
pd.crosstab([df.store, df.day, ...], [df.store, df.day, ...], margins=False)
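For the toy data above, the crosstab looks like this (the final `stack()` step is my addition, not part of the answer, as one way to get back one row per store-day pair with zeros included):

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})

# Count of observations per store-day; missing combinations become 0,
# which is exactly the "zero customers" information the question wants.
ct = pd.crosstab(df.store, df.day, margins=False)
print(ct)
# day    1  2  3  4  5
# store
# a      2  1  1  1  1
# b      1  1  0  0  0

# Stacking turns it back into one row per (store, day) pair, zeros included
counts = ct.stack()
print(counts[('b', 3)])   # 0
```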
The pandas way of representing this would probably be to encode it as missing data, e.g.:
In [562]: df
Out[562]:
store day items
0 a 1 4
1 a 1 3
2 a 2 1
3 a 3 5
4 a 4 2
5 a 5 9
6 b 1 1
7 b 2 3
8 b 3 NaN
9 b 4 NaN
Then, in your aggregate for counting customers, you can use count, which excludes missing values, for example:
In [565]: df.groupby('store')['items'].count()
Out[565]:
store
a 6
b 2
Name: items, dtype: int64
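A runnable sketch of this NaN-based approach, with the two filler rows added by hand (an assumption about how the data would be prepared):

```python
import numpy as np
import pandas as pd

# Toy data from the question, plus explicit NaN rows for the missing pairs
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2, 3, 4],
    'items': [4, 3, 1, 5, 2, 9, 1, 3, np.nan, np.nan],
})

# count() skips NaN, so the per-store customer totals are unchanged...
totals = df.groupby('store')['items'].count()
print(totals)   # a: 6, b: 2

# ...while the (b, 3) and (b, 4) groups now exist, with a customer count of 0
per_day = df.groupby(['store', 'day'])['items'].count()
print(per_day[('b', 3)])   # 0
```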
EDIT:
In terms of adding the missing values, here are a couple of thoughts. Say you have a DataFrame containing only the missing pairs, for example:
In [571]: df_missing
Out[571]:
store day
8 b 3
9 b 4
Then you can simply append them to the existing DataFrame to fill in the missing pairs, for example:
In [574]: pd.concat([df, df_missing], ignore_index=True)
Out[574]:
day items store
0 1 4 a
1 1 3 a
2 2 1 a
3 3 5 a
4 4 2 a
5 5 9 a
6 1 1 b
7 2 3 b
8 3 NaN b
9 4 NaN b
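The concat step above, as a self-contained sketch (the toy data is rebuilt inline):

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})

# The missing pairs, with no 'items' column at all
df_missing = pd.DataFrame({'store': ['b', 'b'], 'day': [3, 4]})

# concat aligns on column names, so 'items' becomes NaN for the new rows
filled = pd.concat([df, df_missing], ignore_index=True)
print(filled.tail(2))
```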
Alternatively, if you have a DataFrame with all the pairs that should exist (a 1-5, b 1-4), you can merge it with the data to fill in the missing ones. For example:
In [577]: df_pairs
Out[577]:
store day
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 1
6 b 2
7 b 3
8 b 4
In [578]: df_pairs.merge(df, how='left')
Out[578]:
store day items
0 a 1 4
1 a 1 3
2 a 2 1
3 a 3 5
4 a 4 2
5 a 5 9
6 b 1 1
7 b 2 3
8 b 3 NaN
9 b 4 NaN
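Typing out the pairs by hand does not scale to the real, very big dataset, so one way to generate them is `MultiIndex.from_product` (my addition; note this builds the full store × day cross-product, so it also adds (b, 5), which the hand-built df_pairs above leaves out):

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})

# Full store x day grid, generated instead of typed out
full = pd.MultiIndex.from_product(
    [df.store.unique(), range(df.day.min(), df.day.max() + 1)],
    names=['store', 'day'],
)
df_pairs = full.to_frame(index=False)

# Left merge keeps every pair; unmatched ones get NaN items
merged = df_pairs.merge(df, how='left')

# Customer count per store-day: count() skips the NaN filler rows,
# so the empty days come out as 0
counts = merged.groupby(['store', 'day'])['items'].count()
print(counts)
```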