Building a pandas dataframe from rows that do not exist in another dataframe
I have a pandas dataframe df:
store day items
a 1 4
a 1 3
a 2 1
a 3 5
a 4 2
a 5 9
b 1 1
b 2 3
I also have a pandas dataframe temp, which is the Cartesian product of all unique store and day values, that is, it looks like this:
store day
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 1
6 b 2
7 b 3
8 b 4
9 b 5
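For readers who want to reproduce the setup, here is a minimal sketch of how a grid like temp could be built, pairing every unique store with every unique day. The question does not show how temp was actually created, so the use of pd.MultiIndex.from_product and the variable name grid are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})

# Pair every unique store with every unique day (assumed construction)
grid = pd.MultiIndex.from_product(
    [df["store"].unique(), df["day"].unique()],
    names=["store", "day"],
)

# Turn the index into an ordinary two-column dataframe
temp = grid.to_frame(index=False)
print(temp)
```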
I want to create a new dataframe containing the observations that are missing from df, that is, the store-day combinations that are present in temp but not in df.
Desired output:
store day
b 3
b 4
b 5
My solution merges the two dataframes and uses items as a marker column: for the rows we want, it will be NaN. I believe that for large dataframes this will be more efficient than the alternative using isin. If items weren't there, I would add a marker column to df.
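If df had no spare column such as items to act as the marker, one option is to let the merge produce the marker itself. The sketch below is not part of the original answer: it uses the indicator=True parameter of DataFrame.merge, which adds a _merge column, and it rebuilds the sample data so it can run on its own. The name missing is chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})
tmp = pd.DataFrame({
    "store": ["a"] * 5 + ["b"] * 5,
    "day": list(range(1, 6)) * 2,
})

# Only the key columns of df are needed; dropping duplicate keys
# keeps one output row per row of tmp
keys = df[["store", "day"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
out = tmp.merge(keys, on=["store", "day"], how="left", indicator=True)

# Keep the combinations found only in tmp and drop the marker
missing = out.loc[out["_merge"] == "left_only", ["store", "day"]]
print(missing)
```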
So, merge first. It is important to specify how='left' so that we keep the rows from tmp (called temp in the question) that are not in df:
out = tmp.merge(df, on=['store', 'day'], how='left')
In [23]: out
Out[23]:
store day items
0 a 1 4
1 a 1 3
2 a 2 1
3 a 3 5
4 a 4 2
5 a 5 9
6 b 1 1
7 b 2 3
8 b 3 NaN
9 b 4 NaN
10 b 5 NaN
You can see that the rows we want have NaN in the items column, since they come only from tmp. Now keep those rows and get rid of the marker column.
out[out['items'].isnull()].drop(['items'], axis = 1)
store day
8 b 3
9 b 4
10 b 5
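For comparison, the isin alternative mentioned earlier could look roughly like the sketch below. It is an illustration rather than part of the original answer: it turns the key columns of each dataframe into a MultiIndex and tests membership, and it makes no claim about which approach is faster. The names tmp_idx, df_idx and missing are chosen here.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})
tmp = pd.DataFrame({
    "store": ["a"] * 5 + ["b"] * 5,
    "day": list(range(1, 6)) * 2,
})

keys = ["store", "day"]

# Represent each (store, day) pair as one entry of a MultiIndex
tmp_idx = pd.MultiIndex.from_frame(tmp[keys])
df_idx = pd.MultiIndex.from_frame(df[keys])

# Boolean mask: True where the pair from tmp also occurs in df
present = tmp_idx.isin(df_idx)

# Keep only the pairs that never occur in df
missing = tmp[~present]
print(missing)
```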