Building a pandas DataFrame from rows that do not exist in another DataFrame

I have a pandas DataFrame df:
store    day   items
 a        1     4
 a        1     3
 a        2     1
 a        3     5
 a        4     2 
 a        5     9
 b        1     1 
 b        2     3

      

I also have a pandas DataFrame temp, which is the Cartesian product of all unique store and day values. It looks like this:

    store  day  
0     a    1     
1     a    2      
2     a    3      
3     a    4      
4     a    5      
5     b    1      
6     b    2      
7     b    3    
8     b    4    
9     b    5    
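For anyone who wants to reproduce the setup, here is a minimal sketch of the two frames. The question does not say how temp was built; using `pd.MultiIndex.from_product` over the unique store and day values is an assumption that happens to produce the same table.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})

# Assumed construction of temp: every unique store paired with every unique day.
full_index = pd.MultiIndex.from_product(
    [df["store"].unique(), df["day"].unique()],
    names=["store", "day"],
)
temp = full_index.to_frame(index=False)  # plain columns, default integer index
```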

      

I want to create a new DataFrame containing the observations missing from df, that is, the store-day combinations that are not present in df but are present in temp.

Desired output:


store    day
b         3      
b         4       
b         5      

      



3 answers


This is one way:



gcols = ['store', 'day']
# keep the rows of temp whose (store, day) pair does not occur in df
temp[~temp.set_index(gcols).index.isin(df.set_index(gcols).index)]
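The same idea can be sketched end to end with the question's data. Note that the key column there is called day, and the frame of all combinations is called temp; the names `present` and `missing` are introduced here only for illustration.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})
temp = pd.DataFrame({
    "store": ["a"] * 5 + ["b"] * 5,
    "day":   [1, 2, 3, 4, 5] * 2,
})

gcols = ["store", "day"]

# Boolean array: True where temp's (store, day) pair also occurs in df.
present = temp.set_index(gcols).index.isin(df.set_index(gcols).index)

# Invert the mask to keep only the pairs that df lacks.
missing = temp[~present]
print(missing)
```

Here `missing` keeps temp's original row labels, so call `reset_index(drop=True)` on it if a fresh index is wanted.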

      



My solution merges the two DataFrames and uses items as a marker column: for the rows we want, it will be NaN. I believe that for large DataFrames this will be more efficient than the alternative using isin. If the items column weren't there, I would add a marker column to df.

So, merge first. It is important to specify how = 'left' so that we keep the rows from tmp (called temp in the question) that are not in df:

out = tmp.merge(df, on= ['store', 'day'], how = 'left')

In [23]: out
Out[23]: 
   store  day  items
0      a    1    4.0
1      a    1    3.0
2      a    2    1.0
3      a    3    5.0
4      a    4    2.0
5      a    5    9.0
6      b    1    1.0
7      b    2    3.0
8      b    3    NaN
9      b    4    NaN
10     b    5    NaN

      



You can see that the rows we want have NaN in their items column, since they came only from tmp. Now keep just those rows and get rid of the marker column.

out[out['items'].isnull()].drop(['items'], axis = 1)

   store  day
8      b    3
9      b    4
10     b    5
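If df had no column like items to act as a marker, or if items could itself contain NaN, one option is to let the merge produce the marker. The sketch below uses the `indicator=True` argument of `DataFrame.merge`, which adds a `_merge` column. It is a variation on the approach above rather than what was originally written, and the names `keys`, `merged` and `missing` are illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})
temp = pd.DataFrame({
    "store": ["a"] * 5 + ["b"] * 5,
    "day":   [1, 2, 3, 4, 5] * 2,
})

keys = ["store", "day"]

# Merge on the key columns only; dropping duplicate keys from df first
# keeps one row per temp row in the result.
merged = temp.merge(df[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# "_merge" is "left_only" for rows found in temp but not in df.
missing = merged.loc[merged["_merge"] == "left_only", keys]
print(missing)
```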

      



newDF = pd.merge(df, temp, how='right', on=['store', 'day'])

newDF[newDF.isnull().any(axis=1)]
