Compare two pandas DataFrames of different sizes
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second, smaller one:
df2:
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: df1.C = df2.H where df1.A == df2.G
I managed to do it with loops, but the data is massive and the code is very slow, so I'm looking for a pandas or NumPy way to do it.
Many thanks,
Boris
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
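As a minimal, self-contained sketch of the map approach (the sample frames below are shortened, and the extra key 3 in df1.A is an assumption added only to show what happens to values with no match in df2.G):

```python
import pandas as pd

# Shortened sample data; the key 3 has no counterpart in df2 on purpose.
df1 = pd.DataFrame({"A": [0, 0, 1, 1, 2, 3], "B": [12, 15, 45, 78, 44, 10]})
df2 = pd.DataFrame({"G": [0, 1, 2], "H": [15, 45, 31]})

# Build a lookup Series: index = G, values = H.
# This assumes the values in G are unique.
lookup = df2.set_index("G")["H"]

# Each value of A is replaced by the matching H; unmatched keys become NaN,
# so the new column is float rather than int in this case.
df1["C"] = df1["A"].map(lookup)
print(df1)
```

If every key is guaranteed to be present, the column can be cast back with astype(int) afterwards.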
Or merge with drop and rename:
# Parentheses are needed so the chained calls can span several lines
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print(df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
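A possible variant of the merge approach, sketched on shortened sample data: renaming the columns of df2 first avoids the drop step, and the validate argument of merge makes pandas raise an error if df2 contains duplicate keys. The names right and merged are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0, 1, 1, 2, 2], "B": [12, 15, 45, 78, 44, 10]})
df2 = pd.DataFrame({"G": [0, 1, 2], "H": [15, 45, 31]})

# Rename up front so both frames share the key name "A"
# and the value column already has its final name "C".
right = df2.rename(columns={"G": "A", "H": "C"})

# how="left" keeps every row of df1 in its original order;
# validate="many_to_one" checks that each key appears at most once in right.
merged = df1.merge(right, on="A", how="left", validate="many_to_one")
print(merged)
```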
Here's one vectorized NumPy approach:
import numpy as np
# Requires df2.G to be sorted and to contain every value found in df1.A
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could also be computed more simply with df2.G.searchsorted(df1.A), but I don't think it would be more efficient, because we want to work on the underlying arrays via .values for performance, as done above.
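The searchsorted lookup only gives correct results when df2.G is sorted and every value of df1.A occurs in it. A hedged sketch of how those two assumptions could be relaxed follows; the unsorted df2 and the unmatched key 5 are made-up sample data, and the helper names order, keys, vals and found are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0, 1, 2, 2, 5], "B": [12, 15, 45, 78, 44, 10]})
df2 = pd.DataFrame({"G": [2, 0, 1], "H": [31, 15, 45]})  # deliberately unsorted

# Sort the keys once and reorder the values the same way.
order = np.argsort(df2["G"].to_numpy())
keys = df2["G"].to_numpy()[order]
vals = df2["H"].to_numpy()[order]

a = df1["A"].to_numpy()
idx = np.searchsorted(keys, a)
# Keys larger than every entry of df2.G would index past the end, so clip.
idx = np.clip(idx, 0, len(keys) - 1)

# A position is a real match only if the key found there equals the query.
found = keys[idx] == a
df1["C"] = np.where(found, vals[idx], np.nan)
print(df1)
```

The extra sort and equality check cost some time, so on data that is already known to be sorted and complete the shorter two-line version remains the faster option.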