Pandas - conditionally merge data with multiple columns

I have 2 dataframes, and I want to take one of the columns from the first and create a new column in the second, based on the values in several other columns.

First data frame (df1):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'cond': np.repeat([1,2], 5),
                    'point': np.tile(np.arange(1,6), 2),
                    'value1': np.random.rand(10),
                    'unused1': np.random.rand(10)})

   cond  point   unused1    value1
0     1      1  0.923699  0.103046
1     1      2  0.046528  0.188408
2     1      3  0.677052  0.481349
3     1      4  0.464000  0.807454
4     1      5  0.180575  0.962032
5     2      1  0.941624  0.437961
6     2      2  0.489738  0.026166
7     2      3  0.739453  0.109630
8     2      4  0.338997  0.415101
9     2      5  0.310235  0.660748

      

and the second (df2):

df2 = pd.DataFrame({'cond': np.repeat([1,2], 10),
                    'point': np.tile(np.arange(1,6), 4),
                    'value2': np.random.rand(20)})

    cond  point    value2
0      1      1  0.990252
1      1      2  0.534813
2      1      3  0.407325
3      1      4  0.969288
4      1      5  0.085832
5      1      1  0.922026
6      1      2  0.567615
7      1      3  0.174402
8      1      4  0.469556
9      1      5  0.511182
10     2      1  0.219902
11     2      2  0.761498
12     2      3  0.406981
13     2      4  0.551322
14     2      5  0.727761
15     2      1  0.075048
16     2      2  0.159903
17     2      3  0.726013
18     2      4  0.848213
19     2      5  0.284404

      

df1['value1'] contains a value for each combination of cond and point.

I want to create a new column (new_column) in df2 that contains the values from df1['value1'], placed in the rows where cond and point match between the two dataframes.

So my desired output looks like this:

    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748

      

For this example, I could just use tile / repeat, but in practice df1['value1'] doesn't fit so neatly into the other frame. So I need to do it by matching on the columns cond and point.

I tried to merge them, but 1) the numbers don't seem to match, and 2) I don't want to carry over any unused columns from df1:

df1.merge(df2, left_on=['cond', 'point'], right_on=['cond', 'point'])

What is the correct way to add this new column without having to iterate over the 2 dataframes?


2 answers


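Option 1
For elegance and speed in pure pandas, we can use lookup. This gives the same result as all the other options, as shown below.

The concept is to represent the lookup data as a two-dimensional array and index into it with the lookup values. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this option only runs on older versions.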

d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
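
On pandas versions where DataFrame.lookup is no longer available, the same two-dimensional lookup idea can be sketched with Index.get_indexer plus plain numpy indexing. This is only one possible replacement, not the answer's original method; the seeded random generator and the names rows, cols and result are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Same 2-D table as in Option 1: rows are cond, columns are point.
d1 = df1.set_index(['cond', 'point'])['value1'].unstack()

# Translate labels into integer positions; -1 would mean "label not found".
rows = d1.index.get_indexer(df2['cond'].to_numpy())
cols = d1.columns.get_indexer(df2['point'].to_numpy())
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("df2 contains a cond/point value that is missing from df1")

# Fancy-index the underlying array with the two position vectors.
result = df2.assign(new_column=d1.to_numpy()[rows, cols])
```

Unlike Option 2, this does not assume that cond and point are consecutive integers starting at 1, because the labels are looked up rather than used directly as offsets.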

      

Option 2
We can do the same with numpy to improve performance, provided the data is laid out the same way as in df1 (sorted by cond and then point, with every combination present and both numbered from 1). It's very fast!

a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
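
Because the reshape trick silently returns wrong values when df1 is not laid out as a complete, sorted grid, it may be worth checking that assumption first. The helper below is a hypothetical sketch (grid_lookup is not a pandas or numpy function), and the sample data is regenerated with a seeded generator purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})


def grid_lookup(df1, df2):
    """Hypothetical helper: Option 2 with an explicit layout check."""
    n_cond = df1['cond'].nunique()
    n_point = df1['point'].nunique()
    # Layout the reshape relies on: cond varies slowest, point fastest,
    # both consecutive integers starting at 1, every pair present once.
    expected_cond = np.repeat(np.arange(1, n_cond + 1), n_point)
    expected_point = np.tile(np.arange(1, n_point + 1), n_cond)
    if not (np.array_equal(df1['cond'].to_numpy(), expected_cond)
            and np.array_equal(df1['point'].to_numpy(), expected_point)):
        raise ValueError("df1 is not a complete, sorted cond x point grid")
    a = df1['value1'].to_numpy().reshape(n_cond, n_point)
    return a[df2['cond'].to_numpy() - 1, df2['point'].to_numpy() - 1]


new_column = grid_lookup(df1, df2)
result = df2.assign(new_column=new_column)
```

If the check fails, one of the label-based options (merge, join) is the safer choice.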

      

Option 3
The canonical answer is to use merge with how='left'. But we need to prep df1 a little to nail the output.

# keep only the join keys and the wanted column, renamed to its final name
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, how='left')
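
Since the question mentions that the merged row counts did not seem to match, here is a hedged sketch of the same left merge with two of merge's own safety options switched on: validate, which raises if the keys in the right frame are not unique, and indicator, which marks rows of df2 that found no partner. The sample data is regenerated with a seeded generator, and the names merged and unmatched are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})

merged = df2.merge(
    d1,
    how='left',              # keep every row of df2
    on=['cond', 'point'],
    validate='many_to_one',  # raise MergeError if (cond, point) repeats in d1
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows of df2 whose (cond, point) pair does not exist in df1.
unmatched = merged[merged['_merge'] == 'left_only']
merged = merged.drop(columns='_merge')
```

With unique keys on the right side, a left merge returns exactly as many rows as df2, so a changed row count points to duplicate keys in df1.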

      

Option 4
I thought this one was fun. Create a mapping dictionary and a series to map over.
Good for small data, not good for large data. See the timings below.

c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}

c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})

df2.assign(new_column=s2.map(m))
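
The same key-to-value mapping idea can also be written without a Python dictionary, by keying a Series on (cond, point) and reindexing it with the pairs found in df2. This is a variation offered as a sketch, not the answer's original code; the names lookup, keys and result are assumptions, and the data is regenerated with a seeded generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Series keyed by (cond, point); the keys must be unique for reindex to work.
lookup = df1.set_index(['cond', 'point'])['value1']

# One (cond, point) key per row of df2, in df2's row order.
keys = pd.MultiIndex.from_frame(df2[['cond', 'point']])

# reindex returns NaN for any pair that is missing from df1.
result = df2.assign(new_column=lookup.reindex(keys).to_numpy())
```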

      




OUTPUT

    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748

      


Timing
small data

%%timeit 
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 304 µs per loop

%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 1.8 ms per loop

%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}
c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})
df2.assign(new_column=s2.map(m))
1000 loops, best of 3: 719 µs per loop

%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 2.04 ms per loop

%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.01 ms per loop

%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.15 ms per loop

      

big data

df2 = pd.concat([df2] * 10000, ignore_index=True)

%%timeit 
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 1.93 ms per loop

%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 5.58 ms per loop

%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}
c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})
df2.assign(new_column=s2.map(m))
10 loops, best of 3: 135 ms per loop

%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 13.4 ms per loop

%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
10 loops, best of 3: 19.8 ms per loop

%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 18.2 ms per loop
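
The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, a similar comparison can be run as a plain script with the standard-library timeit module. The function names, the seeded data and the smaller repetition factor (1000 copies rather than the 10000 used above) are assumptions for the illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})
df2 = pd.concat([df2] * 1000, ignore_index=True)


def with_numpy():
    a = df1['value1'].to_numpy().reshape(2, -1)
    return df2.assign(
        new_column=a[df2['cond'].to_numpy() - 1, df2['point'].to_numpy() - 1])


def with_merge():
    d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
    return df2.merge(d1, how='left')


def with_join():
    d1 = df1.drop('unused1', axis=1).set_index(['cond', 'point'])
    return df2.join(d1, on=['cond', 'point']).rename(columns={'value1': 'new_column'})


for fn in (with_numpy, with_merge, with_join):
    # best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```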

      



You can use merge with a left join, drop to remove the column unused1, and finally rename the new column.

Note: the parameter on can be omitted if the join columns are the only column names the two DataFrames have in common. If they share more column names, add on=['cond', 'point'].

df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df = df.rename(columns={'value1': 'new_column'})
print(df)
    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748
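
To illustrate the note about on, here is a small sketch with two made-up frames (left and right are hypothetical and not part of the question's data) that share a third column name, value, in addition to the join keys.

```python
import pandas as pd

left = pd.DataFrame({'cond': [1, 1, 2],
                     'point': [1, 2, 1],
                     'value': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'cond': [1, 1, 2],
                      'point': [1, 2, 1],
                      'value': [0.1, 0.2, 0.3]})

# Without `on`, merge joins on every shared column name, so `value`
# is treated as a key as well and nothing from `right` is brought over.
implicit = left.merge(right, how='left')

# With `on`, only cond and point are keys; the clashing `value` column
# from `right` is kept under a suffixed name.
explicit = left.merge(right, how='left', on=['cond', 'point'],
                      suffixes=('', '_right'))
```

Here implicit has only the columns cond, point and value, while explicit gains a value_right column holding the values from right.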

      



Another solution uses join (a left join by default) together with set_index and drop:

df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df = df.rename(columns={'value1': 'new_column'})
print(df)
    cond  point    value2  new_column
0      1      1  0.990252    0.103046
1      1      2  0.534813    0.188408
2      1      3  0.407325    0.481349
3      1      4  0.969288    0.807454
4      1      5  0.085832    0.962032
5      1      1  0.922026    0.103046
6      1      2  0.567615    0.188408
7      1      3  0.174402    0.481349
8      1      4  0.469556    0.807454
9      1      5  0.511182    0.962032
10     2      1  0.219902    0.437961
11     2      2  0.761498    0.026166
12     2      3  0.406981    0.109630
13     2      4  0.551322    0.415101
14     2      5  0.727761    0.660748
15     2      1  0.075048    0.437961
16     2      2  0.159903    0.026166
17     2      3  0.726013    0.109630
18     2      4  0.848213    0.415101
19     2      5  0.284404    0.660748

      
