Interpolating values from a data frame based on a column value

Assuming I have the following problem:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', 'y1', 'y2', 'y3'])

df
      x0     y1     y2     y3
0 0.5434 0.2784 0.4245 0.8448
1 0.0047 0.1216 0.6707 0.8259
2 0.1367 0.5751 0.8913 0.2092
3 0.1853 0.1084 0.2197 0.9786
4 0.8117 0.1719 0.8162 0.2741
5 0.4317 0.9400 0.8176 0.3361
6 0.1754 0.3728 0.0057 0.2524
7 0.7957 0.0153 0.5988 0.6038
8 0.1051 0.3819 0.0365 0.8904
9 0.9809 0.0599 0.8905 0.5769

I would like to add an interpolated column named interp. The x-coordinate value for the interpolation is contained in the column x0, the x-coordinates of the data points are given by xp, and the y-coordinates of the data points are contained in y1, y2 and y3.

So far I have come up with the following:

df['interp'] = df.apply(lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

df
      x0     y1     y2     y3  interp
0 0.5434 0.2784 0.4245 0.8448  0.4610
1 0.0047 0.1216 0.6707 0.8259  0.1268
2 0.1367 0.5751 0.8913 0.2092  0.6616
3 0.1853 0.1084 0.2197 0.9786  0.1496
4 0.8117 0.1719 0.8162 0.2741  0.4783
5 0.4317 0.9400 0.8176 0.3361  0.8344
6 0.1754 0.3728 0.0057 0.2524  0.2440
7 0.7957 0.0153 0.5988 0.6038  0.6018
8 0.1051 0.3819 0.0365 0.8904  0.3093
9 0.9809 0.0599 0.8905 0.5769  0.5889

      

However, the dataframe on which this calculation will be performed contains over a million rows, so I would like a faster method than apply. Any ideas?

np.interp only seems to accept 1-D arrays, which is the reason I went with apply.



1 answer


A good solution for this is pandas.DataFrame.eval():

TL;DR

Seconds per number of rows
Rows:     100   1000  10000    1E5    1E6    1E7
apply:  0.076  0.734  7.812
eval:   0.056  0.053  0.058  0.087  0.338  2.887

      

As you can see from these timings, eval() has a fixed setup overhead, so up to about 10,000 rows it takes roughly the same time regardless of size. But it is two orders of magnitude faster than apply, and hence definitely worth the overhead for large datasets.

What is it?

From the docs:

pandas.eval(expr, parser='pandas', engine=None, truediv=True, 
            local_dict=None, global_dict=None, resolvers=(),
            level=0, target=None, inplace=None)

      

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only), along with the following boolean operations: | (or), & (and), and ~ (not). In addition, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave in the same way as plain Python evaluation.
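
To illustrate, here is a minimal sketch (with toy column names a and b, not taken from the question) of the kind of column-wise expression eval() computes without a Python-level loop over rows:

```python
import pandas as pd

# Toy frame: the expression string is parsed once and then evaluated
# on whole columns at a time.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Arithmetic and comparisons both operate column-wise; multiplying by
# a boolean comparison zeroes out the rows where it is False.
result = df.eval('(b - a) * (a >= 2)')
print(result.tolist())  # [0.0, 18.0, 27.0]
```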



The trick used for this question:

The code below takes advantage of the fact that with xp = [0.0, 0.5, 1.0] the interpolation is always done over exactly two segments. It computes the interpolation for both segments and then discards the unused one by multiplying with a boolean test (which evaluates to 0 or 1).

The actual expression passed to eval is the following:

((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
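
The same two-segment idea can also be written directly in NumPy/pandas without eval(), as a sketch: compute both segment interpolations column-wise and use np.where to keep the one that contains x0.

```python
import numpy as np
import pandas as pd

xp = [0.0, 0.5, 1.0]

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', 'y1', 'y2', 'y3'])

# Linear interpolation on the left segment [0.0, 0.5] and the right
# segment [0.5, 1.0], computed for every row at once
left = (df.y2 - df.y1) / (xp[1] - xp[0]) * (df.x0 - xp[0]) + df.y1
right = (df.y3 - df.y2) / (xp[2] - xp[1]) * (df.x0 - xp[1]) + df.y2

# Keep the segment that actually contains x0
df['interp'] = np.where(df.x0 < xp[1], left, right)
```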

      

Code:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)

def method1():
    # Row-wise apply: one Python-level np.interp call per row (slow)
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

def method2():
    # Build the piecewise-linear expression for each segment; the
    # boolean factor selects the segment that actually contains x0
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s 0.5)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<')
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=')

    df['interp2'] = df.eval(exp_1 + '+' + exp_2)

from timeit import timeit

def runit(stmt):
    # Time 10 runs of the named function
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])

    print('Rows: %d' % size)
    if size <= 10000:
        runit('method1')  # apply is too slow beyond 10,000 rows
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())

      

Results:

         x0        y1        y2        y3    interp   interp2
0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946

      
