Auto-filling a column using a calculation function in Python (pandas)

What I've gotten so far is the code below, and it works great and produces the results it should: it fills in df['c'] with the calculation previous c * b wherever c is missing. The problem is that I have to apply it to a large dataset (len(df.index) is roughly 10,000), so this approach does not fit: I would have to write df['c'] = df.apply(func, axis=1) a couple of thousand times, once per pass. A plain while loop is not an option in pandas at this dataset size. Any ideas?

import pandas as pd
import numpy as np
import datetime

randn = np.random.randn
rng = pd.date_range('1/1/2011', periods=10, freq='D')

df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df["c"] = np.nan

# seed two known values of c
df.loc[df.index[0], "c"] = 1
df.loc[df.index[2], "c"] = 3


def func(x):
    # keep c if it is already known, otherwise previous c * b
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']

# each pass fills in only one more row after every already-known value,
# so it has to be repeated until the column is complete
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
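
For illustration, the repetition can be collapsed into a loop; this is a small sketch using the df and func above (meant to replace the repeated apply lines), and it is still one full apply pass per remaining gap, which is exactly what does not scale to 10,000 rows:

# Sketch: replaces the seven repeated df.apply lines above.
# Each pass only fills rows whose previous row already had a value,
# so the number of passes grows with the longest run of missing values.
# Assumes the first row is seeded; otherwise the loop would never finish.
while df['c'].isnull().any():
    df['c'] = df.apply(func, axis=1)
    print('NaNs remaining:', int(df['c'].isnull().sum()))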

      

+3




3 answers


Here's a good way to deal with the repetition problem. This will be documented in pandas v0.16.2 (releasing next week); see the pandas docs on using numba.

This will be pretty fast, since the real heavy lifting is done by compiled, jit-ted code.



import pandas as pd
import numpy as np
from numba import jit

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [np.nan] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df.loc[df.index[0], "c"] = 1
df.loc[df.index[2], "c"] = 3

@jit(nopython=True)
def ffill(arr_b, arr_c):
    # fill missing values of c with previous_result * b, in a single pass
    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()

    for i in range(1, n):
        if not np.isnan(arr_c[i]):
            result[i] = arr_c[i]
        else:
            result[i] = result[i - 1] * arr_b[i]

    return result

df['d'] = ffill(df.b.values, df.c.values)

             a   b   c      d
2011-01-01 NaN   2   1      1
2011-01-02 NaN   3 NaN      3
2011-01-03 NaN  10   3      3
2011-01-04 NaN   3 NaN      9
2011-01-05 NaN   5 NaN     45
2011-01-06 NaN   8 NaN    360
2011-01-07 NaN   4 NaN   1440
2011-01-08 NaN   1 NaN   1440
2011-01-09 NaN   2 NaN   2880
2011-01-10 NaN   6 NaN  17280
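
As a rough sketch of how this scales to the roughly 10,000 rows from the question (it assumes the ffill function above is in scope; the frame and its b values are made up for the test):

import time

n = 10_000
# b values chosen so the running product stays within float range
big = pd.DataFrame({'b': np.random.choice([0.5, 1.0, 2.0], size=n), 'c': np.nan},
                   index=pd.date_range('1/1/2011', periods=n, freq='D'))
big.loc[big.index[0], 'c'] = 1.0            # seed the first row

start = time.perf_counter()
big['d'] = ffill(big['b'].values, big['c'].values)
# note: the first call for a new argument-type signature includes JIT compile time
print('jitted fill over %d rows took %.4f s' % (n, time.perf_counter() - start))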

      

+4




If you print df inside a for loop:

for i in range(7):
    df['c'] = df.apply(func, axis=1)
    print(df)

you can trace where the values in column c come from:

               a   b      c    c as a product
2011-01-01  None   2      1    1
2011-01-02  None   3      3    3*1
2011-01-03  None  10      3    1*3*1
2011-01-04  None   3      9    3*1*3*1
2011-01-05  None   5     45    5*3*1*3*1
2011-01-06  None   8    360    ...
2011-01-07  None   4   1440    ...
2011-01-08  None   1   1440    ...
2011-01-09  None   2   2880    ...
2011-01-10  None   6  17280    6*2*4*8*5*3*3

      

You can clearly see that the values come from a cumulative product. Each row's value is the previous row's value multiplied by some number. That new number sometimes comes from b and sometimes is 1 (when c is not NaN).

So if we can build a column d that holds these "new" numbers, the desired values can be computed with cumprod:

df['c'] = df['d'].cumprod() 

      




import pandas as pd
import numpy as np

def setup_df():
    rng = pd.date_range('1/1/2011', periods=10, freq='D')
    df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                      index=rng)
    df["c"] = np.nan
    df.iloc[0, -1] = 1    # seed c on the first and third rows
    df.iloc[2, -1] = 3
    return df

df = setup_df()
df['d'] = df['b']               # multiply by b where c is unknown...
mask = pd.notnull(df['c'])
df.loc[mask, 'd'] = 1           # ...and by 1 where c is already given
df['c'] = df['d'].cumprod()
print(df)

      

gives

               a   b      c  d
2011-01-01  None   2      1  1
2011-01-02  None   3      3  3
2011-01-03  None  10      3  1
2011-01-04  None   3      9  3
2011-01-05  None   5     45  5
2011-01-06  None   8    360  8
2011-01-07  None   4   1440  4
2011-01-08  None   1   1440  1
2011-01-09  None   2   2880  2
2011-01-10  None   6  17280  6

      

I left column d in place to show where the values in c come from. You can of course remove the column with

del df['d']

      

Or better yet, as chrisaycock points out, you can skip defining the d column altogether and instead use

df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()
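
As a quick sanity check (a sketch that assumes the setup_df helper from the listing above), the one-liner can be compared against a plain row-by-row loop. The cumprod shortcut treats every already-known value of c as a multiply-by-1 step, so it reproduces the loop only when those known values agree with the running product, as they happen to do here:

# Sanity-check sketch, assuming setup_df() from above.
df1 = setup_df()
df1['c'] = np.where(pd.notnull(df1['c']), 1, df1['b']).cumprod()

df2 = setup_df()
for i in range(1, len(df2)):
    if pd.isnull(df2['c'].iloc[i]):
        df2.iloc[i, df2.columns.get_loc('c')] = df2['c'].iloc[i - 1] * df2['b'].iloc[i]

print((df1['c'] == df2['c']).all())   # should print True for this data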

      

+4




You can simply write your loop like this:

for i in range(1, len(df)):
    if pd.isnull(df['c'].iloc[i]):
        df.iloc[i, df.columns.get_loc('c')] = df['c'].iloc[i - 1] * df['b'].iloc[i]

      

If this takes too long for you, you can jit it with numba. Your example DataFrame is too small for a meaningful timing test on my system.
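
As a rough sketch of what such a test could look like at the roughly 10,000 rows from the question (the frame layout mirrors the question, but the b values here are made up):

import time

n = 10_000
big = pd.DataFrame({'b': np.random.choice([0.5, 1.0, 2.0], size=n), 'c': np.nan},
                   index=pd.date_range('1/1/2011', periods=n, freq='D'))
big.iloc[0, big.columns.get_loc('c')] = 1.0   # seed the first row

start = time.perf_counter()
c_pos = big.columns.get_loc('c')
for i in range(1, len(big)):
    if pd.isnull(big['c'].iloc[i]):
        big.iloc[i, c_pos] = big['c'].iloc[i - 1] * big['b'].iloc[i]
print('plain Python loop took %.2f s' % (time.perf_counter() - start))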

+1








