Autofill a column using a calculation function in Python
What I have so far is the code below, and it works and produces the results it should: wherever c is missing, it fills df['c'] with the calculation previous c * b. The problem is that I have to apply it to a large data set (len(df.index) is roughly 10,000), so this approach does not scale: I would have to write df['c'] = df.apply(func, axis=1) a couple of thousand times. Is a while loop really the only option in pandas for a dataset of this size? Any ideas?
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                  index=rng)
df['c'] = np.nan
df.iloc[0, df.columns.get_loc('c')] = 1
df.iloc[2, df.columns.get_loc('c')] = 3

def func(x):
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']

df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
Here's a good way to deal with the recursion problem. This will be documented in pandas v0.16.2 (releasing next week); see the numba docs.
This is going to be pretty fast, since the real heavy lifting is done in fast jit-compiled code.
import pandas as pd
import numpy as np
from numba import jit

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [np.nan] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                  index=rng)
df.loc[rng[0], 'c'] = 1
df.loc[rng[2], 'c'] = 3

@jit
def ffill(arr_b, arr_c):
    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()
    for i in range(1, n):
        if not np.isnan(arr_c[i]):
            result[i] = arr_c[i]
        else:
            result[i] = result[i - 1] * arr_b[i]
    return result

df['d'] = ffill(df.b.values, df.c.values)
a b c d
2011-01-01 NaN 2 1 1
2011-01-02 NaN 3 NaN 3
2011-01-03 NaN 10 3 3
2011-01-04 NaN 3 NaN 9
2011-01-05 NaN 5 NaN 45
2011-01-06 NaN 8 NaN 360
2011-01-07 NaN 4 NaN 1440
2011-01-08 NaN 1 NaN 1440
2011-01-09 NaN 2 NaN 2880
2011-01-10 NaN 6 NaN 17280
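The @jit decorator only compiles the loop; the logic itself is plain Python. As a sketch (not part of the original answer), the same recurrence works without numba, just slower, which is handy if numba is not installed:

```python
import numpy as np

def ffill_plain(arr_b, arr_c):
    """Plain-Python version of ffill above, usable without numba."""
    result = arr_c.copy()  # non-NaN entries of c are kept as-is
    for i in range(1, len(arr_b)):
        if np.isnan(arr_c[i]):
            result[i] = result[i - 1] * arr_b[i]
    return result

b = np.array([2, 3, 10, 3, 5, 8, 4, 1, 2, 6], dtype=float)
c = np.full(10, np.nan)
c[0], c[2] = 1, 3
print(ffill_plain(b, c))  # matches column d in the output above
```

Since result starts as a copy of arr_c, the "keep c where it is not NaN" branch comes for free and only the NaN case needs handling.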
If you print df inside a for loop:
for i in range(7):
    df['c'] = df.apply(func, axis=1)
    print(df)
you can trace where the values in column c come from:
a b c (c as a product)
2011-01-01 None 2 1 1
2011-01-02 None 3 3 3*1
2011-01-03 None 10 3 1*3*1
2011-01-04 None 3 9 3*1*3*1
2011-01-05 None 5 45 5*3*1*3*1
2011-01-06 None 8 360 ...
2011-01-07 None 4 1440 ...
2011-01-08 None 1 1440 ...
2011-01-09 None 2 2880 ...
2011-01-10 None 6 17280 6*2*4*8*5*3*3
You can clearly see that the values come from a cumulative product. Each row is the value from the previous row, multiplied by some number. This new number comes from b when c is NaN, or is 1 (when c is not NaN).
So, if we can create a column d that holds these "new" numbers, then the desired values can be computed with cumprod:
df['c'] = df['d'].cumprod()
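To see cumprod on its own, here is a minimal sketch using the first five multipliers from the d column shown below:

```python
import pandas as pd

# The first five "new number" multipliers: 1 and the b values 3, 3, 5,
# with 1 substituted where c is already given.
d = pd.Series([1, 3, 1, 3, 5])

# cumprod gives the running product, reproducing column c: 1, 3, 3, 9, 45.
print(d.cumprod().tolist())
```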
import pandas as pd
import numpy as np

def setup_df():
    rng = pd.date_range('1/1/2011', periods=10, freq='D')
    df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                      index=rng)
    df['c'] = np.nan
    df.iloc[0, -1] = 1
    df.iloc[2, -1] = 3
    return df

df = setup_df()
df['d'] = df['b']
mask = pd.notnull(df['c'])
df.loc[mask, 'd'] = 1
df['c'] = df['d'].cumprod()
print(df)
gives
a b c d
2011-01-01 None 2 1 1
2011-01-02 None 3 3 3
2011-01-03 None 10 3 1
2011-01-04 None 3 9 3
2011-01-05 None 5 45 5
2011-01-06 None 8 360 8
2011-01-07 None 4 1440 4
2011-01-08 None 1 1440 1
2011-01-09 None 2 2880 2
2011-01-10 None 6 17280 6
I left column d in place to show where the values in c come from. You can of course remove the column with
del df['d']
Or better yet, as chrisaycock points out, you can skip defining the d column altogether and instead use
df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()
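Rebuilding the same toy frame, a quick check (a sketch, not part of the original answer) confirms the one-liner reproduces the column computed above:

```python
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df['c'] = np.nan
df.loc[rng[0], 'c'] = 1
df.loc[rng[2], 'c'] = 3

# Multiplier is 1 where c is already set, otherwise b; cumprod chains them.
df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()
print(df['c'].tolist())  # [1, 3, 3, 9, 45, 360, 1440, 1440, 2880, 17280]
```

Note this relies on the given c values (1 and 3 here) being consistent with the running product up to that row, as they are in this dataset.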