Vectorized method for synchronizing two arrays
I have two Pandas TimeSeries: x
and y
, which I would like to sync with "on state". I would like to find, for each element, to the x
last (by index) element in y
that precedes it (by index value). For example, I would like to calculate this new_x
:
x new_x
---- -----
13:01 13:00
14:02 14:00
y
----
13:00
13:01
13:30
14:00
I'm looking for a vectorial solution, not a Python loop. Time values ββare based on Numpy datetime64
. The array y
is on the order of millions, so O (n ^ 2) solutions are probably not practical.
source to share
In some circles this operation is called "asof" concatenation. Here's the implementation :
def diffCols(df1, df2):
""" Find columns in df1 not present in df2
Return df1.columns - df2.columns maintaining the order which the resulting
columns appears in df1.
Parameters:
----------
df1 : pandas dataframe object
df2 : pandas dataframe objct
Pandas already offers df1.columns - df2.columns, but unfortunately
the original order of the resulting columns is not maintained.
"""
return [i for i in df1.columns if i not in df2.columns]
def aj(df1, df2, overwriteColumns=True, inplace=False):
""" KDB+ like asof join.
Finds prevailing values of df2 asof df1 index. The resulting dataframe
will have same number of rows as df1.
Parameters
----------
df1 : Pandas dataframe
df2 : Pandas dataframe
overwriteColumns : boolean, default True
The columns of df2 will overwrite the columns of df1 if they have the same
name unless overwriteColumns is set to False. In that case, this function
will only join columns of df2 which are not present in df1.
inplace : boolean, default False.
If True, adds columns of df2 to df1. Otherwise, create a new dataframe with
columns of both df1 and df2.
*Assumes both df1 and df2 have datetime64 index. """
joiner = lambda x : x.asof(df1.index)
if not overwriteColumns:
# Get columns of df2 not present in df1
cols = diffCols(df2, df1)
if len(cols) > 0:
df2 = df2.ix[:,cols]
result = df2.apply(joiner)
if inplace:
for i in result.columns:
df1[i] = result[i]
return df1
else:
return result
Used internally pandas.Series.asof()
.
source to share
How about using Series.searchsorted()
to return the index y
where you are pasting x
. Then you can subtract one from this value and use it for indexing y
.
In [1]: x
Out[1]:
0 1301
1 1402
In [2]: y
Out[2]:
0 1300
1 1301
2 1330
3 1400
In [3]: y[y.searchsorted(x)-1]
Out[3]:
0 1300
3 1400
Note: the above example uses the int64 series
source to share