Columnless concatenation in Pandas
I have two data frames with time-series data. I want to select all rows where both frames have a timestamp that differs by less than a certain threshold.
For example, dataframes 1 and 2 look like this, apart from the varying, unpredictable clock values:
      clock      head        px        py        pz        qw
0  0.000000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514
1  0.200000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522
2  1.200000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507

      clock      head        px        py        pz        qw
0  0.010000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514
1  1.104000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522
2  2.403000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507
The resulting dataframe, assuming a threshold of 0.1, looks like this:
      clock     head1     head2       px1       px2  ...
0  0.000000 -0.316579 -0.316579  0.119198  0.119198  ...
1  1.200000 -0.316546 -0.316642  0.119199  0.119212  ...
My current approach is to create an identical "filler" column in both dataframes, merge on that column (creating len(dataframe1) * len(dataframe2) rows), and then filter down to the rows I want:
# rename the dataframe keys so that they are different
dataframe1.columns = [i + str(1) for i in dataframe1.columns.values]
dataframe1['filler'] = 0
dataframe2.columns = [i + str(2) for i in dataframe2.columns.values]
dataframe2['filler'] = 0
# merge requires a column to merge on, so merge on the filler
df_merged = dataframe1.merge(dataframe2, on='filler', how='left')
# pick out only the rows whose time difference is within the threshold
# (keyword is the name of the timestamp column, e.g. 'clock')
mask = (df_merged[keyword + str(1)] <= df_merged[keyword + str(2)] + threshold) & \
       (df_merged[keyword + str(1)] > df_merged[keyword + str(2)] - threshold)
df_merged = df_merged[mask]
# delete the filler column
del df_merged['filler']
# reindex the dataframe
df_merged = df_merged.reset_index(drop=True)
It is very fast and gives me the result I want, but it seems silly to create a "filler" column that I then have to delete. Is there a more obvious solution that I missed?
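For reference, newer pandas versions (1.2 and later) support how='cross' in merge, which produces the same cartesian product directly, with no filler column. This is only a sketch with made-up stand-in frames; the column names 'clock1'/'clock2' are illustrative, not from the question:

```python
import pandas as pd

# Stand-in frames with pre-suffixed column names (made up for illustration)
dataframe1 = pd.DataFrame({'clock1': [0.0, 0.2, 1.2],
                           'head1': [-0.316579, -0.316642, -0.316546]})
dataframe2 = pd.DataFrame({'clock2': [0.01, 1.104, 2.403],
                           'head2': [-0.316579, -0.316642, -0.316546]})

threshold = 0.1

# Cartesian product of the two frames; no 'on' column is needed
df_merged = dataframe1.merge(dataframe2, how='cross')

# keep only the pairs whose timestamps differ by less than the threshold
mask = (df_merged['clock1'] - df_merged['clock2']).abs() < threshold
df_merged = df_merged[mask].reset_index(drop=True)
print(df_merged)
```

This keeps exactly the (0.0, 0.01) and (1.2, 1.104) pairs, matching the expected result above.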
Merging directly on the timestamp column doesn't give me what I want: it only keeps rows where the timestamps are exactly equal, with no tolerance for the time difference.
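Another tool worth knowing for this kind of tolerance join is pd.merge_asof, which matches each row of the left frame to the nearest key in the right frame within a given tolerance, in one call. A sketch with made-up frames (note that both inputs must already be sorted on the key, and each left row matches at most one right row):

```python
import pandas as pd

# Made-up frames, sorted on 'clock' as merge_asof requires
df1 = pd.DataFrame({'clock': [0.0, 0.2, 1.2], 'head1': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'clock': [0.01, 1.104, 2.403], 'head2': [10.0, 20.0, 30.0]})

# join each df1 row to the nearest df2 timestamp no more than 0.1 away;
# unmatched left rows are kept with NaN in the right-hand columns
matched = pd.merge_asof(df1, df2, on='clock',
                        direction='nearest', tolerance=0.1)
print(matched)
```

Here clock 0.0 pairs with 0.01 and clock 1.2 with 1.104, while clock 0.2 has no partner within 0.1 and gets NaN; a dropna() afterwards would keep only the matched pairs.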
You can use np.where to change df2's clock column to match df1's wherever the difference is within the threshold before merging, i.e. a fuzzy match.
import pandas as pd
import numpy as np

# THE TEST DATA YOU GAVE US -------------------------
columns = ['clock', 'head', 'px', 'py', 'pz', 'qw']
series1 = [(0.0, 0.1, 0.5),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
series2 = [(0.01, 0.104, 0.403),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
# THE TEST DATA YOU GAVE US -------------------------

# note: zip pairs only the first six series with the six column names,
# so the unnamed seventh column from the question is dropped here
df1 = pd.DataFrame(dict(zip(columns, series1)))
df2 = pd.DataFrame(dict(zip(columns, series2)))

threshold = 0.99

# overwrite df2's clock with df1's wherever the difference is within
# the threshold, so an exact merge on 'clock' then pairs the rows up
df2['clock'] = np.where(
    abs(df1['clock'] - df2['clock']) < threshold, df1['clock'], df2['clock'])

merged_df = df1.merge(df2, on='clock', how='outer')
print(merged_df)
   clock    head_x      px_x      py_x      pz_x      qw_x    head_y      px_y      py_y      pz_y      qw_y
0    0.0 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.316579  0.119198  0.149585  0.271688  0.987492
1    0.1 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.316642  0.119212  0.149593  0.271678  0.987487
2    0.5 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.316546  0.119199  0.149585  0.271669  0.987495
It makes sense not to merge rows that do not match within the threshold, so if your DataFrames also had rows with df1['clock'] == 6 and df2['clock'] == 7 (outside the threshold of 0.99), you would end up with two more rows: one with clock == 6 and all _y columns full of NaNs, and one with clock == 7 and all _x columns full of NaNs.
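That NaN behavior of the outer merge can be seen with a toy example (values made up for illustration):

```python
import pandas as pd

# clock 0.0 matches in both frames; 6.0 and 7.0 have no partner
df1 = pd.DataFrame({'clock': [0.0, 6.0], 'head': [1.0, 2.0]})
df2 = pd.DataFrame({'clock': [0.0, 7.0], 'head': [1.0, 3.0]})

# the unmatched clocks each contribute a row where the other frame's
# columns (suffixed _x / _y) are filled with NaN
merged = df1.merge(df2, on='clock', how='outer')
print(merged)
```

Using how='inner' instead would drop those two NaN rows and keep only the matched clock == 0.0 row.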