Columnless concatenation in Pandas
I have two data frames with time-series data. I want to select all rows where both frames have a timestamp that differs by less than a certain threshold.
For example, dataframes 1 and 2 look like this, apart from the varying, unpredictable clock values:
      clock      head        px        py        pz        qw
0  0.000000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514
1  0.200000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522
2  1.200000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507

      clock      head        px        py        pz        qw
0  0.010000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514
1  1.104000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522
2  2.403000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507
The resulting dataframe, assuming a threshold of 0.1, looks like this:
      clock     head1     head2       px1       px2  ...
0  0.000000 -0.316579 -0.316579  0.119198  0.119198  ...
1  1.200000 -0.316546 -0.316642  0.119199  0.119212  ...
My current approach is to create an identical "filler" column in both dataframes, merge on that column (creating len(dataframe1) * len(dataframe2) rows), and then filter down to the rows I want:
# rename the dataframe keys so that they are different
dataframe1.columns = [i + str(1) for i in dataframe1.columns.values]
dataframe1['filler'] = 0
dataframe2.columns = [i + str(2) for i in dataframe2.columns.values]
dataframe2['filler'] = 0
# merge requires a column to merge on, so merge on the filler
df_merged = dataframe1.merge(dataframe2, on='filler', how='left')
# pick out only the rows whose time difference is within the threshold
# (keyword is the name of the timestamp column, e.g. 'clock')
mask = (df_merged[keyword + str(1)] <= df_merged[keyword + str(2)] + threshold) & \
       (df_merged[keyword + str(1)] > df_merged[keyword + str(2)] - threshold)
df_merged = df_merged[mask]
# delete the filler column
del df_merged['filler']
# reindex the dataframe
df_merged = df_merged.reset_index(drop=True)
It is very fast and gives me the result I want, but it seems silly to create a "filler" column that I then have to delete. Is there a more obvious solution that I missed?
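For reference, newer pandas versions (1.2 and later) support how='cross' in merge, which produces the same cartesian product directly, with no filler column. This is only a sketch with made-up stand-in frames; the column names 'clock1'/'clock2' are illustrative, not from the question:

```python
import pandas as pd

# Stand-in frames with pre-suffixed column names (made up for illustration)
dataframe1 = pd.DataFrame({'clock1': [0.0, 0.2, 1.2],
                           'head1': [-0.316579, -0.316642, -0.316546]})
dataframe2 = pd.DataFrame({'clock2': [0.01, 1.104, 2.403],
                           'head2': [-0.316579, -0.316642, -0.316546]})

threshold = 0.1

# Cartesian product of the two frames; no 'on' column is needed
df_merged = dataframe1.merge(dataframe2, how='cross')

# keep only the pairs whose timestamps differ by less than the threshold
mask = (df_merged['clock1'] - df_merged['clock2']).abs() < threshold
df_merged = df_merged[mask].reset_index(drop=True)
print(df_merged)
```

This keeps exactly the (0.0, 0.01) and (1.2, 1.104) pairs, matching the expected result above.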
Merging directly on the timestamp column doesn't give me what I want: it only keeps rows where the timestamps are exactly equal, with no tolerance for the time difference.
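Another tool worth knowing for this kind of tolerance join is pd.merge_asof, which matches each row of the left frame to the nearest key in the right frame within a given tolerance, in one call. A sketch with made-up frames (note that both inputs must already be sorted on the key, and each left row matches at most one right row):

```python
import pandas as pd

# Made-up frames, sorted on 'clock' as merge_asof requires
df1 = pd.DataFrame({'clock': [0.0, 0.2, 1.2], 'head1': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'clock': [0.01, 1.104, 2.403], 'head2': [10.0, 20.0, 30.0]})

# join each df1 row to the nearest df2 timestamp no more than 0.1 away;
# unmatched left rows are kept with NaN in the right-hand columns
matched = pd.merge_asof(df1, df2, on='clock',
                        direction='nearest', tolerance=0.1)
print(matched)
```

Here clock 0.0 pairs with 0.01 and clock 1.2 with 1.104, while clock 0.2 has no partner within 0.1 and gets NaN; a dropna() afterwards would keep only the matched pairs.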
You can use np.where to change df2's clock column to match df1's wherever the difference is within the threshold before merging, i.e. a fuzzy match.
import pandas as pd
import numpy as np

# THE TEST DATA YOU GAVE US -------------------------
columns = ['clock', 'head', 'px', 'py', 'pz', 'qw']
series1 = [(0.0, 0.1, 0.5),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
series2 = [(0.01, 0.104, 0.403),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
# THE TEST DATA YOU GAVE US -------------------------

# note: zip pairs only the first six series with the six column names,
# so the unnamed seventh column from the question is dropped here
df1 = pd.DataFrame(dict(zip(columns, series1)))
df2 = pd.DataFrame(dict(zip(columns, series2)))

threshold = 0.99

# overwrite df2's clock with df1's wherever the difference is within
# the threshold, so an exact merge on 'clock' then pairs the rows up
df2['clock'] = np.where(
    abs(df1['clock'] - df2['clock']) < threshold, df1['clock'], df2['clock'])

merged_df = df1.merge(df2, on='clock', how='outer')
print(merged_df)
   clock    head_x      px_x      py_x      pz_x      qw_x    head_y      px_y      py_y      pz_y      qw_y
0    0.0 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.316579  0.119198  0.149585  0.271688  0.987492
1    0.1 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.316642  0.119212  0.149593  0.271678  0.987487
2    0.5 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.316546  0.119199  0.149585  0.271669  0.987495
It makes sense not to merge rows that do not match within the threshold, so if your DataFrames also had rows with df1['clock'] == 6 and df2['clock'] == 7 (outside the threshold of 0.99), you would end up with two more rows: one with clock == 6 and all _y columns full of NaNs, and one with clock == 7 and all _x columns full of NaNs.
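That NaN behavior of the outer merge can be seen with a toy example (values made up for illustration):

```python
import pandas as pd

# clock 0.0 matches in both frames; 6.0 and 7.0 have no partner
df1 = pd.DataFrame({'clock': [0.0, 6.0], 'head': [1.0, 2.0]})
df2 = pd.DataFrame({'clock': [0.0, 7.0], 'head': [1.0, 3.0]})

# the unmatched clocks each contribute a row where the other frame's
# columns (suffixed _x / _y) are filled with NaN
merged = df1.merge(df2, on='clock', how='outer')
print(merged)
```

Using how='inner' instead would drop those two NaN rows and keep only the matched clock == 0.0 row.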