Combine data by date after truncation
Thanks for looking at this ....
It is necessary to reduce the accuracy of the IoT Sensor data timestamps and merge.
I have two csv files with the following data
CSV-1
datetime,temperature
2017-06-13 22:20:11.309,82.4
2017-06-13 22:19:54.004,82.4
2017-06-13 22:19:36.661,82.4
2017-06-13 22:19:19.359,82.4
CSV-2
datetime,humidity
2017-06-13 22:07:30.723,63.0
2017-06-13 22:07:13.448,63.0
2017-06-13 22:06:56.115,63.0
2017-06-13 22:06:38.806,63.0
Note that datetime entries are relative to the millisecond. I am using the following code to reduce the precision to seconds.
ugt = pd.read_csv('ugt.csv', parse_dates=True, index_col=0)
ugh = pd.read_csv('ugh.csv', parse_dates=True, index_col=0)
ugt.index = ugt.index.map(lambda x: x.replace(microsecond=0))
ugh.index = ugh.index.map(lambda x: x.replace(microsecond=0))
This creates the following data frames:
temperature
datetime
2017-06-13 22:06:57 82.4 <---
2017-06-13 22:06:37 82.4
2017-06-13 22:06:20 82.4
2017-06-13 22:06:03 82.0 <---
humidity
datetime
2017-06-13 22:06:57 63.0 <---
2017-06-13 22:06:38 63.0
2017-06-13 22:06:21 63.0
2017-06-13 22:06:03 63.0 <---
Please note that some timestamps are the same (see <---) with the second, others do not. this is due to the limitation of the ability of the various sensors to take readings. There is no constant frequency.
We then create a master framework that is populated with strings in the afternoon for the period of time that we collected data from all sensors.
temperature humidity
2017-04-25 12:00:00 0 0
2017-04-25 12:00:01 0 0
2017-04-25 12:00:02 0 0
2017-04-25 12:00:03 0 0
2017-04-25 12:00:04 0 0
We can't figure out how to use pandas concat, merge, add two csv files to the main datetime based dataframe. We need the following:
temperature humidity
2017-04-25 12:00:00 0 0
2017-04-25 12:00:01 82.0 0
2017-04-25 12:00:02 0 44.0
2017-04-25 12:00:03 0 0
2017-04-25 12:00:04 82.0 44.0
2017-04-25 12:00:05 0 0
2017-04-25 12:00:06 82.0 0
2017-04-25 12:00:07 0 0
2017-04-25 12:00:08 82.0 44.0
There are additional sensors that we will add in the future ... light, CO2, so almost every second will have a column of data over time.
We also want to do some analysis on how the frequency of the various sensors is able to collect data and their accuracy, hence the use of the main data set.
You all rock! Thanks for your help.
source to share
I believe the solution to your problem would be using pd.join ().
df_joined = ugt.join(ugh, how='outer')
temperature humidity
2017-06-13 22:06:03 82.0 63.0
2017-06-13 22:06:20 82.4 NaN
2017-06-13 22:06:21 NaN 63.0
2017-06-13 22:06:37 82.4 NaN
2017-06-13 22:06:38 NaN 63.0
2017-06-13 22:06:57 82.4 63.0
After the connection fills the main dataframe by looping through the combined dataframe and using an index to define each row:
for index, row in df_joined.iterrows():
df_master.loc[index,'humidity'] = row['humidity']
df_master.loc[index,'temperature'] = row['temperature']
I have no output because I didn't create a master framework, but it should work
source to share
temp (temperature):
datetime temperature
0 2017-06-13 22:20:11.309 82.4
1 2017-06-13 22:19:54.004 82.4
2 2017-06-13 22:19:36.661 82.4
3 2017-06-13 22:19:19.359 82.4
wet frame:
datetime humidity
0 2017-06-13 22:07:30.723 63.0
1 2017-06-13 22:07:13.448 63.0
2 2017-06-13 22:06:56.115 63.0
3 2017-06-13 22:06:38.806 63.0
temp.datetime = pd.to_datetime(temp.datetime) #convert to datetime dtype
temp.set_index('datetime', inplace=True) #make it the index
temp.index = temp.index.round('S') #and now round to the second
Now the temp time frame looks like this:
temperature
datetime
2017-06-13 22:20:11 82.4
2017-06-13 22:19:54 82.4
2017-06-13 22:19:37 82.4
2017-06-13 22:19:19 82.4
Do the same for wet df:
humid.datetime = pd.to_datetime(humid.datetime)
humi.set_index('datetime', inplace=True)
humid.index = humid.index.round('S')
Now wet:
humidity
datetime
2017-06-13 22:07:31 63.0
2017-06-13 22:07:13 63.0
2017-06-13 22:06:56 63.0
2017-06-13 22:06:39 63.0
Reindex temp, replace dates as you see fit:
temp = temp.reindex(pd.DatetimeIndex(start='2017-06-13 22:00', end='2017-06-13 22:20', freq='S'))
temp.head()
temperature
2017-06-13 22:00:00 NaN
2017-06-13 22:00:01 NaN
2017-06-13 22:00:02 NaN
2017-06-13 22:00:03 NaN
2017-06-13 22:00:04 NaN
And now on the left join:
out = pd.merge(temp, humid, left_index=True, right_index=True, how='left')
out.head():
temperature humidity
2017-06-13 22:00:00 NaN NaN
2017-06-13 22:00:01 NaN NaN
2017-06-13 22:00:02 NaN NaN
2017-06-13 22:00:03 NaN NaN
2017-06-13 22:00:04 NaN NaN
Make sure this actually worked:
out.loc['2017-06-13 22:07:31']
temperature humidity
2017-06-13 22:07:31 NaN 63.0
Hooray!
source to share