Combine data by date after truncation

Thanks for looking at this ....

It is necessary to reduce the accuracy of the IoT Sensor data timestamps and merge.

I have two csv files with the following data

CSV-1

datetime,temperature
2017-06-13 22:20:11.309,82.4
2017-06-13 22:19:54.004,82.4
2017-06-13 22:19:36.661,82.4
2017-06-13 22:19:19.359,82.4

      

CSV-2

datetime,humidity
2017-06-13 22:07:30.723,63.0
2017-06-13 22:07:13.448,63.0
2017-06-13 22:06:56.115,63.0
2017-06-13 22:06:38.806,63.0

      

Note that datetime entries are relative to the millisecond. I am using the following code to reduce the precision to seconds.

ugt = pd.read_csv('ugt.csv', parse_dates=True, index_col=0)
ugh = pd.read_csv('ugh.csv', parse_dates=True, index_col=0)

ugt.index = ugt.index.map(lambda x: x.replace(microsecond=0))
ugh.index = ugh.index.map(lambda x: x.replace(microsecond=0))

      

This creates the following data frames:

                     temperature
datetime                        
2017-06-13 22:06:57         82.4 <---
2017-06-13 22:06:37         82.4
2017-06-13 22:06:20         82.4
2017-06-13 22:06:03         82.0 <---

                 humidity
datetime                     
2017-06-13 22:06:57      63.0 <---
2017-06-13 22:06:38      63.0
2017-06-13 22:06:21      63.0
2017-06-13 22:06:03      63.0 <---

      

Please note that some timestamps are the same (see <---) with the second, others do not. this is due to the limitation of the ability of the various sensors to take readings. There is no constant frequency.

We then create a master framework that is populated with strings in the afternoon for the period of time that we collected data from all sensors.

                     temperature  humidity
2017-04-25 12:00:00            0         0
2017-04-25 12:00:01            0         0
2017-04-25 12:00:02            0         0
2017-04-25 12:00:03            0         0
2017-04-25 12:00:04            0         0

      

We can't figure out how to use pandas concat, merge, add two csv files to the main datetime based dataframe. We need the following:

                     temperature  humidity
2017-04-25 12:00:00            0         0
2017-04-25 12:00:01            82.0      0
2017-04-25 12:00:02            0         44.0
2017-04-25 12:00:03            0         0
2017-04-25 12:00:04            82.0      44.0
2017-04-25 12:00:05            0         0
2017-04-25 12:00:06            82.0      0
2017-04-25 12:00:07            0         0
2017-04-25 12:00:08            82.0      44.0

      

There are additional sensors that we will add in the future ... light, CO2, so almost every second will have a column of data over time.

We also want to do some analysis on how the frequency of the various sensors is able to collect data and their accuracy, hence the use of the main data set.

You all rock! Thanks for your help.

+3


source to share


2 answers


I believe the solution to your problem would be using pd.join ().

df_joined = ugt.join(ugh, how='outer')

                      temperature    humidity                     
2017-06-13 22:06:03          82.0    63.0
2017-06-13 22:06:20          82.4    NaN
2017-06-13 22:06:21           NaN    63.0
2017-06-13 22:06:37          82.4    NaN
2017-06-13 22:06:38           NaN    63.0
2017-06-13 22:06:57          82.4    63.0

      

After the connection fills the main dataframe by looping through the combined dataframe and using an index to define each row:



for index, row in df_joined.iterrows():
    df_master.loc[index,'humidity'] = row['humidity']
    df_master.loc[index,'temperature'] = row['temperature']

      

I have no output because I didn't create a master framework, but it should work

0


source


temp (temperature):

                 datetime  temperature
0  2017-06-13 22:20:11.309         82.4
1  2017-06-13 22:19:54.004         82.4
2  2017-06-13 22:19:36.661         82.4
3  2017-06-13 22:19:19.359         82.4

      

wet frame:

                 datetime  humidity
0  2017-06-13 22:07:30.723      63.0
1  2017-06-13 22:07:13.448      63.0
2  2017-06-13 22:06:56.115      63.0
3  2017-06-13 22:06:38.806      63.0



temp.datetime = pd.to_datetime(temp.datetime) #convert to datetime dtype
temp.set_index('datetime', inplace=True) #make it the index
temp.index = temp.index.round('S') #and now round to the second

      

Now the temp time frame looks like this:

                     temperature
datetime                        
2017-06-13 22:20:11         82.4
2017-06-13 22:19:54         82.4
2017-06-13 22:19:37         82.4
2017-06-13 22:19:19         82.4

      

Do the same for wet df:

humid.datetime = pd.to_datetime(humid.datetime) 
humi.set_index('datetime', inplace=True) 
humid.index = humid.index.round('S') 

      

Now wet:



                     humidity
datetime                     
2017-06-13 22:07:31      63.0
2017-06-13 22:07:13      63.0
2017-06-13 22:06:56      63.0
2017-06-13 22:06:39      63.0

      

Reindex temp, replace dates as you see fit:

temp = temp.reindex(pd.DatetimeIndex(start='2017-06-13 22:00', end='2017-06-13 22:20', freq='S'))
temp.head()

                     temperature
2017-06-13 22:00:00          NaN
2017-06-13 22:00:01          NaN
2017-06-13 22:00:02          NaN
2017-06-13 22:00:03          NaN
2017-06-13 22:00:04          NaN

      

And now on the left join:

out = pd.merge(temp, humid, left_index=True, right_index=True, how='left')

out.head():
                     temperature  humidity
2017-06-13 22:00:00          NaN       NaN
2017-06-13 22:00:01          NaN       NaN
2017-06-13 22:00:02          NaN       NaN
2017-06-13 22:00:03          NaN       NaN
2017-06-13 22:00:04          NaN       NaN

      

Make sure this actually worked:

out.loc['2017-06-13 22:07:31']
                     temperature  humidity
2017-06-13 22:07:31          NaN      63.0

      

Hooray!

+1


source







All Articles