Resampling and combining data with Python
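Hi, I have created a dictionary of DataFrames with this code:
import glob
import pandas as pd
# raw string so the backslashes in the Windows path are not treated as escapes
path = r"G:\my_dir\*"
dataList = {}
for file in glob.glob(path):
    # parse_dates=True turns the 'Date' index into datetimes
    dataList[file] = pd.read_csv(file, sep=";", index_col='Date', parse_dates=True)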
The data frames have different sampling times. DataFrame (A) example -
Date Volume Value
2014-01-04 06:00:02 6062 108000.0
2014-01-04 06:06:05 6062 107200.0
2014-01-04 06:12:07 6062 97400.0
2014-01-04 06:18:10 6062 99200.0
2014-01-04 06:24:12 6062 91300.0
2014-01-04 06:30:14 6062 84100.0
2014-01-04 06:36:17 6062 57000.0
DataFrame (B) example -
Date Volume Value
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
The data frames do not have the same number of rows. I want to combine them into one:
Date Volume B A Value(DataframeN)
2014/04/01 05:52:50 6062 4.70 NaN
2014/04/01 05:58:53 6062 4.70 NaN
2014/04/01 06:04:56 6062 4.90 107465.51
2014/04/01 06:10:58 6062 5.10 100652.60
2014/04/01 06:17:01 6062 5.20 98899.57
2014/04/01 06:23:03 6062 5.20 92618.56
2014/04/01 06:29:05 6062 5.50 85301.73
2014/04/01 06:35:08 6062 5.50 61523.06
I did this easily in MATLAB using these commands:
ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));
I need to do this for multiple sets of csv files, so I want to automate the process with python.
Thanks
You can concat the DataFrames, interpolate, and then reindex on whichever DataFrame you want.
I am assuming we have some number of DataFrames, where Date is a DatetimeIndex in all of them. I'll use two in this example since you used two in the question, but the code will work for any number.
df_a:
Volume Value
Date
2014-01-04 06:00:02 6062 108000
2014-01-04 06:06:05 6062 107200
2014-01-04 06:12:07 6062 97400
2014-01-04 06:18:10 6062 99200
2014-01-04 06:24:12 6062 91300
2014-01-04 06:30:14 6062 84100
2014-01-04 06:36:17 6062 57000
df_b:
Volume Value
Date
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
I will put them in a dict as an example. You read yours directly into a dict, so you don't need this step. I just want to show you how my example dict is laid out. The dict keys don't matter; any valid dict key will work:
dataList = {'a': df_a,
'b': df_b}
This gets us to where you are now, with my dataList, which will hopefully be in the same format as yours.
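If you want to reproduce this setup without the original files, here is a minimal sketch that builds the two frames from the sample rows in the question. The inline CSV text and the helper name load are assumptions made for illustration; the separator and the Date index column follow the question's code.

```python
import io
import pandas as pd

# Sample rows copied from the question, written with the ';' separator
# that the question's read_csv call uses.
CSV_A = """Date;Volume;Value
2014-01-04 06:00:02;6062;108000.0
2014-01-04 06:06:05;6062;107200.0
2014-01-04 06:12:07;6062;97400.0
2014-01-04 06:18:10;6062;99200.0
2014-01-04 06:24:12;6062;91300.0
2014-01-04 06:30:14;6062;84100.0
2014-01-04 06:36:17;6062;57000.0
"""

CSV_B = """Date;Volume;Value
2014-01-04 05:52:50;6062;4.7
2014-01-04 05:58:53;6062;4.7
2014-01-04 06:04:56;6062;4.9
2014-01-04 06:10:58;6062;5.1
2014-01-04 06:17:01;6062;5.2
2014-01-04 06:23:03;6062;5.2
2014-01-04 06:29:05;6062;5.5
2014-01-04 06:35:08;6062;5.5
"""

def load(text):
    # Hypothetical helper: parse_dates=True asks pandas to parse the
    # index column as datetimes, which gives a DatetimeIndex.
    return pd.read_csv(io.StringIO(text), sep=";", index_col="Date", parse_dates=True)

df_a = load(CSV_A)
df_b = load(CSV_B)
dataList = {"a": df_a, "b": df_b}

# The rest of the answer relies on every frame having a DatetimeIndex.
for name, frame in dataList.items():
    assert isinstance(frame.index, pd.DatetimeIndex), name
```

If the check fails for one of your real files, the Date column was probably read as plain strings, and time-based interpolation will not work on it.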
The first thing you need to do is combine the DataFrames. I use the dict keys as the top level of a column MultiIndex so that you can keep track of which DataFrame a given column came from. You can do it like this:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys())
This gives you the following DataFrame:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:00:02 6062 108000 NaN NaN
2014-01-04 06:04:56 NaN NaN 6062 4.9
2014-01-04 06:06:05 6062 107200 NaN NaN
2014-01-04 06:10:58 NaN NaN 6062 5.1
2014-01-04 06:12:07 6062 97400 NaN NaN
2014-01-04 06:17:01 NaN NaN 6062 5.2
2014-01-04 06:18:10 6062 99200 NaN NaN
2014-01-04 06:23:03 NaN NaN 6062 5.2
2014-01-04 06:24:12 6062 91300 NaN NaN
2014-01-04 06:29:05 NaN NaN 6062 5.5
2014-01-04 06:30:14 6062 84100 NaN NaN
2014-01-04 06:35:08 NaN NaN 6062 5.5
2014-01-04 06:36:17 6062 57000 NaN NaN
Then you need to interpolate to fill in the missing values. I interpolate with the 'time' method so that the spacing between timestamps in the index is taken into account:
df = df.interpolate('time')
This gives you the following DataFrame:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.700000
2014-01-04 05:58:53 NaN NaN 6062 4.700000
2014-01-04 06:00:02 6062 108000.000000 6062 4.738017
2014-01-04 06:04:56 6062 107352.066116 6062 4.900000
2014-01-04 06:06:05 6062 107200.000000 6062 4.938122
2014-01-04 06:10:58 6062 99267.955801 6062 5.100000
2014-01-04 06:12:07 6062 97400.000000 6062 5.119008
2014-01-04 06:17:01 6062 98857.851240 6062 5.200000
2014-01-04 06:18:10 6062 99200.000000 6062 5.200000
2014-01-04 06:23:03 6062 92805.801105 6062 5.200000
2014-01-04 06:24:12 6062 91300.000000 6062 5.257182
2014-01-04 06:29:05 6062 85472.375691 6062 5.500000
2014-01-04 06:30:14 6062 84100.000000 6062 5.500000
2014-01-04 06:35:08 6062 62151.239669 6062 5.500000
2014-01-04 06:36:17 6062 57000.000000 6062 5.500000
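To see what the 'time' interpolation is doing, one value from the table above can be recomputed by hand. This is only a sketch of the arithmetic for the a/Value entry at 06:04:56, using the two neighbouring samples from df_a; the variable names are made up for the illustration.

```python
import pandas as pd

t0 = pd.Timestamp("2014-01-04 06:00:02")  # known a/Value sample before the gap
t1 = pd.Timestamp("2014-01-04 06:06:05")  # known a/Value sample after the gap
t = pd.Timestamp("2014-01-04 06:04:56")   # timestamp from df_b that needs a value
v0, v1 = 108000.0, 107200.0

# Dividing one Timedelta by another gives a plain float: the share of
# the interval that has elapsed at time t (294 s out of 363 s).
fraction = (t - t0) / (t1 - t0)

# Linear interpolation weighted by elapsed time.
estimate = v0 + fraction * (v1 - v0)
print(round(estimate, 6))  # 107352.066116
```

The result agrees with the 107352.066116 shown in the table for that row.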
I think it is best to stop here, since this keeps all the data from all the csv files. But you said you only want the time points from the longest csv. To get this, you need to find the longest DataFrame and then keep the rows that match its index. Finding the longest DataFrame is easy: you just take the one with the maximum length. Keeping only the time points in its index is also easy: you just slice with that index (you use loc to slice this way).
longind = max(dataList.values(), key=len).index
df = df.loc[longind]
This gives you the following final DataFrame:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:04:56 6062 107352.066116 6062 4.9
2014-01-04 06:10:58 6062 99267.955801 6062 5.1
2014-01-04 06:17:01 6062 98857.851240 6062 5.2
2014-01-04 06:23:03 6062 92805.801105 6062 5.2
2014-01-04 06:29:05 6062 85472.375691 6062 5.5
2014-01-04 06:35:08 6062 62151.239669 6062 5.5
This can be combined into one line if you want:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys()).interpolate('time').loc[max(dataList.values(), key=len).index]
Or, perhaps somewhat cleaner, in 4 lines:
names = dataList.keys()
dfs = dataList.values()
longind = max(dfs, key=len).index
df = pd.concat(dfs, axis=1, keys=names).interpolate('time').loc[longind]
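Since you want to repeat this for several sets of csv files, the same steps could be wrapped in a function. The sketch below is one possible way to do that, not the only one; the function name combine_on_longest and the tiny example frames are made up for illustration. It passes the dict straight to pd.concat, which uses the dict keys as the top column level.

```python
import pandas as pd

def combine_on_longest(frames):
    """Combine a dict of DataFrames (each with a DatetimeIndex) and keep
    only the timestamps of the longest one. Hypothetical helper."""
    # concat on axis=1 aligns on the union of all timestamps; the dict
    # keys become the top level of the column MultiIndex.
    combined = pd.concat(frames, axis=1).interpolate(method="time")
    # The frame with the most rows supplies the final timestamps.
    longest_index = max(frames.values(), key=len).index
    return combined.loc[longest_index]

# Small made-up frames to show the call.
idx_a = pd.to_datetime(["2014-01-04 06:00:00", "2014-01-04 06:10:00"])
idx_b = pd.to_datetime(
    ["2014-01-04 06:00:00", "2014-01-04 06:05:00", "2014-01-04 06:10:00"]
)
frames = {
    "a": pd.DataFrame({"Value": [0.0, 10.0]}, index=idx_a),
    "b": pd.DataFrame({"Value": [1.0, 2.0, 3.0]}, index=idx_b),
}

result = combine_on_longest(frames)
print(result)
```

Here frame a has no sample at 06:05:00, so its value there is interpolated halfway between 0.0 and 10.0. Calling the function once per set of files, with a dict like your dataList, gives one combined DataFrame per set.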
I am not sure why my final results differ from what you are showing. I ran your example in MATLAB (R2015a) myself and got the same results as mine, so I suspect you created the final data with a different dataset than this example.
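If you only need one series resampled onto another frame's timestamps, which is the shape of the MATLAB call in the question, a different technique is to interpolate directly with numpy.interp instead of concat plus interpolate. This is a sketch under assumptions: the helper name resample_onto is made up, the source index must be sorted in increasing order, and timestamps outside the source range are set to NaN to match the desired output in the question.

```python
import numpy as np
import pandas as pd

def resample_onto(source, target_index):
    """Linearly interpolate a Series onto target_index.
    Hypothetical helper; returns NaN outside the source's time range."""
    # np.interp works on plain numbers, so express every timestamp as
    # seconds elapsed since the first target timestamp.
    origin = target_index[0]
    x_src = (source.index - origin).total_seconds()
    x_new = (target_index - origin).total_seconds()
    values = np.interp(
        x_new, x_src, source.to_numpy(dtype=float), left=np.nan, right=np.nan
    )
    return pd.Series(values, index=target_index, name=source.name)

# Two samples from df_a and two timestamps from df_b.
a_value = pd.Series(
    [108000.0, 107200.0],
    index=pd.to_datetime(["2014-01-04 06:00:02", "2014-01-04 06:06:05"]),
    name="Value",
)
target = pd.to_datetime(["2014-01-04 05:58:53", "2014-01-04 06:04:56"])

res = resample_onto(a_value, target)
print(res)
```

The first target timestamp lies before the first sample of a_value, so it comes out as NaN; the second is interpolated between the two samples and agrees with the pandas result above.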