Resampling and combining data with python

Hi I have created a dataFrame dictionary with this code

import os
import pandas 
import glob

path="G:\my_dir\*"
dataList={}

for files in glob.glob(path):
    dataList[files]=(read_csv(files,sep=";",index_col='Date'))

      

Different data frames, available as a courtesy, have different timing samples. DataFrame (A) example -

Date               Volume   Value 
2014-01-04 06:00:02 6062   108000.0
2014-01-04 06:06:05 6062   107200.0
2014-01-04 06:12:07 6062   97400.0
2014-01-04 06:18:10 6062   99200.0
2014-01-04 06:24:12 6062   91300.0
2014-01-04 06:30:14 6062   84100.0
2014-01-04 06:36:17 6062   57000.0

      

DataFrame (B) example -

Date                Volume Value 
2014-01-04 05:52:50  6062   4.7
2014-01-04 05:58:53  6062   4.7
2014-01-04 06:04:56  6062   4.9 
2014-01-04 06:10:58  6062   5.1
2014-01-04 06:17:01  6062   5.2
2014-01-04 06:23:03  6062   5.2
2014-01-04 06:29:05  6062   5.5
2014-01-04 06:35:08  6062   5.5

      

Different data frames do not have the same number of lines. I want to combine different data frames in one:

    Data                 Volume       B               A               Value(DataframeN)
2014/04/01 05:52:50      6062        4.70            NaN
2014/04/01 05:58:53      6062        4.70            NaN
2014/04/01 06:04:56      6062        4.90            107465.51
2014/04/01 06:10:58      6062        5.10            100652.60
2014/04/01 06:17:01      6062        5.20            98899.57
2014/04/01 06:23:03      6062        5.20            92618.56
2014/04/01 06:29:05      6062        5.50            85301.73
2014/04/01 06:35:08      6062        5.50            61523.06

      

I made it easy with Matlab using command

ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));

      

I need to do this for multiple sets of csv files, so I want to automate the process with python.

Tpx

+3


source to share


1 answer


You can concat

two DataFrames

, interpolate

then reindex

on DataFrame

which one you want.

I am assuming we have a specific number DataFrames

, where Date

is a DateTimeIndex

in all of them. I'll use two in this example since you used two questions, but the code will work for any number.

df_a

:

                     Volume   Value
Date                               
2014-01-04 06:00:02    6062  108000
2014-01-04 06:06:05    6062  107200
2014-01-04 06:12:07    6062   97400
2014-01-04 06:18:10    6062   99200
2014-01-04 06:24:12    6062   91300
2014-01-04 06:30:14    6062   84100
2014-01-04 06:36:17    6062   57000

      

df_b

:

                     Volume  Value
Date                              
2014-01-04 05:52:50    6062    4.7
2014-01-04 05:58:53    6062    4.7
2014-01-04 06:04:56    6062    4.9
2014-01-04 06:10:58    6062    5.1
2014-01-04 06:17:01    6062    5.2
2014-01-04 06:23:03    6062    5.2
2014-01-04 06:29:05    6062    5.5
2014-01-04 06:35:08    6062    5.5

      

And I will give them as an dict

example. You read them directly in dict

, so you don't need to take this step. I just want to show you how my example is formatted dict

. dict

keys

it doesn't matter, any valid one will work dict

key

:

dataList = {'a': df_a,
            'b': df_b}

      

This will lead us to where you are now, with mine dataList

, which will hopefully be in the same format as yours.

The first thing you need to do is merge DataFrames

. I am using column names dict

keys

like MultiIndex

so that you can keep track of which instance of a given column came from DataFrame

. You can do it like this:

df = pd.concat(dataList.values(), axis=1, keys=dataList.keys())

      

This gives you the DataFrame

following:

                         a              b      
                    Volume   Value Volume Value
Date                                           
2014-01-04 05:52:50    NaN     NaN   6062   4.7
2014-01-04 05:58:53    NaN     NaN   6062   4.7
2014-01-04 06:00:02   6062  108000    NaN   NaN
2014-01-04 06:04:56    NaN     NaN   6062   4.9
2014-01-04 06:06:05   6062  107200    NaN   NaN
2014-01-04 06:10:58    NaN     NaN   6062   5.1
2014-01-04 06:12:07   6062   97400    NaN   NaN
2014-01-04 06:17:01    NaN     NaN   6062   5.2
2014-01-04 06:18:10   6062   99200    NaN   NaN
2014-01-04 06:23:03    NaN     NaN   6062   5.2
2014-01-04 06:24:12   6062   91300    NaN   NaN
2014-01-04 06:29:05    NaN     NaN   6062   5.5
2014-01-04 06:30:14   6062   84100    NaN   NaN
2014-01-04 06:35:08    NaN     NaN   6062   5.5
2014-01-04 06:36:17   6062   57000    NaN   NaN

      



Then you need to interpolate to fill in the missing values. I interpolate with 'time'

mode

so that it handles the time indices correctly:

df = df.interpolate('time')

      

This gives you the DataFrame

following:

                         a                     b          
                    Volume          Value Volume     Value
Date                                                      
2014-01-04 05:52:50    NaN            NaN   6062  4.700000
2014-01-04 05:58:53    NaN            NaN   6062  4.700000
2014-01-04 06:00:02   6062  108000.000000   6062  4.738017
2014-01-04 06:04:56   6062  107352.066116   6062  4.900000
2014-01-04 06:06:05   6062  107200.000000   6062  4.938122
2014-01-04 06:10:58   6062   99267.955801   6062  5.100000
2014-01-04 06:12:07   6062   97400.000000   6062  5.119008
2014-01-04 06:17:01   6062   98857.851240   6062  5.200000
2014-01-04 06:18:10   6062   99200.000000   6062  5.200000
2014-01-04 06:23:03   6062   92805.801105   6062  5.200000
2014-01-04 06:24:12   6062   91300.000000   6062  5.257182
2014-01-04 06:29:05   6062   85472.375691   6062  5.500000
2014-01-04 06:30:14   6062   84100.000000   6062  5.500000
2014-01-04 06:35:08   6062   62151.239669   6062  5.500000
2014-01-04 06:36:17   6062   57000.000000   6062  5.500000

      

I think it is best to stop here as you keep all data from all files csv

. But you said you only want the time points from the longest csv

. To get this, you need to find the longest DataFrame

one and then get the lines that match its indices. Finding the longest one is DataFrame

easy, you just find the one that has the maximum length. Keeping only the timing in this is index

also easy, you just slice it with index

(you use a method loc

to slice this way).

longind = max(dataList.values(), key=len).index
df = df.loc[longind]

      

This gives you the following final DataFrame

:

                         a                     b      
                    Volume          Value Volume Value
Date                                                  
2014-01-04 05:52:50    NaN            NaN   6062   4.7
2014-01-04 05:58:53    NaN            NaN   6062   4.7
2014-01-04 06:04:56   6062  107352.066116   6062   4.9
2014-01-04 06:10:58   6062   99267.955801   6062   5.1
2014-01-04 06:17:01   6062   98857.851240   6062   5.2
2014-01-04 06:23:03   6062   92805.801105   6062   5.2
2014-01-04 06:29:05   6062   85472.375691   6062   5.5
2014-01-04 06:35:08   6062   62151.239669   6062   5.5

      

This can be combined into one line if you want:

df = pd.concat(dataList.values(), axis=1, keys=dataList.keys()).interpolate('time').loc[max(dataList.values(), key=len).index]

      

Or perhaps a somewhat crisp 4 lines:

names = dataList.keys()
dfs = dataList.values()
longind = max(dfs, key=len).index
df = pd.concat(dfs, axis=1, keys=names).interpolate('time').loc[longind]

      

I am not sure why my final results differ from what you are showing. I followed your example in MATLAB

(R2015A) myself and got the same results as me, so I suspect you have created the final data with a different dataset than this example.

+4


source







All Articles