How to apply equation to Pandas ByGroup file frame

At a time when I was banging my head against the keyboard for too long and I would appreciate any advice. My goal is to ingest an hourly temperature data file that has spaces in it. I want to fill in those gaps using linear regression with temperature from a nearby site. But I want to do it IN A YEAR AND A MONTH. So with the help of the people here, I was able to do a lot. Now I have applied the BY YEAR AND MONTH linear regression function like

Corr_grouped=DF_grouped.apply(lambda x: stats.linregress(x [Labels[0]], x [Labels[3]]))

      

This triggered the following, which has slope, hook, r_value, p_value, std_err outputs and displays as follows.

> 2010  1     (0.806190897504, 5.75672188581, 0.901179913935...
>       2     (0.739906365408, 8.05204746237, 0.894050112908...
>       3     (0.773199101365, 6.88505178697, 0.898475211997... ...
>       10    (0.87497855294, 4.10227036556, 0.939948762031,...
>       11    (0.793072407801, 6.38604194806, 0.923659962858...

      

I have been reading all day without finding or finding the exact solution. Now my question is how to apply this back to the original dataframe. I would like a new column in DF to apply linear regression y = mx + c to each row of the original data using column 3 as input, but for that use specific coefficients (slope, intercept) that are different for each YEAR and MONTH ... Any ideas are most welcome :) cheers Jason

EDIT: Great. So DF looks like this. It has timestamps every 30 minutes for several years. It has spaces (NaN), which can be 1 or more intervals. I need to fill in the blanks in the original column (T_original) using the neighbor station relationship (T_nearby). But this is not an easy replacement. The location is often some distance away and the temperatures are correlated but not the same (i.e. the temperature may be 2 ° C higher). SO, then the station T_nearby should be adjusted and then used to fill the gap.

                        T_original  T_nearby
2010-01-01 00:00:00  25.87873       25.4
2010-01-01 00:30:00  25.73089       25.4
2010-01-01 01:00:00  25.56144       25.4
2010-01-01 01:30:00  NaN         25.4
2010-01-01 02:00:00  25.24789       25.6
2010-01-01 02:30:00  25.17758       25.4
2010-01-01 03:00:00  NaN         25.6
2010-01-01 03:30:00  NaN         25.6
2010-01-01 04:00:00  25.07633       25.6
2010-01-01 04:30:00  24.99211       25.5

      

I want to split the analysis into YEAR and MONTH. Thus, every month of every year, a linear regression is calculated between T_original and T_nearby. This gives a grouped object that has linear regression parameters. For example, year 2010 and month 1 intercept is 5.75 and the slope is 0.806.

So, I would like to apply this relationship for all Year = 2010 and Month = 1 to look like this. Then, for the rest of the DF, the same approach is applied for every month of every year.

    T_original      T_nearby    T_adjusted
1/01/2010 0:00  25.87873    25.4    26.2224
1/01/2010 0:30  25.73089    25.4    26.2224
1/01/2010 1:00  25.56144    25.4    26.2224
1/01/2010 1:30  NaN             25.4    26.2224
1/01/2010 2:00  25.24789    25.6    26.3836
1/01/2010 2:30  25.17758    25.4    26.2224
1/01/2010 3:00  NaN             25.6    26.3836
1/01/2010 3:30  NaN             25.6    26.3836
1/01/2010 4:00  25.07633    25.6    26.3836
1/01/2010 4:30  24.99211    25.5    26.303

      

Then I'll use the T_adjusted column to fill in the gap in T_original. thanks jason

+3


source to share


1 answer


Your first step is to merge the grouped object with the DF. To do this, first create a general grouping column.

For a grouped object:

from datetime import date
grouped['common'] = grouped.index.map(lambda x : date(x[0],x[1],1))

      

For DF:

DF['common'] = DF.index.map(lambda x : date(x.year,x.month,1))

      



Now you can combine it:

merged = DF.merge(grouped)
del merged['common']

      

I'm not sure what the exact formula for calculating the T_adjusted column is, but now that the regression parameters match the T_nearby values, they can be handled using array operations.

To fill in the gaps in T_original using T_adjusted:

merged['T_original']=merged['T_original'].combine_first(merged['T_adjusted'])

      

+1


source







All Articles