How to apply equation to Pandas ByGroup file frame
At a time when I was banging my head against the keyboard for too long and I would appreciate any advice. My goal is to ingest an hourly temperature data file that has spaces in it. I want to fill in those gaps using linear regression with temperature from a nearby site. But I want to do it IN A YEAR AND A MONTH. So with the help of the people here, I was able to do a lot. Now I have applied the BY YEAR AND MONTH linear regression function like
Corr_grouped=DF_grouped.apply(lambda x: stats.linregress(x [Labels[0]], x [Labels[3]]))
This triggered the following, which has slope, hook, r_value, p_value, std_err outputs and displays as follows.
> 2010 1 (0.806190897504, 5.75672188581, 0.901179913935...
> 2 (0.739906365408, 8.05204746237, 0.894050112908...
> 3 (0.773199101365, 6.88505178697, 0.898475211997... ...
> 10 (0.87497855294, 4.10227036556, 0.939948762031,...
> 11 (0.793072407801, 6.38604194806, 0.923659962858...
I have been reading all day without finding or finding the exact solution. Now my question is how to apply this back to the original dataframe. I would like a new column in DF to apply linear regression y = mx + c to each row of the original data using column 3 as input, but for that use specific coefficients (slope, intercept) that are different for each YEAR and MONTH ... Any ideas are most welcome :) cheers Jason
EDIT: Great. So DF looks like this. It has timestamps every 30 minutes for several years. It has spaces (NaN), which can be 1 or more intervals. I need to fill in the blanks in the original column (T_original) using the neighbor station relationship (T_nearby). But this is not an easy replacement. The location is often some distance away and the temperatures are correlated but not the same (i.e. the temperature may be 2 ° C higher). SO, then the station T_nearby should be adjusted and then used to fill the gap.
T_original T_nearby
2010-01-01 00:00:00 25.87873 25.4
2010-01-01 00:30:00 25.73089 25.4
2010-01-01 01:00:00 25.56144 25.4
2010-01-01 01:30:00 NaN 25.4
2010-01-01 02:00:00 25.24789 25.6
2010-01-01 02:30:00 25.17758 25.4
2010-01-01 03:00:00 NaN 25.6
2010-01-01 03:30:00 NaN 25.6
2010-01-01 04:00:00 25.07633 25.6
2010-01-01 04:30:00 24.99211 25.5
I want to split the analysis into YEAR and MONTH. Thus, every month of every year, a linear regression is calculated between T_original and T_nearby. This gives a grouped object that has linear regression parameters. For example, year 2010 and month 1 intercept is 5.75 and the slope is 0.806.
So, I would like to apply this relationship for all Year = 2010 and Month = 1 to look like this. Then, for the rest of the DF, the same approach is applied for every month of every year.
T_original T_nearby T_adjusted
1/01/2010 0:00 25.87873 25.4 26.2224
1/01/2010 0:30 25.73089 25.4 26.2224
1/01/2010 1:00 25.56144 25.4 26.2224
1/01/2010 1:30 NaN 25.4 26.2224
1/01/2010 2:00 25.24789 25.6 26.3836
1/01/2010 2:30 25.17758 25.4 26.2224
1/01/2010 3:00 NaN 25.6 26.3836
1/01/2010 3:30 NaN 25.6 26.3836
1/01/2010 4:00 25.07633 25.6 26.3836
1/01/2010 4:30 24.99211 25.5 26.303
Then I'll use the T_adjusted column to fill in the gap in T_original. thanks jason
source to share
Your first step is to merge the grouped object with the DF. To do this, first create a general grouping column.
For a grouped object:
from datetime import date
grouped['common'] = grouped.index.map(lambda x : date(x[0],x[1],1))
For DF:
DF['common'] = DF.index.map(lambda x : date(x.year,x.month,1))
Now you can combine it:
merged = DF.merge(grouped) del merged['common']
I'm not sure what the exact formula for calculating the T_adjusted column is, but now that the regression parameters match the T_nearby values, they can be handled using array operations.
To fill in the gaps in T_original using T_adjusted:
merged['T_original']=merged['T_original'].combine_first(merged['T_adjusted'])
source to share