Linear regression in scikit-learn
I have a question about the LinearRegression model in scikit-learn
( http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html )
If we run the following code:
from sklearn import linear_model
import pandas as pd
import numpy as np
dates = pd.date_range('20000101', periods=100)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(100))
df['B'] = np.cumsum(np.random.randn(100))
df['C'] = np.cumsum(np.random.randn(100))
df['D'] = np.cumsum(np.random.randn(100))
df['E'] = np.cumsum(np.random.randn(100))
df['F'] = np.cumsum(np.random.randn(100))
y = ['A','B','C']
x = ['D','E','F']
ols = linear_model.LinearRegression(fit_intercept=True,
                                    normalize=True,
                                    copy_X=True,
                                    n_jobs=1)
ols.fit(df[x], df[y])
What is it doing here?
Are there three different OLS regressions:
1) OLS of df['A'] on df[['D','E','F']]
2) OLS of df['B'] on df[['D','E','F']]
3) OLS of df['C'] on df[['D','E','F']]
Or does it run a single joint OLS of df[['A','B','C']] on df[['D','E','F']]?
(I think the joint version is called SUR? Not sure ...)
I ran some tests to figure this out.
After running the code
ols.coef_
array([[-0.5273036 , 0.56382854, 0.24751725], # train for 'A'
[-0.10430077, 0.10671576, 0.18554053], # train for 'B'
[ 0.01481826, 0.03811442, 0.75333578]]) # train for 'C'
We can see that coef contains 3 arrays and each array has three parameters.
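A quick way to confirm that orientation (rows are targets, columns are features) is to check the shapes of `coef_` and `intercept_`. This sketch rebuilds a random-walk DataFrame like the one in the question (omitting the deprecated `normalize` argument):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same setup as the question: 100 rows of cumulative random walks, columns A-F
dates = pd.date_range('20000101', periods=100)
df = pd.DataFrame(np.cumsum(np.random.randn(100, 6), axis=0),
                  index=dates, columns=list('ABCDEF'))

ols = LinearRegression().fit(df[['D', 'E', 'F']], df[['A', 'B', 'C']])

# coef_ is (n_targets, n_features): one row of coefficients per target column
print(ols.coef_.shape)       # (3, 3)
# intercept_ holds one intercept per target
print(ols.intercept_.shape)  # (3,)
```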
Then we execute
a = linear_model.LinearRegression(fit_intercept=True,
                                  normalize=True,
                                  copy_X=True,
                                  n_jobs=1)
a.fit(df[x], df['A'])
a.coef_
array([-0.5273036 , 0.56382854, 0.24751725])
which gives us the same coefficient as the first array we got above
a.fit(df[x],df['B'])
a.coef_
array([-0.10430077, 0.10671576, 0.18554053])
which gives us the same coefficient as the second array we got above, etc.
So when you call ols.fit(df[x], df[y]), it fits three independent linear regressions, one for each target column in y.
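The per-target checks above can be collected into one self-contained script that fits each target column separately and compares against the multi-target fit; a fixed random seed is used here so the comparison is reproducible (again omitting the deprecated `normalize` argument):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
dates = pd.date_range('20000101', periods=100)
df = pd.DataFrame(np.cumsum(rng.randn(100, 6), axis=0),
                  index=dates, columns=list('ABCDEF'))

x, y = ['D', 'E', 'F'], ['A', 'B', 'C']

# Multi-target fit: one call, three independent OLS problems
multi = LinearRegression().fit(df[x], df[y])

# Fit each target on its own and compare row by row
for i, col in enumerate(y):
    single = LinearRegression().fit(df[x], df[col])
    assert np.allclose(multi.coef_[i], single.coef_)
    assert np.allclose(multi.intercept_[i], single.intercept_)

print("per-target fits match the multi-target fit")
```

If the regressions were coupled (as in SUR), the rows of `multi.coef_` would differ from the separately fitted coefficients; they do not.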