What is the measure of how well the data is *centered* on the prediction line in lm?

Question

I have two datasets that I fit with the R command lm. In the first plot below (left), the data is not centered on the red line, but in the second plot (right), it is centered on the line.

[Figures: Data1 (left), Data2 (right)]

My questions:

What is the measure of how well the data is centered on the line?
How do I extract this from the data structure?

The code I'm using to plot this data is simple:

 data <- read.table("myfile.txt")
 dat1x <- data$x1
 dat1y <- data$y1


 # plot left figure
 dat1_lm <- lm(dat1x ~ dat1y)
 plot(dat1x ~ dat1y)
 abline(coef(dat1_lm), col = "red")
 dat1_lm.r2 <- summary(dat1_lm)$adj.r.squared

 # repeat the same for the right figure
 dat2x <- data$x2
 dat2y <- data$y2
 dat2_lm <- lm(dat2x ~ dat2y)
 plot(dat2x ~ dat2y)
 abline(coef(dat2_lm), col = "red")
 dat2_lm.r2 <- summary(dat2_lm)$adj.r.squared

Update: plot with RMSE:

[Figure: F1g1]

I'm looking for a measure that shows that the right figure is better than the left one, based on how well the data is centered along the prediction line.

r statistics lm

neversaint, Jan 29 '13 at 10:23

2 answers

Paul Hiemstra · Answer 1 · 2013-01-29T10:29:01+0000

R-squared gives the goodness of fit of the line, i.e. the percentage of variation in the dataset that is explained by the linear model. Another way to explain R-squared is how much better the model performs than the mean model. The p-values give the significance of the fit, that is, whether the coefficients of the linear model are significantly different from zero.

To extract these values:

dat = data.frame(a = runif(100), b = runif(100))
lm_obj = lm(a~b, dat)
rsq = summary(lm_obj)[["r.squared"]]
p_value = summary(lm_obj)[["coefficients"]]["b","Pr(>|t|)"]
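The "better than the mean model" reading of R-squared can be checked by hand. The following is only a minimal sketch: the simulated data, the seed, and the names ss_res, ss_tot and rsq_by_hand are assumptions made for illustration, not part of the original answer.

```r
# Sketch: R-squared as the improvement of the linear model over the mean model.
set.seed(1)                                  # assumed seed, for reproducibility only
dat <- data.frame(b = runif(100))
dat$a <- 2 * dat$b + rnorm(100, sd = 0.3)    # hypothetical data with a linear trend
lm_obj <- lm(a ~ b, dat)

ss_res <- sum(residuals(lm_obj)^2)           # squared error left by the linear model
ss_tot <- sum((dat$a - mean(dat$a))^2)       # squared error left by the mean model
rsq_by_hand <- 1 - ss_res / ss_tot

# Compare with the value reported by summary()
all.equal(rsq_by_hand, summary(lm_obj)[["r.squared"]])
```

A value near 1 means the line removes most of the error that the mean model leaves behind; a value near 0 means the line is hardly better than predicting the mean everywhere.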

Alternatively, you can calculate the RMSE between observations and linear model results:

rmse = sqrt(mean((dat$a - predict(lm_obj))^2))

Please note that this is the RMSE between a and the linear model. If you want the RMSE between a and b:

rmse = sqrt(mean((dat$a - dat$b)^2))
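As a rough cross-check, the RMSE against the model can also be obtained from the residuals, and it is closely related to the residual standard error that summary() reports. This is a sketch under assumptions: the seed and the names rmse_predict, rmse_residuals and sigma_hat are introduced here for illustration only.

```r
# Sketch: RMSE of a linear model versus its residual standard error.
set.seed(1)                                  # assumed seed, for reproducibility only
dat <- data.frame(a = runif(100), b = runif(100))
lm_obj <- lm(a ~ b, dat)

rmse_predict   <- sqrt(mean((dat$a - predict(lm_obj))^2))  # as in the answer
rmse_residuals <- sqrt(mean(residuals(lm_obj)^2))          # same quantity via residuals

# The residual standard error divides by the degrees of freedom (n - p)
# instead of n, so it is slightly larger than the RMSE.
sigma_hat <- summary(lm_obj)[["sigma"]]
n <- nrow(dat)
p <- length(coef(lm_obj))

all.equal(rmse_predict, rmse_residuals)
all.equal(sigma_hat, rmse_predict * sqrt(n / (n - p)))
```

Both quantities are in the units of the response, so they are only comparable between datasets that are measured on the same scale.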

Julius · Answer 2 · 2013-01-29T13:49:05+0000

What you might be looking for is MAPE (mean absolute percentage error). Its advantages over other measures of accuracy (MSE, MPE, RMSE, MAE, etc.) are that MAPE is independent of levels, it measures absolute errors, and it has a clear meaning. You can use the forecast package to get some of these measures:

library(forecast)
data <- data.frame(y = rnorm(100), x = rnorm(100))
model <- lm(y ~ x, data)
accuracy(model)
#           ME         RMSE          MAE          MPE         MAPE 
# 5.455773e-18 1.019446e+00 7.957585e-01 1.198441e+02 1.205495e+02 
accuracy(model)["MAPE"]
#     MAPE 
# 120.5495

or

mape <- function(f, x) mean(abs(1 - f / x) * 100)
mape(fitted(model), data$y)
# [1] 120.5495

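On the other hand, a signed measure such as MPE (mean percentage error) might seem better for showing how well the data is centered around the forecast line. For example, let the forecast be p <- rep(2, 20) and the data y <- rep(c(3, 1), 10). The errors y - p alternate between +1 and -1, so they cancel in a signed average (mean error, ME = 0) but not in an absolute one (mean absolute error, MAE = 1). With the percentage definitions used in this answer, which divide by the data, the cancellation is only partial: MPE is about -33.3%, while MAPE is about 66.7%.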
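The small example above can be evaluated directly. This sketch reuses the mpe, mape and me definitions that appear in this answer; mae is a hypothetical helper added here only for comparison. Because the percentage measures divide by the data, only the unscaled signed error cancels exactly.

```r
# Sketch: signed versus absolute error measures on the toy example.
p <- rep(2, 20)                # forecast
y <- rep(c(3, 1), 10)          # data alternating above and below the forecast

me   <- function(f, x) mean(x - f)                  # mean error (signed)
mae  <- function(f, x) mean(abs(x - f))             # hypothetical helper: mean absolute error
mpe  <- function(f, x) mean((1 - f / x) * 100)      # mean percentage error (signed)
mape <- function(f, x) mean(abs(1 - f / x) * 100)   # mean absolute percentage error

me(p, y)     # 0: the +1 and -1 errors cancel
mae(p, y)    # 1: absolute errors do not cancel
mpe(p, y)    # about -33.33: dividing by the data weights the low points more
mape(p, y)   # about 66.67
```

If the goal is purely to test whether the points sit symmetrically around the line, the unscaled signed measure is therefore the most direct one in this example.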

So you have to decide what you really want to show: MAPE is better as a measure of accuracy, but for your case a signed measure such as MPE, as in the example above, might be the better choice.

Update: if centering is indeed what you want to check, you should look at measures that sum the errors without squaring them, taking absolute values, etc. That is, you might also want to take a look at ME (mean error), which is slightly simpler than MPE but has a different interpretation. Here's an example similar to the first of yours:

library(ggplot2)  # needed for the plot at the end

mpe <- function(f, x) mean((1 - f / x) * 100)
mape <- function(f, x) mean(abs(1 - f / x) * 100)
me <- function(f, x) mean(x - f)

set.seed(20130130)
y1 <- rnorm(1000, mean = 10, sd = 1.5) * (1:1000) / 300
y2 <- rnorm(1000, mean = 10, sd = 1.7) * (1:1000) / 250
pr <- (1:1000) / 30

data <- data.frame(y = c(y1, y2),
                   x = 1:1000,
                   prediction = rep(pr, 2),
                   id = rep(1:2, each = 1000))

results <- data.frame(MAPE = c(mape(pr, y1), mape(pr, y2)),
                      MPE = c(mpe(pr, y1), mpe(pr, y2)),
                      ME = c(me(pr, y1), me(pr, y2)),
                      id = 1:2)
results <- round(results, 2)

ggplot(data, aes(x, y)) + geom_line() + theme_bw() +
  facet_wrap(~ id) + geom_line(aes(y = prediction), colour = "red") +
  theme(strip.background = element_blank()) + labs(y = NULL, x = NULL) +
  geom_text(data = results, x = 150, y = 50, aes(label = paste("MAPE:", MAPE))) +
  geom_text(data = results, x = 150, y = 45, aes(label = paste("MPE:", MPE))) + 
  geom_text(data = results, x = 150, y = 40, aes(label = paste("ME:", ME)))
