Estimating the error of success / failure data in R

I have success / failure data (trees that survived / died over a certain period) and would like to estimate the error from the binomial distribution associated with each of my observations (7 sites). So far I have used glm for this:

s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure

# for each observation I would calculate this error:

error <- vector()
z_scores <- vector()
p_value <- vector()

for (i in 1:7) {
    models <- glm(cbind(s[i], f[i]) ~ 1, family = 'binomial')
    error[i] <- summary(models)$coefficients[2]    # Std. Error of the intercept
    z_scores[i] <- summary(models)$coefficients[3] # z value
    p_value[i] <- summary(models)$coefficients[4]  # Pr(>|z|)
}


Would this be the best approach?

How is the probability of a binomial distribution estimated?

Please note that regardless of the number of successes and failures, my error is extremely high whenever s or f is 0.



1 answer


Here is some code to double-check most of your results (other than the marginal ones caused by zeros) without using glm, and below I explain what they mean.

s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure

#for each observation I would calculate this error: 

error <- vector()  
z_scores <- vector()  
p_value <- vector()  

for (i in 1:7) {
    models <- glm(cbind(s[i], f[i]) ~ 1, family = 'binomial')
    error[i] <- summary(models)$coefficients[2]
    z_scores[i] <- summary(models)$coefficients[3]
    p_value[i] <- summary(models)$coefficients[4]
}

## logit (log-odds) transform
logit <- function(x){
    log(x / (1 - x))
}

## derivative of the logit transform: 1 / (x * (1 - x))
dlogit <- function(x){
    1 / x / (1 - x)
}

p_hat <- s / (s + f)
## sqrt(p_hat * (1 - p_hat) / (s + f))
## is the standard error of p_hat
## error1 is the standard error of logit(p_hat)
error1 <- dlogit(p_hat) * sqrt(p_hat * (1 - p_hat) / (s + f))
## divide the estimation by the standard error, you get z-score
z_scores1 <- logit(p_hat) / error1
p_value1 <- 2 * pnorm(-abs(z_scores1))


The first thing you need to know is the rationale behind the standard error, z-score, p-value and so on. In statistics, we start from a model (in this case the binomial model s | (s + f) ~ Binomial(s + f, p)) and we want to use it to fit the available data and

1) obtain an estimate (of p in this case);

2) since the data are randomly generated, quantify how good that estimate is. This is where the standard error, z-score and p-value come in: they "measure the randomness in the estimate". And here is the important "trick": since we do not know the true mechanism that generated the data, we can only approximate the randomness in our estimate by assuming that

a) our model (or something close to it) is the true data-generating mechanism, and

b) the real parameter is close to our estimate. This usually requires a large sample size; here the sample size for each site is s + f, so s + f must be large enough for the conclusions (standard error, z-score and p-value) to mean anything. And we can see that for cases i = 1, 6 and 7 the sample size is really small, which produces standard errors, z-scores and p-values that cannot be trusted.
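
For a quick look at how small those samples really are, something like this can be run (the comment reflects these particular counts):

n <- s + f   # per-site sample size that the normal approximation relies on
data.frame(site = 1:7, n = n)
## sites 1, 6 and 7 have n = 3, 1 and 1 -- far too small for the
## approximations behind the standard errors, z-scores and p-values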



Now let me talk about the technical details of my calculation and what the numbers mean. In glm, in addition to the model Binomial(n, p), you are also assuming a model for p, namely:

logit(p) ~ N(mu, sigma^2)

The logit function here is the same as the one in my code.
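
As a quick sanity check (site 5 is picked here only because neither of its counts is zero), the intercept that glm reports is exactly logit(p_hat) for that site:

i <- 5                              # s = 2, f = 50
m <- glm(cbind(s[i], f[i]) ~ 1, family = 'binomial')
coef(m)                             # intercept on the logit scale, about -3.22
logit(s[i] / (s[i] + f[i]))         # log(2 / 50), the same value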

In this simple case, the estimate of the binomial probability p is simply p_hat <- s / (s + f) (whether glm is used or not), and from the variance formula for a binomial variable, the variance of the estimated probability p is p * (1 - p) / n. If, by assumption b), we take p_hat <- s / (s + f) to be close to the real p and plug it in for p, we get the standard error of the estimated p. Following the CLT and the delta method, when the sample size is large enough we can treat s / (s + f), or logit(s / (s + f)), as normally distributed: s / (s + f) is approximately N(p, s * f / (s + f) ^ 3) (note that s * f / (s + f) ^ 3 is just p_hat * (1 - p_hat) / (s + f)), and logit(s / (s + f)) is approximately N(logit(p), dlogit(s / (s + f)) ^ 2 * s * f / (s + f) ^ 3).
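
To see that those two forms of the variance agree, and that error1 reproduces the standard error reported by glm wherever neither count is zero (sites 1, 4 and 5 with these data), a quick check looks like this:

## the two ways of writing the variance of p_hat are identical
all.equal(p_hat * (1 - p_hat) / (s + f), s * f / (s + f)^3)
## and the hand-computed standard error on the logit scale matches glm's
cbind(glm_error = error, by_hand = error1)[c(1, 4, 5), ]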

In plain terms, the standard errors, z-scores and p-values that glm calculates are the standard errors, z-scores and p-values of logit(s / (s + f)). They are valid results for the null hypothesis logit(p) = 0, in other words p = 0.5. So the z-scores and p-values obtained from glm test whether s and f occur with equal probability, provided the sample size s + f is large.
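
As a cross-check on that interpretation, the same null hypothesis p = 0.5 can also be tested exactly with binom.test; this is not the Wald approximation that glm uses, but for the sites with a reasonable sample size the two should agree qualitatively:

exact_p <- sapply(1:7, function(i) binom.test(s[i], s[i] + f[i], p = 0.5)$p.value)
round(cbind(glm_p = p_value, exact_p = exact_p), 4)
## sites 4 and 5 are clearly significant under both tests; the zero-count
## sites are where the glm numbers break down, as discussed below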

Now for the extreme values caused by 0. When s or f equals 0, the estimated probability of f or s, respectively, equals 1; if that were literally true, the data-generating mechanism would not be random at all! At the beginning I said that we use our estimate to approximate the randomness in our estimate; when s or f equals 0, if we take our estimate as the ground truth, we have to pretend our estimate is 100% certain, which is rather absurd. In such cases many methods, glm included, are not valid. Generally speaking, if the sample size s + f is large enough, we can still conclude that the probability of s (or of f) is really small when s = 0 (or f = 0), but if the sample size is really small, as in cases 6 and 7, we cannot draw any conclusion at all.
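
If a finite uncertainty statement is still wanted in the zero cases, one common fallback (not something the glm output above gives you) is an exact binomial confidence interval, for example for site 3, where s = 0 and f = 20:

binom.test(s[3], s[3] + f[3])$conf.int
## roughly (0, 0.17): the data are consistent with a small but non-zero
## survival probability, whereas the glm standard error simply blows up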

In sum, if the binomial model is true, then from the glm results, my code and the analysis above, we can say that in cases i = 2, 3, 4, 5 the probabilities of s and f are significantly different from each other.
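
To make that concrete at the conventional 5% level, the exact test from the sketch above flags exactly those four cases, while the raw glm p-values only flag the two sites with no zero count:

which(sapply(1:7, function(i) binom.test(s[i], s[i] + f[i], p = 0.5)$p.value) < 0.05)
## 2 3 4 5 -- matching the conclusion
which(p_value < 0.05)
## 4 5 -- glm misses cases 2 and 3 because of the zeros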
