Statistical analysis of server logs - correct extrapolation

We had an ISP outage for about 10 minutes one day, which unfortunately happened during an exam that was being recorded from multiple locations.

As a result, we lost the postback data from the page each candidate had open at the time.

I can reconstruct the stream of events from the server log. However, out of 317 candidates, 175 used a local proxy, so they all appear to come from the same IP address. I analyzed the data for the remaining 142 (45%) and got some solid numbers on what happened to them.

Question: How valid is it to multiply all my numbers by 317/142 to estimate the likely results for the whole set? And what will my confidence (or uncertainty) region be?

Please, no guesswork. I need someone who hasn't fallen asleep in statistics class to answer.

EDIT: By "numbers" I mean counts of affected candidates. For example, 5/142 showed signs of a browser crash during the session. How valid is it to extrapolate that to 11/317 with browser errors?

1 answer


I don't know exactly what measurements we are talking about, but for now let's assume you want something like a mean score. No adjustment is required to estimate the average score of the population (317 candidates): just use the average of the sample (the 142 whose data you analyzed).

To find the region of uncertainty, you can use the formula given in the NIST Statistics Handbook. First you must decide how much uncertainty you are willing to accept. Suppose you want 95% confidence that the true population mean falls within an interval. Then the confidence interval for the true population mean is:

(sample mean) +/- 1.960 * (sample standard deviation) / sqrt(sample size)
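For example, a minimal Python sketch of that formula (the scores list here is a hypothetical placeholder; substitute whatever per-candidate measurement you actually have):

    import math
    import statistics

    def mean_ci_95(sample):
        """Normal-approximation 95% confidence interval for the population mean."""
        m = statistics.mean(sample)
        s = statistics.stdev(sample)               # sample standard deviation
        half_width = 1.960 * s / math.sqrt(len(sample))
        return m - half_width, m + half_width

    scores = [71.0, 83.5, 64.0, 90.0]              # placeholder data, not real
    print(mean_ci_95(scores))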

There is an additional correction you can make to take credit for having sampled a large fraction of the population: the finite population correction. It would tighten the confidence interval by about a quarter, but the calculation above already rests on several assumptions, so giving up its conservatism may not be wise. One assumption is that the estimates are approximately normally distributed. Another is that the sample is representative of the population. You mentioned that the missing data all come from candidates using the same proxy; the subset of the population behind that proxy could be very different from the rest.
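As a quick check of that "about a quarter" figure, here is the standard finite population correction factor for these sample and population sizes:

    import math

    N, n = 317, 142                      # population and sample sizes
    fpc = math.sqrt((N - n) / (N - 1))   # finite population correction factor
    print(fpc)                           # ~0.744, i.e. the interval shrinks by roughly 26%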



EDIT: Since we are talking about the proportion of a sample with some attribute, e.g. "browser crashed", things are a little different. We need a confidence interval for a proportion, which we can convert to a number of successes by multiplying by the population size. This means our best estimate of the number of crashed browsers is 5 * 317/142 ≈ 11, as you assumed.

If we once again ignore the fact that our sample is nearly half the population, we can use the Wilson confidence interval for a proportion. A calculator is available online to handle the formula for you. The output of the calculator (and of the formula) is the lower and upper limits for the proportion in the population. To get a range for the number of crashes, multiply the lower and upper limits by (population size - sample size) and add the number of crashes observed in the sample. We could instead multiply the limits by the whole population size, but that would ignore what we already know about the sampled candidates.

Using the above procedure, we get a 95% CI of 7.6 to 19.0 total browser crashes in a population of 317, based on 5 crashes among 142 sample points.
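If you would rather script it than use an online calculator, here is a small Python sketch of the whole procedure, with the Wilson score interval written out directly (packages such as statsmodels offer equivalent functions):

    import math

    def wilson_ci_95(successes, n, z=1.960):
        """95% Wilson score interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - half, centre + half

    N, n, crashes = 317, 142, 5
    lo, hi = wilson_ci_95(crashes, n)
    unsampled = N - n                     # the 175 candidates behind the proxy
    # Scale the proportion interval to the unsampled candidates, then add
    # the 5 crashes already observed directly in the sample.
    print(crashes + lo * unsampled, crashes + hi * unsampled)   # ~7.6 and ~19.0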
