Why do mean() and mean(aggregate()) return different results?

I want to calculate the average. Here is some sample data:

# sample data
Nr <- c(1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
dph <- c(3.125000, 6.694737, 4.310680, 11.693735, 103.882353, 11.000000, 7.333333, 20.352941, 5.230769, NA, 4.615385, 47.555556, 2.941176, 18.956522, 44.320000, 28.500000, NA, 10.470588, 19.000000, 25.818182, 43.216783, 51.555556, 8.375000, 6.917647, 9.375000, 5.647059, 4.533333, 27.428571, 14.428571, NA, 1.600000, 5.764706, 4.705882, 55.272727, 2.117647, 30.888889, 41.222222, 23.444444, 2.428571, 6.200000, 17.076923, 21.280000, 40.829268, 14.500000, 6.250000, NA, 15.040000, 5.687204, 2.400000, NA, 26.375000, 18.064516, 4.000000, 6.139535, 8.470588, 128.666667, 2.235294, 34.181818, 116.000000, 6.000000, 5.777778, 10.666667, 15.428571, 54.823529, 81.315789, 42.333333)
dat <- data.frame(Nr = Nr, dph = dph)

# calculate mean directly
mean(dat$dph, na.rm = TRUE)
[1] 23.02403

# aggregate first, then calculate mean
mean(aggregate(dph ~ Nr, dat, mean, na.rm = TRUE)$dph)
[1] 22.11743

# 23.02403 != 22.11743

      

Why am I getting two different results?


Background for the question:

I need to run a Wilcoxon test comparing pre-baseline with post-baseline values. Pre has 3 measurements per patient, post has 16. Since the Wilcoxon test needs two vectors of equal length, I calculate the per-patient means for pre and post with aggregate, which gives two vectors of equal length. The data above is the pre data.

Edit:

Patient no. 4 has been removed from the data, but using Nr <- rep(1:22, 3) instead returns the same results.


1 answer


I think this is because in the version mean(dat$dph, na.rm = TRUE) each removed NA decreases the number of observations by 1. If you aggregate first, the NA in row 10 (ID 11) is removed as well, but since the other rows with ID 11 are not NA, ID 11 still contributes one group mean. The number of values (unique IDs) that go into the final mean therefore does not decrease by 1 for each NA. So the difference, IMO, comes from the weighting: the direct mean gives every non-NA observation the same weight, whereas the mean of the aggregated means gives every ID the same weight, no matter how many non-NA observations that ID has left.
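To make the weighting argument concrete, here is a minimal sketch on made-up toy data (the data frame `toy` and its columns `id` and `x` are assumptions for illustration, not the question's data). It suggests that weighting each group mean by its number of non-NA observations brings you back to the direct mean:

```r
# Toy data (hypothetical): two IDs with three rows each, one NA in ID 2
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  x  = c(1, 2, 3, 10, NA, 20))

# Direct mean: every non-NA observation has the same weight
direct <- mean(toy$x, na.rm = TRUE)        # (1 + 2 + 3 + 10 + 20) / 5 = 7.2

# Mean of group means: every ID has the same weight
# (the formula interface of aggregate() drops NA rows by default)
grp <- aggregate(x ~ id, toy, mean)        # group means: 2 and 15
of_means <- mean(grp$x)                    # (2 + 15) / 2 = 8.5

# Weighting each group mean by its non-NA count gives the direct mean again
n <- aggregate(x ~ id, toy, length)$x      # counts: 3 and 2
reweighted <- weighted.mean(grp$x, n)      # (2 * 3 + 15 * 2) / 5 = 7.2

print(c(direct = direct, of_means = of_means, reweighted = reweighted))
```

Which of the two numbers is the "right" one depends on whether each observation or each ID should count equally in your analysis.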

You can check this by replacing the NA values with 0 and calculating the mean both ways: both versions will then return the same result.
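A minimal sketch of that check, again on hypothetical toy data rather than the question's data (`toy`, `toy0`, `id` and `x` are made-up names):

```r
# Toy data (hypothetical): two IDs with three rows each, one NA in ID 2
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  x  = c(1, 2, 3, 10, NA, 20))

# Replace NA with 0 so that every ID keeps all three observations
toy0 <- toy
toy0$x[is.na(toy0$x)] <- 0

direct0   <- mean(toy0$x)                             # 36 / 6 = 6
of_means0 <- mean(aggregate(x ~ id, toy0, mean)$x)    # (2 + 10) / 2 = 6

print(isTRUE(all.equal(direct0, of_means0)))          # both versions agree
```

This is only meant as a diagnostic. The zeros pull the mean down, so the resulting value is not a sensible estimate of the average itself.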



Note, however, that this check only works here because every ID has the same number of observations (3 in this case). If the IDs had different numbers of observations, you would get different results again.
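As a small illustration of that caveat, the following sketch uses hypothetical toy data without any NA but with unequal group sizes (`toy2`, `id` and `x` are made-up names):

```r
# Toy data (hypothetical): ID 1 has three observations, ID 2 has only one
toy2 <- data.frame(id = c(1, 1, 1, 2),
                   x  = c(1, 2, 3, 10))

direct2   <- mean(toy2$x)                             # (1 + 2 + 3 + 10) / 4 = 4
of_means2 <- mean(aggregate(x ~ id, toy2, mean)$x)    # (2 + 10) / 2 = 6

print(c(direct = direct2, of_means = of_means2))      # the two versions differ
```

Here the single observation of ID 2 carries half of the total weight after aggregation, but only a quarter of it in the direct mean.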
