How to Resolve the 95% Confidence Interval Mismatch in the Survival Function in R
I am about to write some functions to extract information from the survival analysis results, and I ran into a discrepancy between the extraction of the lower and upper survival times as defined by the 95% confidence interval and what is reported from the package itself as a summary.
I am using package survival
(v 2.37-7) in R (v 3.1.2).
So my problem is that sometimes my extraction of the lower and / or upper bounds of the 95% CI for the median survival time does not match what is returned when I simply evaluate the results survfit
. When I check the data I think the results are survfit
wrong, it seems to return a border value + 1 (again, sometimes). Here is some data that illustrates the problem.
# Fit my data stratified by gender of subject
survFit30Sex <- survfit(Surv(thirtyDaySuicides$daysFromInvestigation) ~ thirtyDaySuicides$Sex)
# Display median survival and confidence interval
survFit30Sex
Call: survfit(formula = Surv(thirtyDaySuicides$daysFromInvestigation) ~
thirtyDaySuicides$Sex)
records n.max n.start events median 0.95LCL 0.95UCL
thirtyDaySuicides$Sex=1 35 35 35 35 15 9 20
thirtyDaySuicides$Sex=2 93 93 93 93 9 6 13
survfit
defines the lower and upper bounds for Sex = 1
as 9 days and 20 days respectively, but when I check the data it seems like the upper bound should be 19 and not 20
Here is the actual data; I am just showing Sex=1
, since this is a mismatch, I have also stripped out the values long before and after the critical region to make the data easier to read
Call: survfit(formula = Surv(thirtyDaySuicides$daysFromInvestigation) ~
thirtyDaySuicides$Sex)
summary( thirtyDaySuicides$Sex=1 )
time n.risk n.event survival std.err lower 95% CI upper 95% CI
9 24 2 0.6286 0.0817 0.48725 0.811
10 22 1 0.6000 0.0828 0.45780 0.786
11 21 1 0.5714 0.0836 0.42890 0.761
13 20 1 0.5429 0.0842 0.40055 0.736
14 19 1 0.5143 0.0845 0.37272 0.710
15 18 1 0.4857 0.0845 0.34541 0.683
16 17 1 0.4571 0.0842 0.31861 0.656
17 16 3 0.3714 0.0817 0.24138 0.572
19 13 1 0.3429 0.0802 0.21673 0.542
20 12 2 0.2857 0.0764 0.16921 0.482
21 10 2 0.2286 0.0710 0.12437 0.420
22 8 1 0.2000 0.0676 0.10310 0.388
As I understand it, the lower 95% CI for median survival time is 0.34541. Searching down the survival column until a value <0.34541 is found occurs in the series associated with survival time 19 (survival = 0.3429). Isn't this the upper limit? Why survfit
Top 20 Survival Time Returns? I have automated this algorithm, and most of the time I compare the outcome to survival, but not always.
This makes me think that there is either some strange bug in the package survival
(which I doubt) or I am finding the border wrong (most likely).
--------- Update
Unfortunately I don't know how to link the data file to my question, but the data is pretty short, so I can put it here. Note that I have eliminated stratification by Sex for simplicity, so this is just the data for women where I get the discrepancy.
It seems to me that I was approaching this the wrong way, perhaps 95% of the CI is being calculated from the standard error rather than looking how I think about it. But even with this idea, I have similar problems. More generally, the question is: how do you pull the Xth percentile survival time with its 95% CI in time units from the survivor object?
Here's the survival data via dput and then an unstructured copy below that.
structure(list(daysFromInvestigation = c(27L, 27L, 10L, 20L,
15L, 21L, 27L, 1L, 9L, 22L, 29L, 14L, 4L, 19L, 7L, 3L, 2L, 7L,
21L, 4L, 17L, 20L, 16L, 2L, 9L, 7L, 17L, 2L, 17L, 26L, 25L, 11L,
3L, 13L, 27L), censored = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1)), class = "data.frame", row.names = c(NA, -35L), .Names = c("daysFromInvestigation",
"censored"))
daysFromInvestigation censored
1 27 1
2 27 1
3 10 1
4 20 1
5 15 1
6 21 1
7 27 1
8 1 1
9 9 1
10 22 1
11 29 1
12 14 1
13 4 1
14 19 1
15 7 1
16 3 1
17 2 1
18 7 1
19 21 1
20 4 1
21 17 1
22 20 1
23 16 1
24 2 1
25 9 1
26 7 1
27 17 1
28 2 1
29 17 1
30 26 1
31 25 1
32 11 1
33 3 1
34 13 1
35 27 1
source to share
I have an answer to my question, at least a good rough answer, if not a better answer.
The main problem I ran into was not to use a weighted average. In my question, I was interested in the average survival time, so survival = 0.5. But my data did not lead to events at the exact median time, and therefore I had a survival probability of 14 days = 0.5143 and 15 days = 0.4857, which weighted averaged up to 15 days.
The second problem was a misunderstanding of the use of confidence intervals. To match what is reported in the survival report, to find the lower end of the median survival interval, one looks for the lower anchor vector to find the first value that is less than the median, and then calculates the time weighted average for the value just below the median and slightly above. Likewise, for the upper bound, the upper bound vector is searched to find the target interval, and then the weighted average is calculated. For my example, the upper bound for median survival occurs between 19 days and 20 days. Weighted average rounds up to 20 days.
I haven't traced the survival code deeply to confirm this is correct, but in my case I have about 50 specific survival combinations looking at different time periods and different moderators, and I agree the median output provided by the 100% survival package.
I hope that someone who comes across this question is helped by this summary, and if anyone wants to help fix / improve my understanding it is very much appreciated.
source to share