Perform Multiple Survival Analysis with a Loop in R

I recently worked on a survival analysis in R. I have two data frames: geneDf for gene expression and survDf for follow-up. Samples of both are shown below:

#Data frame: geneDf
geneID=c("EGFR","Her2","E2F1","PTEN")
patient1=c(12,23,56,23)
patient2=c(23,34,11,6)
patient3=c(56,44,32,45)
patient4=c(23,64,45,23)
geneDf=data.frame(patient1,patient2,patient3,patient4,geneID)
> geneDf
  patient1 patient2 patient3 patient4 geneID
1       12       23       56       23   EGFR
2       23       34       44       64   Her2
3       56       11       32       45   E2F1
4       23        6       45       23   PTEN
#Data frame:survDf
ID=c("patient1","patient2","patient3","patient4")
time=c(23,7,34,56)
status=c(1,0,1,1)
survDf=data.frame(ID,time,status)
> survDf
        ID time status
1 patient1   23      1
2 patient2    7      0
3 patient3   34      1
4 patient4   56      1

      

I extract the expression data for a particular gene from geneDf, use the median expression as a cut-off, perform survival analysis with the survival package, and obtain the p-value from survdiff. In the following code, I use the "EGFR" gene as an example.

#extract expression of a certain gene
targetGene<-subset(geneDf,grepl("EGFR",geneDf$geneID))
targetGene$geneID<-NULL
#Transpose the table and adjust its format
targetGene<-t(targetGene[,1:ncol(targetGene)])
targetGene<-data.frame(as.factor(rownames(targetGene)),targetGene)
colnames(targetGene)<-c("ID","Expression")
rownames(targetGene)<-NULL
targetGene$Expression1<-targetGene$Expression
targetGene$Expression1[targetGene$Expression<median(targetGene$Expression)]<-1
targetGene$Expression1[ targetGene$Expression>=median( targetGene$Expression)]<-2
#Survival analysis
library(survival)
##Add survival object
survDf$SurvObj<-with(survDf, Surv(time,status==1))
## Kaplan-Meier estimator by expression group
km<-survfit(SurvObj~targetGene$Expression1, data=survDf, conf.type = "log-log")
sdf<-survdiff(Surv(time, status) ~targetGene$Expression1, data=survDf)
#get the p-value
p.val <-1-pchisq(sdf$chisq, length(sdf$n) - 1)
> p.val
[1] 0.1572992

      

I can do this for different genes one by one. But there are over 10,000 genes to analyze. I want to get the p-values for all of them and put them in a new data frame. Do I need to use a loop or apply?



1 answer


This is an ugly script, but it works.

In Data10, the first column must hold the time, the second the status, and the remaining columns any treatments (grouping variables) you want to test, with patients as row names.



library(plyr)
library(survival)

loopsurff <- function(Data10) {
  # One column per variable to test; row 3 holds that variable's column index
  combos <- rbind(rep(1, ncol(Data10) - 2),
                  rep(2, ncol(Data10) - 2),
                  3:ncol(Data10))
  vv <- adply(combos, 2, function(x) {
    # Log-rank test: survival (columns 1 and 2) against column x[3]
    fit <- survdiff(Surv(Data10[, 1], Data10[, 2]) ~ Data10[, x[3]], data = Data10)
    # Degrees of freedom = number of groups - 1
    p <- 1 - pchisq(fit$chisq, length(fit$n) - 1)
    data.frame("var1" = colnames(Data10)[x[3]],
               "p.value" = as.numeric(sprintf("%.3f", p)))
  })
  vv
}

      

You will get a data frame with the column names of yourdata[, 3:ncol(yourdata)] and a p-value for each one.
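
For the data in the question, the same idea can also be applied directly to geneDf and survDf. What follows is only a sketch under some assumptions: patients are matched by ID, each gene is split at its median exactly as in the question, and the helper name gene_pvalue is made up for illustration. A gene whose patients all land in one group returns NA instead of stopping the loop.

```r
library(survival)

# Sample data from the question
geneDf <- data.frame(
  patient1 = c(12, 23, 56, 23),
  patient2 = c(23, 34, 11, 6),
  patient3 = c(56, 44, 32, 45),
  patient4 = c(23, 64, 45, 23),
  geneID   = c("EGFR", "Her2", "E2F1", "PTEN")
)
survDf <- data.frame(
  ID     = c("patient1", "patient2", "patient3", "patient4"),
  time   = c(23, 7, 34, 56),
  status = c(1, 0, 1, 1)
)

# Expression matrix: genes in rows, patients in columns ordered as in survDf
expr <- as.matrix(geneDf[, as.character(survDf$ID)])
rownames(expr) <- geneDf$geneID

# Hypothetical helper: log-rank p-value for one gene split at its median
gene_pvalue <- function(x, time, status) {
  group <- ifelse(x < median(x), 1, 2)
  if (length(unique(group)) < 2) return(NA_real_)  # no split possible
  sdf <- survdiff(Surv(time, status) ~ group)
  1 - pchisq(sdf$chisq, length(sdf$n) - 1)
}

# Apply the helper to every gene (row) and collect the p-values
p <- apply(expr, 1, gene_pvalue, time = survDf$time, status = survDf$status)
result <- data.frame(geneID = names(p), p.value = unname(p))
result
```

For EGFR this reproduces the p-value of about 0.157 shown in the question. With 10,000 genes you would probably also want to correct for multiple testing, for example with p.adjust.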
