Tough loop in R?
I have been struggling for days to solve this problem in R (I am a former SAS user).
Setting / study:
- Observational data on Crohn's disease patients, collected annually during 2002-2013.
- Patients can be enrolled in any year, and their visits may be irregular from year to year.
- I know the exact day of death for each patient. Variable: DEATH_YEAR
- I know the exact day of the relapse (the endpoint of interest). Variable: RELAPSE_YEAR
I am interested in the incidence of relapse, so I need to calculate the number of relapses each year divided by the number of people alive that year. The problem is that patients enter the study at irregular times, but for each year I do know whether they are alive and whether they have relapsed.
I could solve this if I could create 12 new variables for each patient, one per calendar year, each set to "1" if the patient is alive that year and has not yet experienced the event.
So I need to create "year variables" that are set to "1" for the year of inclusion and for every year after that, for as long as the person has neither died nor experienced the event.
Example: Patient X was included in 2005 and died in 2009. For him I need the variables "2005", "2006", "2007", "2008" and "2009" set to "1". Patient Y was included in 2005 and experienced the event in 2007. For him I need the variables "2005", "2006" and "2007" set to "1". (Yes, the year of the event / death should still be set to "1".)
This is what my dataset looks like:
data <- read.table(header = TRUE, text = "
patient visit first_visit relapse_year death_year
1 2003 2003 . 2010
1 2004 2003 . 2010
1 2009 2003 . 2010
2 2002 2002 2006 .
2 2006 2002 2006 .
2 2006 2002 2006 .
2 2008 2002 2006 .
2 2012 2002 2006 .
3 2004 2004 . .
3 2008 2004 . .
3 2008 2004 . .
")
Here is the DESIRED dataset:
desired_data <- read.table(header = TRUE, text = "
patient visit first_visit relapse_year death_year YEAR2002 YEAR2003 YEAR2004 YEAR2005 YEAR2006 YEAR2007 YEAR2008 YEAR2009 YEAR2010 YEAR2011 YEAR2012
1 2003 2003 . 2010 . 1 1 1 1 1 1 1 1 . .
1 2004 2003 . 2010 . 1 1 1 1 1 1 1 1 . .
1 2009 2003 . 2010 . 1 1 1 1 1 1 1 1 . .
2 2002 2002 2006 . 1 1 1 1 1 . . . . . .
2 2006 2002 2006 . 1 1 1 1 1 . . . . . .
2 2006 2002 2006 . 1 1 1 1 1 . . . . . .
2 2008 2002 2006 . 1 1 1 1 1 . . . . . .
2 2012 2002 2006 . 1 1 1 1 1 . . . . . .
3 2004 2004 . . . . 1 1 1 1 1 1 1 1 1
3 2008 2004 . . . . 1 1 1 1 1 1 1 1 1
3 2008 2004 . . . . 1 1 1 1 1 1 1 1 1
")
I would be extremely grateful for any advice on this matter! Thanks in advance!
It's a bit hacky, but it will work. First convert your data to a numeric data frame, so that each . turns into NA:
data0<-data.frame(lapply(data,function(x) as.numeric(as.character(x))))
head(data0)
# patient visit first_visit relapse_year death_year
# 1 1 2003 2003 NA 2010
# 2 1 2004 2003 NA 2010
# 3 1 2009 2003 NA 2010
# 4 2 2002 2002 2006 NA
# 5 2 2006 2002 2006 NA
# 6 2 2006 2002 2006 NA
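If you control the import step, a possibly simpler route is to tell read.table that "." marks a missing value through its na.strings argument, so no conversion is needed afterwards. A minimal sketch on a shortened copy of the question's data (the name data_na is only for illustration):

```r
# Shortened copy of the question's data; "." marks a missing year
data_na <- read.table(header = TRUE, na.strings = ".", text = "
patient visit first_visit relapse_year death_year
1 2003 2003 . 2010
2 2002 2002 2006 .
3 2004 2004 . .
")

# The year columns are now numeric, with NA wherever the text had "."
str(data_na)
is.na(data_na$relapse_year)
```

This also avoids the coercion warnings that as.numeric() gives when it meets ".".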
Then replace the NA values with 2012 (or whatever the last year was).
data0[is.na(data0)]<-2012
Now you can use pmin to determine the last year before the patient dies / relapses / the study ends. The last thing to do is use arithmetic on column numbers to create a new dataset:
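# One row per row of data0, one column per calendar year
activeYears <- matrix(0, nrow(data0), 11)
colnames(activeYears) <- 2002:2012
# Expand start and end years to one value per matrix cell
startYear <- data0$first_visit[row(activeYears)]
endYear <- pmin(data0$relapse_year[row(activeYears)], data0$death_year[row(activeYears)])
# Column j corresponds to calendar year 2001 + j
colYear <- col(activeYears) + 2001
# 1 where the year lies between inclusion and relapse/death (inclusive), else 0
activeYears[] <- startYear <= colYear & endYear >= colYear
activeYears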
# 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
# [1,] 0 1 1 1 1 1 1 1 1 0 0
# [2,] 0 1 1 1 1 1 1 1 1 0 0
# [3,] 0 1 1 1 1 1 1 1 1 0 0
# [4,] 1 1 1 1 1 0 0 0 0 0 0
# [5,] 1 1 1 1 1 0 0 0 0 0 0
# [6,] 1 1 1 1 1 0 0 0 0 0 0
# [7,] 1 1 1 1 1 0 0 0 0 0 0
# [8,] 1 1 1 1 1 0 0 0 0 0 0
# [9,] 0 0 1 1 1 1 1 1 1 1 1
#[10,] 0 0 1 1 1 1 1 1 1 1 1
#[11,] 0 0 1 1 1 1 1 1 1 1 1
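To get from these flags to the incidence the question asks about, one possible continuation is sketched below. It is self-contained and rests on a few assumptions that are mine, not the asker's: the study window is 2002-2012 as above, each patient is counted once per year rather than once per visit, and the denominator is the number of enrolled patients who are still alive and event-free. The names pat, at_risk, incidence and desired are illustrative.

```r
# Rebuild the question's data, reading "." as NA
data <- read.table(header = TRUE, na.strings = ".", text = "
patient visit first_visit relapse_year death_year
1 2003 2003 . 2010
1 2004 2003 . 2010
1 2009 2003 . 2010
2 2002 2002 2006 .
2 2006 2002 2006 .
2 2006 2002 2006 .
2 2008 2002 2006 .
2 2012 2002 2006 .
3 2004 2004 . .
3 2008 2004 . .
3 2008 2004 . .
")

years <- 2002:2012  # assumed study window, as in the answer above

# One row per patient, so nobody is counted once per visit
pat <- data[!duplicated(data$patient), ]

# Last year at risk: relapse, death, or end of window, whichever comes first
end_year <- pmin(pat$relapse_year, pat$death_year, max(years), na.rm = TRUE)

# patients x years matrix: TRUE while a patient is enrolled and event-free
at_risk <- outer(pat$first_visit, years, "<=") & outer(end_year, years, ">=")
colnames(at_risk) <- paste0("YEAR", years)

# Per-year counts and crude incidence (relapses / patients at risk)
n_at_risk <- colSums(at_risk)
n_relapse <- table(factor(pat$relapse_year, levels = years))
incidence <- as.vector(n_relapse) / n_at_risk

# Wide format as in the question: 1 when at risk, NA otherwise
flags <- ifelse(at_risk[match(data$patient, pat$patient), ], 1, NA)
desired <- cbind(data, flags)
```

Using pmin(..., na.rm = TRUE) here stands in for the step that overwrites NA with 2012, so the original relapse_year and death_year columns keep their missing values.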