R equivalent of Stata for-loop over local macro list stubnames

I am a Stata user who is switching to R, and there is one Stata crutch that I am having a hard time giving up because I don’t know how to do the equivalent with R "apply".

In Stata, I often create a local macro holding a list of stubnames and then loop over that list, calling variables whose names are built from those stubnames.

For a simple example, let's say I have the following dataset:

study_id year varX06 varX07 varX08 varY06 varY07 varY08
   1       6   50     40     30     20.5  19.8   17.4
   1       7   50     40     30     20.5  19.8   17.4
   1       8   50     40     30     20.5  19.8   17.4
   2       6   60     55     44     25.1  25.2   25.3
   2       7   60     55     44     25.1  25.2   25.3
   2       8   60     55     44     25.1  25.2   25.3 
   and so on...

      

I want to generate two new variables, varX and varY, that take the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8. The final dataset should look like this:

study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
   1       6   50     40     30     20.5  19.8   17.4    50  20.5
   1       7   50     40     30     20.5  19.8   17.4    40  19.8
   1       8   50     40     30     20.5  19.8   17.4    30  17.4 
   2       6   60     55     44     25.1  25.2   25.3    60  25.1
   2       7   60     55     44     25.1  25.2   25.3    55  25.2
   2       8   60     55     44     25.1  25.2   25.3    44  25.3 
   and so on...

      

To be clear, I know I can do this with the melt and reshape commands, essentially converting the data from wide to long format, but I don't want to resort to that. That is not the point of my question. My question is about how to loop over a local macro list of stubnames in R, and I'm just using this simple example to illustrate a more general dilemma.

In Stata, I could create a local macro list of stubnames:

local stub varX varY

      

And then loop over the macro list. For each stub, I can generate a new variable varX or varY and replace its value with the value of varX06 or varY06 (respectively) if year is 6, and so on.

foreach i of local stub {

display "`i'"  
gen `i'=.      
replace `i'=`i'06 if year==6  
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}

      

The last section is the part I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06", and then returns the value of the variable varX06. Likewise, when I write `i', Stata returns the string "varX", not the string "i". How can I do this in R?

I've searched through Muenchen's "R for Stata Users", searched the web, and looked at previous posts here on StackOverflow, but couldn't find an R solution. I'm sorry if this question is elementary. If it has already been answered, please direct me to the answer.

Thanks in advance, Tara



4 answers


Well, here's one way. Columns in R data frames can be accessed by name using character strings, so this will work:

# create sample dataset
set.seed(1)    # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)),   #categorical variable
                 varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
                 varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))

# you start here...
years   <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))

print(head(df),digits=4)
#   year  varX06  varX07  varX08   varY06  varY07  varY08    varX     varY
# 1    6 -0.6265  0.8937 -0.3411 -0.70757  1.1350  0.3412 -0.6265 -0.70757
# 2    6  0.1836 -1.0473  1.5024  1.97157  1.1119  1.3162  0.1836  1.97157
# 3    6 -0.8356  1.9713  0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4    6  1.5953 -0.3836  0.5422 -0.01402  0.2107 -1.2056  1.5953 -0.01402
# 5    6  0.3295  1.6541 -0.1367 -1.12346  0.0694  1.5676  0.3295 -1.12346
# 6    6 -0.8205  1.5122 -1.1367 -1.34413 -1.6626  0.2253 -0.8205 -1.34413

      



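For a given yr, the anonymous function retrieves the rows with that year and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function to each year, and unlist(...) converts the returned list to a vector. Note that the result comes back ordered by year, so this assumes the rows of df are already sorted by year, as they are in the sample data.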
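The string-to-column step can also be looked at on its own. The following is only a sketch for illustration: the three-row data frame d and the names i, col and rows are made up here, with i standing in for the Stata local macro.

```r
# Hypothetical three-row data frame, for illustration only
d <- data.frame(year   = c(6, 7, 8),
                varX06 = c(50, 50, 50),
                varX07 = c(40, 40, 40),
                varX08 = c(30, 30, 30))

i   <- "varX"             # plays the role of the Stata local `i'
col <- paste0(i, "06")    # builds the string "varX06", like `i'06 in Stata

d[[col]]                  # the varX06 column, looked up by its name

d[[i]] <- NA_real_                # roughly: gen `i' = .
rows <- d$year == 6               # logical vector marking the year-6 rows
d[[i]][rows] <- d[[col]][rows]    # roughly: replace `i' = `i'06 if year == 6
```

Here d[[i]] refers to the column whose name is stored in i, whereas d$i would look for a column literally called "i".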



This method reorders your data, but it is a one-liner, which may or may not be better for you (assuming that d is your data frame):

> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
    study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1        1    6     50     40     30   20.5   19.8   17.4 20.5   50
6.4        2    6     60     55     44   25.1   25.2   25.3 25.1   60
7.2        1    7     50     40     30   20.5   19.8   17.4 19.8   40
7.5        2    7     60     55     44   25.1   25.2   25.3 25.2   55
8.3        1    8     50     40     30   20.5   19.8   17.4 17.4   30
8.6        2    8     60     55     44   25.1   25.2   25.3 25.3   44

      

Essentially, it splits the data by year, then uses within to create the variables varX and varY in each subset, and then rbinds the subsets back together.



A direct translation of your Stata code, however, would be something like this:

# ifelse() is vectorized over rows, so each assignment handles every year at once;
# no loop over the unique years is needed
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))

      



Here's another option.

Create a "column selection matrix" based on year, and then use it to grab values from each group of columns.

# indexing matrix based on the 'year' column
col_select_mat <- 
    t(sapply(your_df$year, function(x) unique(your_df$year) == x))

# make selections from col groups by stub name
sapply(c('varX', 'varY'), 
    function(x) your_df[, grep(x, names(your_df))][col_select_mat])

      

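This gives the desired values (which you can bind to your_df if you like). Note that a logical matrix index is read column by column, so the values come back ordered by year rather than in the original row order; sort your_df by year before binding.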

    varX varY
[1,]   50 20.5
[2,]   60 25.1
[3,]   40 19.8
[4,]   55 25.2
[5,]   30 17.4
[6,]   44 25.3
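
If the original row order has to be kept, one possible variant is to index with (row, column) pairs instead of a logical matrix. This is a sketch rather than part of the answer above: the names years, idx, stub and cols are introduced here, and it assumes the two-digit suffixes of the column names match the values in year.

```r
# Data from the question
your_df <- read.table(header = TRUE, text =
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
   1       6   50     40     30     20.5  19.8   17.4
   1       7   50     40     30     20.5  19.8   17.4
   1       8   50     40     30     20.5  19.8   17.4
   2       6   60     55     44     25.1  25.2   25.3
   2       7   60     55     44     25.1  25.2   25.3
   2       8   60     55     44     25.1  25.2   25.3')

years <- sort(unique(your_df$year))    # 6 7 8

# One (row, column) pair per row: row r picks the column matching its year
idx <- cbind(seq_len(nrow(your_df)), match(your_df$year, years))

for (stub in c("varX", "varY")) {
  cols <- sprintf("%s%02d", stub, years)             # e.g. "varX06" "varX07" "varX08"
  your_df[[stub]] <- as.matrix(your_df[cols])[idx]   # one value per row, in row order
}
```

Because idx lists exactly one cell per row, the new columns line up with the existing rows without any sorting.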

      

OP dataset:

your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
   1       6   50     40     30     20.5  19.8   17.4
   1       7   50     40     30     20.5  19.8   17.4
   1       8   50     40     30     20.5  19.8   17.4
   2       6   60     55     44     25.1  25.2   25.3
   2       7   60     55     44     25.1  25.2   25.3
   2       8   60     55     44     25.1  25.2   25.3')

      

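Benchmarking: of the three solutions posted so far, this one appears to be the fastest on average, but the differences are very small. Note that the call below passes the function names without parentheses, so it times the lookup of each function object rather than the solutions themselves (hence the nanosecond scale); call them as arvi1000(), jlhoward() and Thomas() for a meaningful comparison.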

df <- your_df
d <- your_df

arvi1000 <- function() {
  col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
  # make selections from col groups by stub name
  cbind(your_df, 
        sapply(c('varX', 'varY'), 
               function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}

jlhoward <- function() {
  years   <- unique(df$year)
  df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
  df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}

Thomas <- function() {
  do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}

> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
     expr min lq  mean median uq  max neval
 arvi1000  37 39 43.73     40 42  380   100
 jlhoward  38 40 46.35     41 42  377   100
   Thomas  37 40 56.99     41 42 1590   100

      



Perhaps a more transparent way:

sub <- c("varX", "varY")
for (i in sub) {
 df[[i]] <- NA
 df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
 df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
 df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
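
The same idea can be extended so that the years are looped over as well, instead of being written out one line at a time. This is a sketch under a few assumptions: the data frame is rebuilt from the question's example, the names stubs, yr, rows and src are introduced here, and the column suffixes are taken to be the two-digit year.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(study_id = rep(1:2, each = 3),
                 year     = rep(6:8, times = 2),
                 varX06 = rep(c(50, 60), each = 3),
                 varX07 = rep(c(40, 55), each = 3),
                 varX08 = rep(c(30, 44), each = 3),
                 varY06 = rep(c(20.5, 25.1), each = 3),
                 varY07 = rep(c(19.8, 25.2), each = 3),
                 varY08 = rep(c(17.4, 25.3), each = 3))

stubs <- c("varX", "varY")             # counterpart of: local stub varX varY

for (i in stubs) {
  df[[i]] <- NA_real_                  # gen `i' = .
  for (yr in sort(unique(df$year))) {
    rows <- df$year == yr              # the "if year == ..." condition
    src  <- sprintf("%s%02d", i, yr)   # e.g. "varX06"
    df[[i]][rows] <- df[[src]][rows]   # replace `i' = `i'06 if year == 6, etc.
  }
}
```

Rows whose year has no matching column would make df[[src]] return NULL and the assignment fail, so the sketch assumes every year in the data has a corresponding column for each stub.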

      
