Extend data.frame to long format and increment value

I would like to convert my data from short format to long format and I suppose there is an easy way to do this (perhaps with reshape2, plyr, dplyr, etc.?).

For example, I have:

foo <- data.frame(id = 1:5, 
              y = c(0, 1, 0, 1, 0),
              time = c(2, 3, 4, 2, 3))

id y time
1  0  2
2  1  3
3  0  4
4  1  2
5  0  3

      

I would like to expand / copy each row n times, where n is the value of the row in the "time" column. However, I would also like the variable "time" to increase from 1 to n. That is, I would like to create:

id  y time
1   0   1
1   0   2
2   1   1
2   1   2
2   1   3
3   0   1
3   0   2
3   0   3
3   0   4
4   1   1
4   1   2
5   0   1
5   0   2
5   0   3

      

As a bonus, I would also like to do some sort of increment of the "y" variable, where for those IDs with y = 1, y is set to 0 to the largest "time" value. That is, I would like to create:

id  y time
1   0   1
1   0   2
2   0   1
2   0   2
2   1   3
3   0   1
3   0   2
3   0   3
3   0   4
4   0   1
4   1   2
5   0   1
5   0   2
5   0   3

      

This sounds like something dplyr can already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.

+3


source to share


4 answers


You can create a new dataframe with matching columns id

and time

for long format and then merge it into the original. This leaves NA

for unmatched values, which can then be replaced with 0

:

merge(foo, 
      with(foo, 
           data.frame(id=rep(id,time), time=sequence(time))
      ), 
      all.y=TRUE
)
##    id time  y
## 1   1    1 NA
## 2   1    2  0
## 3   2    1 NA
## 4   2    2 NA
## 5   2    3  1
## 6   3    1 NA
## 7   3    2 NA
## 8   3    3 NA
## 9   3    4  0
## 10  4    1 NA
## 11  4    2  1
## 12  5    1 NA
## 13  5    2 NA
## 14  5    3  0

      

A similar fusion works for the first decomposition. Concatenate foo

without column time

with the same generated dataframe as above:



merge(foo[c('id','y')], 
      with(foo, 
           data.frame(id=rep(id,time), time=sequence(time))
      )
) 
##    id y time
## 1   1 0    1
## 2   1 0    2
## 3   2 1    1
## 4   2 1    2
## 5   2 1    3
## 6   3 0    1
## 7   3 0    2
## 8   3 0    3
## 9   3 0    4
## 10  4 1    1
## 11  4 1    2
## 12  5 0    1
## 13  5 0    2
## 14  5 0    3

      

You do not need to specify all

(or all.y

) in the last expression because id

there are multiple values for each comparable value time

and they are expanded. In the previous case, the values time

were matched against both data frames, and without specifying all

(or all.y

) you would get your original data.

+3


source


Initial expansion can be achieved with:

newdat <- transform( 
  foo[rep(rownames(foo),foo$time),], 
  time = sequence(foo$time)
)

#    id y time
#1    1 0    1
#1.1  1 0    2
#2    2 1    1
#2.1  2 1    2
#2.2  2 1    3
# etc

      

To get the complete solution, including the bonus part, follow these steps:



newdat$y[-cumsum(foo$time)] <- 0

#    id y time
#1    1 0    1
#1.1  1 0    2
#2    2 0    1
#2.1  2 0    2
#2.2  2 1    3
#etc

      

If you are really horny, you can do it all in one step using within

:

within(
  foo[rep(rownames(foo),foo$time),],
  {
    time <- sequence(foo$time)
    y[-cumsum(foo$time)] <- 0
  }
)

      

+3


source


If you want to go with "data.table" you can try:

library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
#     id y time
#  1:  1 0    1
#  2:  1 0    2
#  3:  2 1    1
#  4:  2 1    2
#  5:  2 1    3
#  6:  3 0    1
#  7:  3 0    2
#  8:  3 0    3
#  9:  3 0    4
# 10:  4 1    1
# 11:  4 1    2
# 12:  5 0    1
# 13:  5 0    2
# 14:  5 0    3

      

And for the bonus question:

fooDT[, list(time = sequence(time)), 
      by = list(id, y)][, y := {y[1:(.N-1)] <- 0; y}, 
                        by = id][]
#     id y time
#  1:  1 0    1
#  2:  1 0    2
#  3:  2 0    1
#  4:  2 0    2
#  5:  2 1    3
#  6:  3 0    1
#  7:  3 0    2
#  8:  3 0    3
#  9:  3 0    4
# 10:  4 0    1
# 11:  4 1    2
# 12:  5 0    1
# 13:  5 0    2
# 14:  5 0    3

      

For a bonus question, alternatively:

fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1, 
                y := c(rep.int(0, .N-1L), 1), by=id][]

      

+3


source


With dplyr (and magritte for good legibility):

library(magrittr)
library(dplyr)

foo[rep(1:nrow(foo), foo$time), ] %>%
    group_by(id) %>%
    mutate(y = !duplicated(y, fromLast = TRUE),
                  time = 1:n())

      

Hope it helps

0


source







All Articles