Given only a vector of dates, expand the data between (unequal) date points

Other questions center on start and end dates (see, for example, "Given a start date and an end date, change / expand the data for every day between (every day in a row)" and "Expand rows by date using start and end dates").

My question is different in that I only have one date column, and I would like to convert unequal date ranges to daily counts. This particular example deals with the number of workers at a workplace at any one time. Different teams of people arrive on different dates.

Below is a small example data frame:

dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))

      

I would like the data to look like this:

dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
       work=c(rep(43,7),rep(95,10),rep(92,3)),
       cumwork=c(rep(43,7),rep(138,10),rep(230,3)))

      

I've been stuck on this for a while. Any help would be appreciated!

UPDATE (7/5/2017): As @Scarabee pointed out, the dates in data frame 'dd' must be of class Date. The code above has been updated to reflect this.


3 answers


One possible way:

First, create the sequence of dates of interest as a one-column data frame:

v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))

      

Then join it to the original data frame and fill in the missing values, for example using dplyr and zoo:



library(dplyr)
library(zoo)

v %>% 
  left_join(dd, by = "date") %>% 
  na.locf

      

NB: I assume your data frame dd actually contains dates (not factors):

dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
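The join-and-fill steps above can also be sketched without any packages. This is only an alternative sketch, not the method used in this answer: it looks up, for each day, the most recent row of dd with base R's findInterval. The names end_date, days, idx and res are made up for the illustration, and the end date is taken from the desired output in the question because dd itself does not contain it.

```r
# Same input as in the question (dates must be of class Date)
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))

# Assumption: end date copied from the desired output; dd does not provide it
end_date <- as.Date("1999-04-10")

# Full daily sequence from the first date to the assumed end date
days <- seq(min(dd$date), end_date, by = "day")

# For each day, the index of the latest row in dd whose date is <= that day
# (dd$date must be sorted in increasing order for findInterval)
idx <- findInterval(as.numeric(days), as.numeric(dd$date))

# Look up the values for every day
res <- data.frame(date = days,
                  work = dd$work[idx],
                  cumwork = dd$cumwork[idx])
res
```

Because every day is mapped straight to a row of dd, no NA values are created and no separate fill step is needed.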

      



A similar solution with base R (and zoo):

dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)

      



First, convert the date column to Date class. Then create the full sequence of dates as a separate data frame and merge it with the original data. Finally, apply "last observation carried forward" (LOCF) to fill in the missing values.
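To make the "last observation carried forward" step concrete, here is a minimal base R sketch of the idea for a single vector. The helper locf is hypothetical and is not part of zoo; zoo::na.locf does more (for example, it handles whole data frames and by default drops leading NAs, which this sketch keeps).

```r
# Hypothetical helper (not from zoo): carry the last non-NA value forward
locf <- function(x) {
  # Position of each non-NA value, 0 where the value is missing
  pos <- ifelse(is.na(x), 0L, seq_along(x))
  # The running maximum is the position of the most recent non-NA value
  last <- cummax(pos)
  # Leading NAs have no earlier value (last == 0), so they stay NA
  x[ifelse(last == 0L, NA, last)]
}

# A column as it might look after merging with the full date sequence
x <- c(43, NA, NA, 95, NA, 92, NA)
locf(x)
```

Applied to each non-date column of the merged data frame, this should give the same filled values as the zoo call for data like the example in the question.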



Here's a really fast solution in base R:

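ExpandDates <- function(df, lastColRepeat) {
    # Number of days between consecutive dates
    myDiff <- as.integer(diff(df$date))
    # One row per day, from the first date through the repeated last date
    dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
                     stringsAsFactors=FALSE)
    # Each row is repeated until the next date; the last row lastColRepeat times
    myDiff <- c(myDiff, lastColRepeat)
    # Expand every non-date column (the date is assumed to be the first column)
    for (i in 2:ncol(df)) {dfOut[,i] <- rep(df[ ,i], times = myDiff)}
    names(dfOut) <- names(df)
    dfOut
}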

      

The last argument sets the number of times to repeat the last value, since nothing in the original data.frame provides it. I'm also assuming that the "date" field is actually of class Date, as noted by @Scarabee.
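As a small illustration of the same idea on the question's data, the sketch below repeats each row by the number of days until the next date. It is a compact variant, not the function above. The value 3 for the last row is an assumption read off the desired output (1999-04-08 through 1999-04-10), and the names last_repeat, reps and res are made up for the example.

```r
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))

# Assumption: the last row is repeated 3 times, as in the desired output
last_repeat <- 3L

# Days until the next date for each row; the final row gets last_repeat
reps <- c(as.integer(diff(dd$date)), last_repeat)

# Repeat each row by its count, then overwrite the dates with a daily sequence
res <- dd[rep(seq_len(nrow(dd)), times = reps), ]
res$date <- dd$date[1] + seq_len(nrow(res)) - 1L
rownames(res) <- NULL
res
```

With these inputs reps is 7, 10 and 3, so the result has 20 rows ending on 1999-04-10.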

Here's some test data:

set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)), 
                                    origin = "1970-01-01"), work = workVec, 
                                                    cumwork = cumsum(workVec))

library(dplyr)
library(zoo)

DplyrTest <- function(dd) {  ## from @Scarabee
    v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
    v %>% 
        left_join(dd, by = "date") %>% 
        na.locf
}

a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)

      

Equality check:

identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE

      

Benchmarks:

library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
                  expr       min        lq      mean    median        uq       max neval cld
     DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395   100   b
ExpandDates(testDF, 1)  1.122384  1.208184  2.521693  1.355564  1.486317  72.23444   100  a 

      
