Given only a vector of dates, expand the data between (unequal) date points
Other questions revolve around having both a start date and an end date (see, for example, "Given start date and end date, reshape/expand data for each day between (each day on a row)" and "Expand rows by date range using start and end date").
My question is different in that I only have one date column, and I would like to convert unequal date ranges into daily counts. This made-up example deals with the number of workers on a work site at a given time. Different crews arrive on different dates.
Below is a small data frame:
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
I would like the data to look like this:
dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
work=c(rep(43,7),rep(95,10),rep(92,3)),
cumwork=c(rep(43,7),rep(138,10),rep(230,3)))
I've been stuck on this for a while. Any help would be appreciated!
UPDATE (7/5/2017): As @Scarabee pointed out, the dates in data frame 'dd' must be of class Date. The code above has been updated to reflect this.
One possible way:
First, create the sequence of dates you are interested in as a one-column data frame:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Then left-join it to the original data frame and fill in the missing values, for example using dplyr and zoo:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I am assuming that your data frame dd actually contains dates (not factors):
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
A similar solution with base R (and zoo):
dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)
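First, convert the date column to class Date. Then build a data frame holding the full daily sequence and merge it with the original data, keeping every row of the sequence (all.x=TRUE). Finally, apply "last observation carried forward" (na.locf) to fill the gaps. Note that this sequence stops at the last date in dd (1999-04-08); to extend the output to 1999-04-10 as in the question, pass that end date to seq.Date explicitly.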
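As a rough sketch of what "last observation carried forward" does here, the same fill can be written in base R alone with findInterval, which returns, for each day, the index of the most recent date in dd. The names all_days, idx and filled are made up for this illustration, and the end date 1999-04-10 is taken from the desired output in the question rather than from dd.

```r
# Data from the question
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))

# Full daily sequence (end date assumed from the question's desired output)
all_days <- seq(min(dd$date), as.Date("1999-04-10"), by = "day")

# For each day, index of the latest row of dd dated on or before that day
idx <- findInterval(as.numeric(all_days), as.numeric(dd$date))

# Carry the matching row's values forward onto every day
filled <- data.frame(date = all_days,
                     work = dd$work[idx],
                     cumwork = dd$cumwork[idx])
```

This relies on dd being sorted by date and on the sequence not starting before the first date in dd; otherwise findInterval would return 0 for the early days.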
Here's a really fast solution in base R:
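ExpandDates <- function(df, lastColRepeat) {
  # Days between consecutive dates, i.e. how many times each row repeats
  myDiff <- as.integer(diff(df$date))
  # Daily dates from the first date through the last date plus the tail
  dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
                      stringsAsFactors = FALSE)
  myDiff <- c(myDiff, lastColRepeat)
  # Repeat every non-date column by its run length (date must be column 1)
  for (i in seq_along(df)[-1]) {dfOut[, i] <- rep(df[, i], times = myDiff)}
  names(dfOut) <- names(df)
  dfOut
}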
The last argument sets how many times the last row is repeated, since nothing in the original data.frame determines that value. Like @Scarabee, I'm also assuming that the "date" field is actually of class Date.
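To make the run-length idea concrete, here is a minimal sketch on the question's data that repeats row indices instead of looping over columns. The names tail_days, reps and out are made up for this illustration, and tail_days = 3 is an assumption taken from the desired output in the question (three rows for the last date).

```r
# Data from the question
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))

# Assumed number of days to repeat the final row
tail_days <- 3L

# Run length of each row: the gap to the next date, then the assumed tail
reps <- c(as.integer(diff(dd$date)), tail_days)

# Repeat each row by its run length, then overwrite date with consecutive days
out <- dd[rep(seq_len(nrow(dd)), times = reps), ]
out$date <- dd$date[1] + seq_len(nrow(out)) - 1L
rownames(out) <- NULL
```

On this data, out has 20 rows and the same work and cumwork columns as the dw data frame in the question.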
Here's some test data:
set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)),
origin = "1970-01-01"), work = workVec,
cumwork = cumsum(workVec))
library(dplyr)
library(zoo)
DplyrTest <- function(dd) { ## from @Scarabee
v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
v %>%
left_join(dd, by = "date") %>%
na.locf
}
a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)
Equality check:
identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
                   expr       min        lq      mean    median        uq       max neval cld
      DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395   100   b
 ExpandDates(testDF, 1)  1.122384  1.208184  2.521693  1.355564  1.486317  72.23444   100  a