How to group data based on time interval in R
I have data that looks like this:
library(plyr)
dates<-data.frame(datecol=as.POSIXct(c(
"2010-04-03 03:02:38 UTC",
"2010-04-03 03:03:14 UTC",
"2010-04-20 03:05:52 UTC",
"2010-04-20 03:07:42 UTC",
"2010-04-21 03:09:38 UTC",
"2010-04-21 03:10:14 UTC",
"2010-04-21 03:12:52 UTC",
"2010-04-23 03:13:42 UTC",
"2010-04-23 03:15:42 UTC",
"2010-04-23 03:16:38 UTC",
"2010-04-23 03:18:14 UTC",
"2010-04-24 03:21:52 UTC",
"2010-04-24 03:22:42 UTC",
"2010-04-24 03:24:19 UTC",
"2010-04-24 03:25:19 UTC"
)), x = cumsum(runif(15)*10),y=cumsum(runif(15)*20))
I want to group my data into 5-day intervals, so that all points that are 5 days apart or less are placed in one group. I tried what was suggested here:
gr<-ddply(dates,.(cut(datecol,"5 day",include.lowest = TRUE)),"[")
But for some reason I end up with 3 groups instead of 2, and the points from 04/21 and 04/23 fall into separate groups, even though they are less than 5 days apart.
This is what I would like to get:
group datecol x y
1 1 2010-04-03 03:02:38 8.112423 4.790036
2 1 2010-04-03 03:03:14 11.184709 22.903475
3 2 2010-04-20 03:05:52 17.306835 32.286891
4 2 2010-04-20 03:07:42 24.071488 38.941709
5 2 2010-04-21 03:09:38 26.451493 48.378477
6 2 2010-04-21 03:10:14 33.090645 53.148149
7 2 2010-04-21 03:12:52 38.536416 64.346574
8 2 2010-04-23 03:13:42 40.911074 79.419002
9 2 2010-04-23 03:15:42 41.977579 89.760210
10 2 2010-04-23 03:16:38 46.838773 95.266709
11 2 2010-04-23 03:18:14 48.367159 112.619969
12 2 2010-04-24 03:21:52 57.470412 113.594423
13 2 2010-04-24 03:22:42 63.202005 123.653370
14 2 2010-04-24 03:24:19 65.615348 137.184153
15 2 2010-04-24 03:25:19 75.177633 137.559003
How about a cumsum() that compares each value with the previous (lagged) one and increments the group number when needed? We use the shift() function from the data.table package to get the lagged values.
library(data.table)
dates$group <- cumsum(ifelse(difftime(dates$datecol,
                                      shift(dates$datecol, fill = dates$datecol[1]),
                                      units = "days") >= 5,
                             1, 0)) + 1
head(dates)
# datecol x y group
#1 2010-04-03 03:02:38 4.776196 5.160336 1
#2 2010-04-03 03:03:14 13.388291 14.731241 1
#3 2010-04-20 03:05:52 17.769262 30.057454 2
#4 2010-04-20 03:07:42 20.217235 31.742392 2
#5 2010-04-21 03:09:38 20.924025 49.248819 2
#6 2010-04-21 03:10:14 21.918687 56.030278 2
This assumes that your data is sorted by time, from earliest to latest.
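If the rows might not be in order, sorting first makes the gap test safe. Below is a minimal sketch of the same idea in base R, using diff() in place of shift(); the names d and gap_days are illustrative, and the timestamps are a small shuffled subset of the question's data.

```r
# Small unsorted sample; the timestamps are taken from the question.
d <- data.frame(datecol = as.POSIXct(c(
  "2010-04-23 03:13:42",
  "2010-04-03 03:02:38",
  "2010-04-20 03:05:52",
  "2010-04-03 03:03:14",
  "2010-04-24 03:25:19"
), tz = "UTC"))

# Sort first, because the gap test compares each row with the previous one.
d <- d[order(d$datecol), , drop = FALSE]

# Gap to the previous row in days; the first row has no predecessor, so use 0.
gap_days <- c(0, as.numeric(diff(d$datecol), units = "days"))

# Every gap of 5 days or more starts a new group.
d$group <- cumsum(gap_days >= 5) + 1

d
```

As in the answer above, a new group starts only when two consecutive timestamps are at least 5 days apart, so a long chain of closely spaced points stays in one group even if its first and last points are more than 5 days apart.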
You can set the breaks manually so that they are anchored to whatever starting date you want. For example:
library(lubridate)
start.date = ymd_hms("2010-04-20 00:00:00")
breaks = seq(start.date - 30*3600*24, start.date + 30*3600*24, "5 days")
dates$group5 = cut(dates$datecol, breaks=breaks)
datecol x y group5
1 2010-04-03 03:02:38 7.265758 10.80777 2010-03-31
2 2010-04-03 03:03:14 15.632081 13.57187 2010-03-31
3 2010-04-20 03:05:52 19.219491 19.76293 2010-04-20
4 2010-04-20 03:07:42 20.605199 37.22687 2010-04-20
5 2010-04-21 03:09:38 26.533445 53.90345 2010-04-20
6 2010-04-21 03:10:14 33.449645 56.27885 2010-04-20
7 2010-04-21 03:12:52 39.050517 71.74788 2010-04-20
8 2010-04-23 03:13:42 39.499227 76.92669 2010-04-20
9 2010-04-23 03:15:42 44.827766 79.49207 2010-04-20
10 2010-04-23 03:16:38 54.206473 89.60895 2010-04-20
11 2010-04-23 03:18:14 54.982695 94.37855 2010-04-20
12 2010-04-24 03:21:52 64.414931 104.24174 2010-04-20
13 2010-04-24 03:22:42 64.659980 113.77616 2010-04-20
14 2010-04-24 03:24:19 67.343105 128.06813 2010-04-20
15 2010-04-24 03:25:19 71.060741 138.43512 2010-04-20
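To see why the anchor matters, the sketch below compares the bins that cut() derives on its own from the earliest timestamp with bins built from an explicit break sequence. It uses base R only, and the names stamps, default_bins and anchored_bins are illustrative. It also assumes every timestamp is parsed as UTC, since ymd_hms() returns UTC while as.POSIXct() without a tz argument uses the local time zone.

```r
# One timestamp per calendar day from the question, parsed as UTC.
stamps <- as.POSIXct(c(
  "2010-04-03 03:02:38",
  "2010-04-20 03:05:52",
  "2010-04-21 03:09:38",
  "2010-04-23 03:13:42",
  "2010-04-24 03:21:52"
), tz = "UTC")

# Default: 5-day bins start at midnight of the earliest day (2010-04-03),
# so a bin edge falls on 2010-04-23 and splits the late-April cluster.
default_bins <- droplevels(cut(stamps, "5 days"))

# Anchored: bin edges are placed explicitly, with one edge on 2010-04-20.
start_date <- as.POSIXct("2010-04-20 00:00:00", tz = "UTC")
breaks <- seq(start_date - 30 * 86400, start_date + 30 * 86400, by = "5 days")
anchored_bins <- droplevels(cut(stamps, breaks = breaks))

nlevels(default_bins)   # number of non-empty default bins
nlevels(anchored_bins)  # number of non-empty anchored bins
```

Fixed bins only help when a suitable anchor is known in advance; a cluster that happens to straddle a bin edge would still be split.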