Subset of df with repeating sequences
I have searched high and low for a solution to this question, but I cannot find it ...
My dataframe (essentially a table of the no. 1 sports team by date) has numerous cases where one or more teams "re-appear" in the data. I want to output the start (or end) date of each period at no. 1 per team.
An example of data could be:
x1<- as.Date("2013-12-31")
adddate1 <- 1:length(teams1)
dates1 <- x1 + adddate1
teams2 <- c(rep("w", 3), rep("c", 8), rep("w", 4))
x2<- as.Date("2012-12-31")
adddate2 <- 1:length(teams2)
dates2 <- x2 + adddate2
dates <- c(dates2, dates1)
teams <- c(teams2, teams1)
df <- data.frame(dates, teams)
df$year <- year(df$dates)  # year() is assumed to come from lubridate (library(lubridate))
which for 2013 looks like this:
        dates teams year
1  2013-01-01     w 2013
2  2013-01-02     w 2013
3  2013-01-03     w 2013
4  2013-01-04     c 2013
5  2013-01-05     c 2013
6  2013-01-06     c 2013
7  2013-01-07     c 2013
8  2013-01-08     c 2013
9  2013-01-09     c 2013
10 2013-01-10     c 2013
11 2013-01-11     c 2013
12 2013-01-12     w 2013
13 2013-01-13     w 2013
14 2013-01-14     w 2013
15 2013-01-15     w 2013
However, ddply groups all rows with the same team name together, so the following returns only one row per team per year:
split <- ddply(df, .(year, teams), head,1)
split <- split[order(split[,1]),]
       dates teams year
2 2013-01-01     w 2013
1 2013-01-04     c 2013
3 2014-01-01     c 2014
4 2014-01-09     k 2014
Is there a more elegant way to do this than creating a function that loops through the original df and returns a unique value for each subset, adding that to df, and then using ddply with the new unique value to return what I want?
You say that some teams "re-appear", and at that point I thought the little helper function intergroup from this answer might be just the right tool here. It is useful when, as in your case, a team such as "w" reappears in the same year (e.g. 2013) after another team, such as "c", has been there for a while. If you want to treat each consecutive run of a team as a separate group, in order to get the first or last date of that run, this is where the function is useful. Note that if you grouped only by "year" and "teams", as you normally would, each team (e.g. "w") could have only one first/last date per year (for example, when using summarise in dplyr).
Define a function:
# Returns a run id per element: the id increases whenever the value changes
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}
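To see what intergroup returns, here is a small sketch on a made-up vector (the `teams` values below are illustrative, not the question's data). Because the function accumulates the absolute difference between factor codes, the ids are not necessarily consecutive, but each run of identical values gets its own id, which is all that grouping needs.

```r
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Hypothetical example vector: "w" appears in two separate runs
teams <- c("w", "w", "c", "c", "k", "w")

# Factor levels are sorted alphabetically: c = 1, k = 2, w = 3
ids <- intergroup(teams)
print(ids)
# 1 1 3 3 4 5

# One distinct id per run (4 runs here), even though the ids skip 2
print(length(unique(ids)))
# 4
```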
Now group your data first by year and then additionally by the result of applying intergroup to the teams column:
library(dplyr)
df %>%
  group_by(year) %>%
  # on dplyr >= 1.0.0, write .add = TRUE instead of add = TRUE
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  filter(dense_rank(dates) == 1)
Finally, you can filter according to your needs. Here, for example, I keep the earliest date in each group. The result will be:
#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3
Note that team "w" appears again because we grouped by the "teamindex" column created with the intergroup function.
Another option for filtering is to use arrange and then slice:
df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  arrange(dates) %>%
  slice(1)
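If both the start and the end date of each run are wanted in one table, a possible variation is to summarise instead of filtering. This is only a sketch: the column names `first_date` and `last_date` and the object `periods` are made up for illustration, the 2013 data is rebuilt inline so the block is self-contained, and the run id is added with mutate before regrouping so that the `add` / `.add` argument is not needed.

```r
library(dplyr)

# Rebuild the 2013 part of the example data
df <- data.frame(
  dates = as.Date("2012-12-31") + 1:15,
  teams = c(rep("w", 3), rep("c", 8), rep("w", 4)),
  stringsAsFactors = FALSE
)
df$year <- as.integer(format(df$dates, "%Y"))

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

periods <- df %>%
  group_by(year) %>%
  mutate(teamindex = intergroup(teams)) %>%   # run id within each year
  group_by(year, teamindex, teams) %>%        # one group per run
  summarise(first_date = min(dates), last_date = max(dates)) %>%
  ungroup()

print(periods)
```

For the data above this gives three rows: "w" from 2013-01-01 to 2013-01-03, "c" from 2013-01-04 to 2013-01-11, and "w" again from 2013-01-12 to 2013-01-15.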
The data I used is from akrun's answer.
You can also use rle to create teamindex.
library(dplyr)
df %>%
  group_by(year) %>%
  # on dplyr >= 1.0.0, write .add = TRUE instead of add = TRUE
  group_by(teamindex = with(rle(teams),
                            rep(seq_along(lengths), lengths)), add = TRUE) %>%
  filter(dates == min(dates))  # or filter(dates == max(dates)) for the end date
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3
Or
df %>%
  group_by(year) %>%
  group_by(teamindex = with(rle(teams),
                            rep(seq_along(lengths), lengths)), add = TRUE) %>%
  arrange(dates) %>%
  slice(n())  # or slice(1) for the start date
#       dates teams year teamindex
#1 2013-01-03     w 2013         1
#2 2013-01-11     c 2013         2
#3 2013-01-15     w 2013         3
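To make the rle expression above easier to follow, here is a small base R sketch on a made-up vector (the values and the names `runs`, `starts`, and `ends` are illustrative only). It shows how the run lengths are expanded into a consecutive run id, and how the same lengths give the first and last row position of each run.

```r
# Hypothetical example vector: "w" appears in two separate runs
teams <- c("w", "w", "w", "c", "c", "w")

runs <- rle(teams)
print(runs$values)   # "w" "c" "w"
print(runs$lengths)  # 3 2 1

# Repeat each run number as many times as the run is long
teamindex <- rep(seq_along(runs$lengths), runs$lengths)
print(teamindex)     # 1 1 1 2 2 3

# Row positions where each run ends and starts
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L
print(starts)        # 1 4 6
print(ends)          # 3 5 6
```

Note that rle needs a plain atomic vector, so if teams is stored as a factor it would have to be converted with as.character first.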
data
df <- structure(list(dates = structure(c(15706, 15707, 15708, 15709,
15710, 15711, 15712, 15713, 15714, 15715, 15716, 15717, 15718,
15719, 15720), class = "Date"), teams = c("w", "w", "w", "c",
"c", "c", "c", "c", "c", "c", "c", "w", "w", "w", "w"), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), .Names = c("dates", "teams",
"year"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15"), class = "data.frame")