Subset of data for most daily records

I am working with a large dataset, an example can be shown below. For most of the individual files, I will have to process data that should be more than one day long.

Date <- c("05/12/2012 05:00:00", "05/12/2012 06:00:00", "05/12/2012 07:00:00",
          "05/12/2012 08:00:00", "06/12/2012 07:00:00", "06/12/2012 08:00:00", 
          "07/12/2012 05:00:00", "07/12/2012 06:00:00", "07/12/2012 07:00:00",
          "07/12/2012 08:00:00")
Date <- strptime(Date, "%d/%m/%Y %H:%M")
c <- c("0","1","5","4","6","8","0","3","10","6")
c <- as.numeric(c)
df1 <- data.frame(Date,c,stringsAsFactors = FALSE)

      

I only want to leave data for one day. This day will be selected as having the most data points for that day. If, for some reason, two days are bound (with the maximum number of data points), I want to select the day that records the highest individual value.

In the above example data, I stayed from December 7th. It has 4 data points (like Dec 5), but has the highest value recorded in those two days (i.e. 10).

+3


source to share


3 answers


Here's a solution with tapply

.



# count rows per day and find maximum c value
res <- with(df1, tapply(c, as.Date(Date), function(x) c(length(x), max(x))))

# order these two values in decreasing order and find the associated day
# (at top position):
maxDate <- names(res)[order(sapply(res, "[", 1), 
                            sapply(res, "[", 2), decreasing = TRUE)[1]]

# subset data frame:
subset(df1, as.character(as.Date(Date)) %in% maxDate)

                  Date  c
7  2012-12-07 05:00:00  0
8  2012-12-07 06:00:00  3
9  2012-12-07 07:00:00 10
10 2012-12-07 08:00:00  6

      

+4


source


A data.table

solution:



dt <- data.table(df1)
# get just the date
dt[, day := as.Date(Date)]
setkey(dt, "day")
# get total entries (N) and max(c) for each day-group
dt <- dt[, `:=`(N = .N, mc = max(c)), by=day]
setkey(dt, "N")
# filter by maximum of N
dt <- dt[J(max(N))]
setkey(dt, "mc")
# settle ties with maximum of c
dt <- dt[J(max(mc))]
dt[, c("N", "mc", "day") := NULL]
print(dt)

#                   Date  c
# 1: 2012-12-07 05:00:00  0
# 2: 2012-12-07 06:00:00  3
# 3: 2012-12-07 07:00:00 10
# 4: 2012-12-07 08:00:00  6

      

+4


source


And to be complete, here's one with plyr

:

library(plyr)
df1$day <- strftime(df1$Date, "%d/%m/%Y")
tmp <- ddply(df1[,c("day","c")], .(day), summarize, nb=length(c), max=max(c))
tmp <- tmp[order(tmp$nb, tmp$max, decreasing=TRUE),]
df1[df1$day==tmp$day[1],]

      

What gives:

                  Date  c        day
7  2012-12-07 05:00:00  0 07/12/2012
8  2012-12-07 06:00:00  3 07/12/2012
9  2012-12-07 07:00:00 10 07/12/2012
10 2012-12-07 08:00:00  6 07/12/2012

      

+3


source







All Articles