Subset of data for most daily records

Question

I am working with a large dataset; a small example is shown below. Most of the individual files I have to process contain more than one day of data.

Date <- c("05/12/2012 05:00:00", "05/12/2012 06:00:00", "05/12/2012 07:00:00",
          "05/12/2012 08:00:00", "06/12/2012 07:00:00", "06/12/2012 08:00:00", 
          "07/12/2012 05:00:00", "07/12/2012 06:00:00", "07/12/2012 07:00:00",
          "07/12/2012 08:00:00")
Date <- strptime(Date, "%d/%m/%Y %H:%M:%S")
c <- c("0","1","5","4","6","8","0","3","10","6")
c <- as.numeric(c)
df1 <- data.frame(Date,c,stringsAsFactors = FALSE)

I only want to keep the data for one day: the day with the most data points. If two days are tied for the maximum number of data points, I want to select the one that contains the highest individual value.

In the example data above, I would keep December 7th. It has 4 data points (as does December 5th), but it has the highest value recorded on those two days (i.e. 10).
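To make the selection rule concrete, the per-day counts and maxima for the example can be tabulated as in the sketch below. It rebuilds the example data so that it runs on its own; parsing with tz = "UTC" and the names day and summary_tab are assumptions made for illustration, not part of the original data.

```r
# Rebuild the example data. tz = "UTC" is an assumption that keeps the
# day boundaries independent of the local time zone.
Date <- as.POSIXct(c("05/12/2012 05:00:00", "05/12/2012 06:00:00",
                     "05/12/2012 07:00:00", "05/12/2012 08:00:00",
                     "06/12/2012 07:00:00", "06/12/2012 08:00:00",
                     "07/12/2012 05:00:00", "07/12/2012 06:00:00",
                     "07/12/2012 07:00:00", "07/12/2012 08:00:00"),
                   format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
df1 <- data.frame(Date, c = c(0, 1, 5, 4, 6, 8, 0, 3, 10, 6))

# Calendar day of each record, as a character key
day <- format(df1$Date, "%Y-%m-%d")

# Number of records and largest value per day
n_per_day   <- tapply(df1$c, day, length)
max_per_day <- tapply(df1$c, day, max)

summary_tab <- data.frame(day   = names(n_per_day),
                          n     = as.vector(n_per_day),
                          max_c = as.vector(max_per_day))
print(summary_tab)
#          day n max_c
# 1 2012-12-05 4     5
# 2 2012-12-06 2     8
# 3 2012-12-07 4    10
```

December 5th and 7th are tied at 4 records each, and the larger maximum (10) breaks the tie in favour of December 7th.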

+3

r dataframe subset

KT_1 11 Feb At 13:07

3 answers

A data.table solution:

library(data.table)
dt <- data.table(df1)
# get just the date
dt[, day := as.Date(Date)]
setkey(dt, "day")
# get total entries (N) and max(c) for each day-group
dt <- dt[, `:=`(N = .N, mc = max(c)), by=day]
setkey(dt, "N")
# filter by maximum of N
dt <- dt[J(max(N))]
setkey(dt, "mc")
# settle ties with maximum of c
dt <- dt[J(max(mc))]
dt[, c("N", "mc", "day") := NULL]
print(dt)

#                   Date  c
# 1: 2012-12-07 05:00:00  0
# 2: 2012-12-07 06:00:00  3
# 3: 2012-12-07 07:00:00 10
# 4: 2012-12-07 08:00:00  6

+4

Arun 11 Feb 13 at 13:17

And to be complete, here's one with plyr:

library(plyr)
df1$day <- strftime(df1$Date, "%d/%m/%Y")
tmp <- ddply(df1[,c("day","c")], .(day), summarize, nb=length(c), max=max(c))
tmp <- tmp[order(tmp$nb, tmp$max, decreasing=TRUE),]
df1[df1$day==tmp$day[1],]

Which gives:

                  Date  c        day
7  2012-12-07 05:00:00  0 07/12/2012
8  2012-12-07 06:00:00  3 07/12/2012
9  2012-12-07 07:00:00 10 07/12/2012
10 2012-12-07 08:00:00  6 07/12/2012

+3

juba 11 Feb 13 at 13:33

Sven Hohenstein · Accepted Answer · 2013-02-11T13:26:02+0000

Here's a solution with tapply.

# count rows per day and find maximum c value
res <- with(df1, tapply(c, as.Date(Date), function(x) c(length(x), max(x))))

# order these two values in decreasing order and find the associated day
# (at top position):
maxDate <- names(res)[order(sapply(res, "[", 1), 
                            sapply(res, "[", 2), decreasing = TRUE)[1]]

# subset data frame:
subset(df1, as.character(as.Date(Date)) %in% maxDate)

                  Date  c
7  2012-12-07 05:00:00  0
8  2012-12-07 06:00:00  3
9  2012-12-07 07:00:00 10
10 2012-12-07 08:00:00  6
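Since the question mentions many individual files, the same idea could be wrapped in a small helper and applied to each data frame in turn. The following is only a sketch in base R: the function name keep_busiest_day and its arguments are hypothetical, and the example data is rebuilt with tz = "UTC" (an assumption) so the block runs on its own.

```r
# Hypothetical helper: keep only the rows of the day with the most records,
# breaking ties by the largest value in value_col.
keep_busiest_day <- function(df, time_col = "Date", value_col = "c") {
  # Calendar day of each row, as a character key
  day <- format(df[[time_col]], "%Y-%m-%d")
  n <- tapply(df[[value_col]], day, length)  # records per day
  m <- tapply(df[[value_col]], day, max)     # maximum value per day
  # Sort by count (descending), then by maximum (descending); take the first
  best <- names(n)[order(-n, -m)[1]]
  df[day == best, , drop = FALSE]
}

# Example data from the question (tz = "UTC" is an assumption)
Date <- as.POSIXct(c("05/12/2012 05:00:00", "05/12/2012 06:00:00",
                     "05/12/2012 07:00:00", "05/12/2012 08:00:00",
                     "06/12/2012 07:00:00", "06/12/2012 08:00:00",
                     "07/12/2012 05:00:00", "07/12/2012 06:00:00",
                     "07/12/2012 07:00:00", "07/12/2012 08:00:00"),
                   format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
df1 <- data.frame(Date, c = c(0, 1, 5, 4, 6, 8, 0, 3, 10, 6))

res <- keep_busiest_day(df1)
print(res)
```

With a list of data frames (for example, one per file), lapply(list_of_dfs, keep_busiest_day) would apply the same rule to each; list_of_dfs is a placeholder name here.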
