Subset data by day according to the most non-zero data

Question

I have an example frame:

a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
   "06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)

First, I want to make sure R recognizes the date and time format, so I used:

df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")

However, this does not seem to work, as R aborts my session as soon as I try to view the new data frame.

Assuming this gets resolved, I want to subset the data according to which day in the data frame contains the most non-zero values in 'c'. In the example above, I should be left with the two data points on 7 December 2012.

I also have an additional, related question.
If I want to keep only the subset of the data with the most non-zero values within a certain time window each day (for example, from 07:00 to 08:00), how would I do it?

Any help on the above issues would be greatly appreciated.

r

KT_1 06 Feb At 17:57

2 answers

Agreeing with Jack. It looks like a corrupted R installation. First try deleting the .RData file, which contains the results of the previous session. Files like this are hidden on both Mac and Windows, so unless you make "dotfiles" (system files) visible, the OS file manager (Finder.app or Windows Explorer) won't show them. How you find and delete this file is an OS-specific task. It will be in your working directory, and you will need to delete it outside of R, since R will already have accessed it once it has started. It is also possible to have a corrupted .Rhistory file, but in my experience that is not usually the source of the problem.

If that fails, you may need to reinstall R.
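To check whether those hidden files exist at all, something like the following could be used. This is only a sketch: "dir" is a placeholder for whichever folder is the suspect working directory, and it is best run from a session started somewhere else so the suspect workspace is not loaded first. It only lists the files; the deletion itself is still better done outside of R, as described above.

```r
# Hypothetical: point this at the suspect working directory.
dir <- getwd()

# all.files = TRUE is required, otherwise names starting with "." are skipped.
hidden <- list.files(dir, all.files = TRUE, pattern = "^\\.(RData|Rhistory)$")

# Prints character(0) if neither file is present.
print(hidden)
```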

42- 06 Feb 13 at 18:08

Dinre · Accepted Answer · 2013-02-07T02:51:00+0000

Okay, the good news is I have an answer for you, and the bad news is that you have more questions to ask yourself. The bad news first: you need to consider how you want to handle multiple days that have the same number of non-zero values for "c". I'm not going to address this in this answer.

Now the good news: it's really simple.

Step 1. First, format your data frame. Since we are changing the data types of a couple of variables (b to a date-time and c to numeric), we need to either create a new data frame or overwrite the old one. I prefer to keep the original and create a new one, for example:

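a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")   # parse day/month/year hour:minute
c <- as.numeric(df1$c)                   # "0"/"1" strings to numbers
hour <- as.numeric(format(b, "%H"))      # hour of day, 0-23
date <- format(b, "%x")                  # date only, in the locale's format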

df2 <- data.frame(a, b, c, hour, date)

#   a                   b c hour      date
# 1 1 2012-12-05 05:00:00 0    5 12/5/2012
# 2 2 2012-12-05 06:00:00 0    6 12/5/2012
# 3 3 2012-12-06 05:00:00 0    5 12/6/2012
# 4 4 2012-12-06 06:00:00 1    6 12/6/2012
# 5 5 2012-12-07 09:00:00 1    9 12/7/2012
# 6 6 2012-12-07 07:00:00 1    7 12/7/2012

Note that I've also added the "hour" and "date" variables. This makes it easy to filter and group our data by these fields in the aggregation step later on.
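One thing to be aware of: "%x" formats the date according to the current locale, so the "date" strings may look different on another machine. If that matters, a fixed format can be used for the grouping key instead. This is a small sketch of that idea, not part of the original answer; the two sample timestamps and the explicit tz = "UTC" are assumptions made so the snippet runs on its own.

```r
# Two sample timestamps in the question's day/month/year format.
b <- strptime(c("05/12/2012 05:00", "07/12/2012 09:00"),
              "%d/%m/%Y %H:%M", tz = "UTC")

# A locale-independent day key (ISO style) and the hour of day.
day  <- format(b, "%Y-%m-%d")
hour <- as.numeric(format(b, "%H"))

print(day)   # "2012-12-05" "2012-12-07"
print(hour)  # 5 9
```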

Step 2. Now calculate how many non-zero values there are for each day between 06:00 and 08:00. Since we are using the "hour" values, this means the values "6" and "7" (representing 06:00 to 07:59).

library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))

#   a                   b c hour      date non_zero
# 1 2 2012-12-05 06:00:00 0    6 12/5/2012        0
# 2 4 2012-12-06 06:00:00 1    6 12/6/2012        1
# 3 6 2012-12-07 07:00:00 1    7 12/7/2012        1

The "plyr" package is great for this kind of thing. The "ddply" function specifically uses data frames for both input and output (hence "dd"), and the "mutate" function lets us keep all of the data while adding extra columns. In this case, we want the sum of "c" for each day, which is what .(date) specifies. Summing works as a count here because "c" only contains 0 and 1. The subsetting of our data by hour is handled in the data argument df2[df2$hour %in% 6:7,], which says to give us the rows where the hour value is in the set {6,7}.
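If you would rather not depend on plyr, roughly the same per-day count can be had in base R with ave(). This is a sketch under a few assumptions: the frame is rebuilt by hand with the values from Step 1 (without the "b" column) so that it runs on its own, "win" is just a name picked for the filtered rows, and it counts c != 0 rather than summing "c", which also works if "c" ever holds values other than 0 and 1.

```r
# Same values as df2 from Step 1, typed in by hand (column b omitted).
df2 <- data.frame(
  a    = 1:6,
  c    = c(0, 0, 0, 1, 1, 1),
  hour = c(5, 6, 5, 6, 9, 7),
  date = c("12/5/2012", "12/5/2012", "12/6/2012",
           "12/6/2012", "12/7/2012", "12/7/2012")
)

# Keep only the rows in the 06:00 to 07:59 window.
win <- df2[df2$hour %in% 6:7, ]

# ave() returns the per-group result repeated on every row of that group,
# so each row gets the non-zero count of its own day.
win$non_zero <- ave(as.numeric(win$c != 0), win$date, FUN = sum)

print(win)
```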

Step 3. The last step is just to subset the data to the rows with the maximum number of non-zero values. We can drop the extra columns we used and go back to our original three.

subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]

#   a                   b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
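For the first part of the question (the whole day, with no hour window), a minimal sketch in base R could look like this. The frame is rebuilt by hand with the same values so the snippet runs on its own, and "counts" and "best" are names chosen for illustration. Because it compares against max(counts), all tied days are kept, which is one possible way of handling ties rather than the only one.

```r
# Same values as above, typed in by hand (column b omitted).
df2 <- data.frame(
  a    = 1:6,
  c    = c(0, 0, 0, 1, 1, 1),
  date = c("12/5/2012", "12/5/2012", "12/6/2012",
           "12/6/2012", "12/7/2012", "12/7/2012")
)

# Number of non-zero values of c on each day.
counts <- tapply(df2$c != 0, df2$date, sum)

# Day (or days, if tied) with the highest count.
best <- names(counts)[counts == max(counts)]

# Keep every row belonging to the winning day(s).
subset_df <- df2[df2$date %in% best, ]

print(subset_df)
```

With the example data this leaves the two rows from 7 December, which is what the question expects.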

Good luck!

Update. At the OP's request, I have written a new "ddply" call that also includes a time column for plotting.

df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate,
             non_zero=sum(c),
             plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]

We need to collapse the time into a single continuous variable, so I chose hours. Leaving the data in a date-time format would take more fiddling, and using a string format (like "hh:mm") would limit the types of functions you can use on it. Continuous numbers are the most flexible, so we get the number of hours with as.numeric(format(b, "%H")) and add to it the number of minutes divided by 60, as.numeric(format(b, "%M")) / 60, to convert the minutes to hours. Also, since we are now dealing with a larger number of columns, I have changed the final subset operation to name the columns I want rather than refer to them by number. Once the columns I need are no longer in consecutive order, I find names easier to debug.
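To see what the conversion does on its own, here is a small standalone sketch. The two timestamps are made up for illustration (the second one is not in the example data) and the time zone is fixed to UTC only to keep the result reproducible.

```r
# Two illustrative timestamps; 07:30 is not part of the original data.
b <- as.POSIXct(c("2012-12-06 06:00", "2012-12-07 07:30"),
                format = "%Y-%m-%d %H:%M", tz = "UTC")

# Hours plus minutes expressed as a fraction of an hour.
plot_time <- as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60

print(plot_time)  # 6.0 7.5
```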
