How do I count, in R, the number of records (i.e. rows, cells) in a column that belong to each combination of treatments in the same data frame?

This is my first question on this forum and I have limited experience with R, so I apologize if the question is somewhat unclear or too simple.

I have a data frame called values that consists of a sample number column, two factor variables (H and W), and multiple numeric columns (named after the number intervals produced by cut), like this:

sample  H   W   (12.95,13]  (13,13.05]  (13.05,13.1]    (13.1,13.15]
130 bg  d   0   0   0   0
131 bg  d   0   0   0   0
132 bg  d   0   0   0   0
133 x   i   0   0   0   0
134 x   i   0   0   0   0
135 x   i   0   0   0   0
136 x   i   0   0   0   0
137 x   i   0   0   0   0
138 x   i   0   0   0   0
139 x   i   0   0   0   0
140 x   i   0   0   0   0
141 x   i   0   35947.65    0   0
142 x   i   0   0   0   0
143 x   i   0   0   0   0
144 x   i   0   0   0   0
145 x   i   0   0   0   73709.67
146 x   i   0   0   0   0
147 x   i   21756.63    0   32362.41    0
148 x   i   0   0   0   0
149 x   i   0   0   0   0
150 x   i   0   0   0   0
151 x   i   0   0   0   0
152 x   c   0   0   0   0
153 x   c   0   0   0   0
154 x   c   0   0   0   0
155 x   c   0   0   0   32578.03

      

For each column, I need to count how many rows have a value greater than 0, for each combination of treatments (H and W) and sample number. I have tried aggregation, counting and summing functions, but without success so far.

Can anyone help me with this?

Thanks!

+3




4 answers


With data.table (assuming that df is your data frame):

library(data.table)
setDT(df)[`colname`>0, .N, by=list(H, W, sample)]

      

or



setDT(df)[`colname`>0, .N, by=list(H, W)]

      

if you don't care about sample.

Replace colname with the name of the specific column you are looking at. It would be easier for me to check if you provided a reproducible example.
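
To count every value column in one call instead of one colname at a time, something like the following sketch might work. The data frame here is a small made-up stand-in for the asker's data, and value_cols is a name introduced only for illustration; it assumes every column other than sample, H and W holds numbers.

```r
library(data.table)

# Hypothetical stand-in for the asker's data frame
df <- data.frame(
  sample = 1:6,
  H = c("bg", "bg", "x", "x", "x", "x"),
  W = c("d", "d", "i", "i", "c", "c"),
  v1 = c(0, 5, 0, 2, 0, 0),
  v2 = c(0, 0, 3, 4, 0, 1)
)

# Assumption: everything except the id and factor columns is a value column
value_cols <- setdiff(names(df), c("sample", "H", "W"))

# For each (H, W) group, count the rows with a value > 0 in every value column
counts <- setDT(df)[, lapply(.SD, function(x) sum(x > 0)),
                    by = .(H, W), .SDcols = value_cols]
print(counts)
```

Adding sample to by gives the per-sample breakdown, as in the first call above.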

+2




# reproducible example
set.seed(123)
values <- data.frame(sample=1:100,
                     a=rep(1,100),
                     b=rep(c(1,2),50),
                     v1=rbinom(100,1,.1) * runif(100),
                     v2=rbinom(100,1,.1) * runif(100),
                     v3=rbinom(100,1,.1) * runif(100)
                     )

aggregate(cbind(v1, v2, v3) ~ a + b, # apply fcn to LHS grouped by RHS
          data=values,              
          FUN=function(x) sum(x>0)  # sum of TRUE v>0 is count of v greater than 0 
          )
#   a b v1 v2 v3
# 1 1 1  4  4  7
# 2 1 2  3  6  2

      



0




I may have misunderstood the question (my solution seems very simple), but I just count, for each row, the columns whose value is not 0. The output is a numeric vector with a length equal to the number of rows in your data, where:

  • 0 means that no column has a value other than 0
  • 1 means that exactly one column has a non-zero value, 2 means two columns, etc.

     apply(!df[, 4:7] == 0, 1, sum)
    
    [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0 0 0 0 0 1
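
The same per-row count can probably also be written with rowSums, which avoids the apply call. This is only a sketch on a small made-up data frame; the column names a and b and the name nonzero_per_row are placeholders, not the asker's.

```r
# Hypothetical stand-in data: two value columns in positions 4 and 5
df <- data.frame(
  sample = 1:4,
  H = c("bg", "x", "x", "x"),
  W = c("d", "i", "i", "c"),
  a = c(0, 7.5, 2.5, 0),
  b = c(0, 0, 4.0, 0)
)

# TRUE counts as 1, so summing the logical matrix row by row
# gives the number of non-zero columns in each row
nonzero_per_row <- rowSums(df[, 4:5] != 0)
print(nonzero_per_row)
```

Note that this counts per row, not per treatment combination, so it would still need a grouping step to answer the question as asked.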
    
          

0




A partial solution using plyr (I'm sure the dplyr package can do an even better job, but I'm less familiar with it).

The downside is that the sums have to be calculated for each column separately - if there are 3 or 4, that's ok, but for 100 intervals that would not be viable.

##Generate fake data with 3 samples, 2 factors 3 levels each 
##and 3 observations per combination
df <- expand.grid(sample = letters[1:3], 
                  f1 = paste0('x', 1:3), 
                  f2 = paste0('y', 1:3))
df <- rbind(df, df, df)
nums <- matrix(rnorm(4*nrow(df)), ncol = 4)
colnames(nums) <- paste0('val_', 1:4)
nums[nums < 1] <- 0
df <- cbind(df, nums)

##Summarize
library(plyr)
ddply(df, .(sample, f1, f2), summarize, 
           sum_1 = sum(val_1 > 0),
           sum_2 = sum(val_2 > 0))
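
One way around the column-by-column limitation, sketched here in base R rather than plyr, is to turn all value columns into TRUE/FALSE first and let aggregate sum them per group. The data below is made up for illustration, and val_cols and counts are names introduced here; it assumes the value columns share the val_ prefix used above.

```r
# Hypothetical stand-in data with the same column naming as above
df <- data.frame(
  sample = c("a", "b", "c", "a", "b", "c"),
  f1 = c("x1", "x1", "x1", "x2", "x2", "x2"),
  f2 = c("y1", "y1", "y1", "y1", "y2", "y2"),
  val_1 = c(0, 1.5, 0, 2.1, 0, 0),
  val_2 = c(1.2, 1.1, 0, 0, 0, 3.4)
)

# Pick up every value column by name, however many there are
val_cols <- grep("^val_", names(df), value = TRUE)

# df[val_cols] > 0 is a logical matrix; summing it within each
# (f1, f2) group counts the positive entries per column
counts <- aggregate(df[val_cols] > 0, by = df[c("f1", "f2")], FUN = sum)
print(counts)
```

Adding "sample" to the by list would give counts per sample as well.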

      

0








