Sampling a smaller data frame from a large data frame

I am trying to sample a data frame from a given data frame so that there are enough samples from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought that ddply (data frame to data frame) would do it for me. Taking a minimal example:

set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2 
30 32 38

      

When I enter the following command:

data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))

      

I get the following error:

   Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace,  : 
  cannot take a sample larger than the population when 'replace = FALSE'

      

This error occurs because x, inside the function passed to ddply, is not a vector but a data frame.
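To make the cause concrete, here is a minimal base R sketch (not part of the original post). A data frame is a list of columns, so length(x) is the number of columns, and sample() ends up drawing columns rather than rows. The names cols and msg are only illustrative.

```r
set.seed(1)
data1 <- data.frame(a = sample(c("B0", "B1", "B2"), 100, replace = TRUE),
                    b = rnorm(100), c = runif(100))

# A data frame is a list of columns: length() counts columns, nrow() counts rows.
length(data1)  # 3
nrow(data1)    # 100

# sample() indexes x[sample.int(length(x), size)], so on a data frame
# it picks columns. Asking for 2 gives a data frame with 2 columns.
cols <- sample(data1, 2)

# Asking for 20 without replacement exceeds the 3 available columns,
# which is the error reported above. Capture the message instead of stopping.
msg <- tryCatch(sample(data1, 20, replace = FALSE),
                error = function(e) conditionMessage(e))
msg
```

Sampling row indices with sample(nrow(x), ...) and then subsetting the rows avoids this.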

Does anyone have any ideas on how to achieve this kind of sampling? I know one way is to not use ddply and just (1) split, (2) sample, and (3) combine in three steps. But I was wondering whether there is some way to do it with base or plyr functions.
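For reference, the three-step approach mentioned above might look like the following in base R. This is only a sketch: the data is a deterministic stand-in built with the group sizes shown in the question (30, 32, 38), and the names pieces, sampled and data2 are illustrative.

```r
set.seed(1)
# Stand-in data (assumption): same columns and group sizes as in the question.
data1 <- data.frame(a = rep(c("B0", "B1", "B2"), times = c(30, 32, 38)),
                    b = rnorm(100), c = runif(100))

# (1) split the data frame into one piece per level of a
pieces <- split(data1, data1$a)

# (2) sample 20 row indices within each piece and keep those rows
sampled <- lapply(pieces, function(x) x[sample(nrow(x), 20, replace = FALSE), ])

# (3) combine the sampled pieces back into a single data frame
data2 <- do.call(rbind, sampled)

table(data2$a)  # 20 rows per level
```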

Thanks for the help...

+3




2 answers


I think you want to subset the rows of the data frame passed in as x, using sample:

ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])

      



But of course, you still need to make sure that the sample size for each piece (in this case 20) is no larger than the smallest subset of your data defined by the levels of a.
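One way to check this before sampling is to compare the requested size with the smallest group, as in the sketch below. The data is a stand-in with the group sizes from the question, and n, group_sizes and sample_rows_capped are hypothetical names, not something from plyr.

```r
set.seed(1)
# Stand-in data (assumption): group sizes 30, 32, 38 as in the question.
data1 <- data.frame(a = rep(c("B0", "B1", "B2"), times = c(30, 32, 38)),
                    b = rnorm(100), c = runif(100))

n <- 20
group_sizes <- table(data1$a)

# Sampling without replacement only works if every group has at least n rows.
stopifnot(n <= min(group_sizes))

# Alternative (hypothetical helper): take at most n rows from each group,
# so a small group contributes all of its rows instead of raising an error.
sample_rows_capped <- function(x, n) {
  x[sample.int(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

capped <- do.call(rbind, lapply(split(data1, data1$a), sample_rows_capped, n = 35))
table(capped$a)
```

The capped helper takes the data frame as its first argument, so it should also work as the function passed to ddply, for example ddply(data1, .(a), sample_rows_capped, n = 35).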

+5




It would seem that if you wanted to sample from a category with fewer than 20 rows, you would need replace = TRUE ...

This might do the trick:



ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])
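
A possible refinement, sketched here as an assumption rather than as part of the answer, is to sample with replacement only in groups that are too small, so larger groups still get distinct rows. The demo data, the id column and the sample_rows helper are all hypothetical.

```r
set.seed(1)
# Hypothetical data: one group (B0) has only 5 rows, fewer than the 20 requested.
demo <- data.frame(a = rep(c("B0", "B1", "B2"), times = c(5, 32, 38)),
                   b = rnorm(75))
demo$id <- seq_len(nrow(demo))  # row identifier, used to spot repeated rows

# Use replacement only when the group has fewer than n rows.
sample_rows <- function(x, n) {
  x[sample.int(nrow(x), n, replace = nrow(x) < n), , drop = FALSE]
}

# Base R split / apply / combine; extra arguments to lapply are passed on to sample_rows.
out <- do.call(rbind, lapply(split(demo, demo$a), sample_rows, n = 20))

table(out$a)                               # 20 rows per level
anyDuplicated(out$id[out$a == "B1"]) == 0  # TRUE: no repeats in a large group
```

With plyr loaded, the same helper can be passed directly, as in ddply(demo, 'a', sample_rows, n = 20).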

      

+3

