Sampling a small data frame from a large data frame
I am trying to sample a data frame from a given data frame so that there are enough samples from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought that ddply (data frame to data frame) would do it for me. Taking a minimal example:
library(plyr)
set.seed(1)
data1 <- data.frame(a = sample(c('B0','B1','B2'), 100, replace = TRUE), b = rnorm(100), c = runif(100), stringsAsFactors = TRUE)
> summary(data1$a)
B0 B1 B2
30 32 38
The following commands fetch ...
When I enter ...
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error:
Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, :
cannot take a sample larger than the population when 'replace = FALSE'
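This error occurs because x, inside the function passed to ddply, is not a vector but a data frame. Calling sample on a data frame samples its columns, and there are only 3 of them, so a sample of 20 without replacement is impossible.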
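A small check of that behaviour, using a made-up toy data frame df (not part of my real data): sample treats a data frame as a list of columns, so the "population" is length(df), not nrow(df).

```r
# Toy data frame (hypothetical, for illustration only): 10 rows, 3 columns
df <- data.frame(a = letters[1:10], b = 1:10, c = 10:1)

length(df)  # number of columns
nrow(df)    # number of rows

# sample() on a data frame draws columns, so asking for 2 of them works
cols <- sample(df, 2, replace = FALSE)
ncol(cols)

# Asking for more than length(df) columns fails; capture the error message
res <- tryCatch(sample(df, 20, replace = FALSE),
                error = function(e) conditionMessage(e))
res
```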
Does anyone have any ideas on how to achieve this sampling? I know one way is to skip ddply and just do it in three steps: (1) split, (2) sample, and (3) recombine. But I was wondering, somehow ... with base or plyr functions ...
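For reference, the three-step route could be sketched in base R roughly like this. The data here are built with rep() so that the group sizes are fixed at 30, 32 and 38 as in the summary above, and n_per_level is just a name chosen for this sketch.

```r
set.seed(1)
# Example data with fixed group sizes (30, 32, 38), built for illustration
data1 <- data.frame(a = rep(c('B0', 'B1', 'B2'), times = c(30, 32, 38)),
                    b = rnorm(100), c = runif(100))

n_per_level <- 20  # assumed number of rows to draw per level

# (1) split the data frame by the levels of a
pieces <- split(data1, data1$a)
# (2) sample rows (not columns) within each piece
sampled <- lapply(pieces, function(x) {
  x[sample(nrow(x), n_per_level, replace = FALSE), ]
})
# (3) bind the sampled pieces back into one data frame
data2 <- do.call(rbind, sampled)

table(data2$a)
```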
Thanks for the help...
I think you want to subset the rows of the data frame passed in as x, using sample on the row indices:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But of course, you still need to take care that the sample size for each piece (in this case 20) is no larger than the smallest subset of your data, based on the levels of a. Otherwise you will hit the same error for that piece.
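One possible way to guard against a group that is smaller than the requested sample size is to cap the size at the number of rows available. This is only a sketch: sample_rows is a hypothetical helper name, and the data frame d is made up so that one group has just 5 rows. It uses base R so it runs without plyr.

```r
# Hypothetical helper: draw at most n rows from a data frame x
sample_rows <- function(x, n) {
  k <- min(n, nrow(x))  # never ask for more rows than exist
  x[sample.int(nrow(x), k), , drop = FALSE]
}

# Made-up data with unequal group sizes: B1 has only 5 rows
d <- data.frame(a = rep(c('B0', 'B1', 'B2'), times = c(30, 5, 38)),
                b = seq_len(73))

# Apply the helper to each level of a and recombine
out <- do.call(rbind, lapply(split(d, d$a), sample_rows, n = 20))
table(out$a)  # B0 and B2 contribute 20 rows each, B1 all 5 of its rows
```

With plyr loaded, the same helper should also work as ddply(data1, .(a), sample_rows, n = 20), since ddply passes extra arguments on to the function. Whether to cap the size silently or to stop with an error when a group is too small depends on what the sample is for.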