Stratified sampling or proportional sampling in R

I have a dataset generated like this:

myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))

      

The data looks like this: [image of myData]

I would like to draw a stratified sample from myData with a given total sample size, say 50. The sample should preserve the proportions of the original dataset in terms of "group". For example, suppose myData has 200 records, 20 of which belong to group 4; then the resulting sample should contain 50*20/200 = 5 records that belong to group 4. How can I do this in R?



1 answer


You can use my stratified function, giving a value < 1 as the proportion to sample from each group, for example:

## Sample data. Seed for reproducibility 
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))

## Taking the sample
out <- stratified(myData, "group", .3)
out
#     a     b group
# 17 17 -0.02     2
# 8   8  0.74     3
# 25 25  0.62     3
# 49 49 -0.11     3
# 4   4  1.60     3
# 26 26 -0.06     4
# 27 27 -0.16     4
# 7   7  0.49     4
# 12 12  0.39     4
# 40 40  0.76     4
# 32 32 -0.10     4
# 9   9  0.58     5
# 42 42 -0.25     5
# 43 43  0.70     5
# 37 37 -0.39     5
# 11 11  1.51     6

      

Compare the per-group counts in the sample with what we expected.

round(table(myData$group) * .3)
# 
# 2 3 4 5 6 
# 1 4 6 4 1 
table(out$group)
# 
# 2 3 4 5 6 
# 1 4 6 4 1 
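The body of stratified is not shown in this post, so here is a minimal base R sketch of the same idea for the case in the question (about 50 rows out of 200). The helper name sample_by_prop and the data frame bigData are made up for illustration and are not part of the stratified function. Because each group's count is rounded separately, the total can differ slightly from 50.

```r
## Hypothetical helper (not the stratified() function used above):
## draw round(prop * group size) rows from each level of `group`.
sample_by_prop <- function(df, group, prop) {
  # Row indices of df, split by the values of the grouping column
  idx <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(idx, function(i) {
    n <- round(length(i) * prop)
    # Index into i via sample.int() so that a one-row group
    # is not misread by sample() as the range 1:i
    i[sample.int(length(i), n)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

## Assumed data: same recipe as in the question, with N = 200
set.seed(1)
N_big <- 200
bigData <- data.frame(a = 1:N_big,
                      b = round(rnorm(N_big), 2),
                      group = round(rnorm(N_big, 4), 0))

## A total of about 50 out of 200 rows corresponds to prop = 50 / 200
out2 <- sample_by_prop(bigData, "group", 50 / nrow(bigData))

## Sampled counts next to the expected counts per group
table(out2$group)
round(table(bigData$group) * 50 / nrow(bigData))
```

Groups whose rounded count is 0 simply drop out of the sample, so they appear in the second table but not in the first.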

      




You can also easily take a fixed number of samples per group, for example:

stratified(myData, "group", 2)
#     a     b group
# 34 34 -0.05     2
# 17 17 -0.02     2
# 49 49 -0.11     3
# 22 22  0.78     3
# 12 12  0.39     4
# 7   7  0.49     4
# 18 18  0.94     5
# 33 33  0.39     5
# 45 45 -0.69     6
# 11 11  1.51     6
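For readers without access to stratified, a fixed number of rows per group can be sketched in base R along the following lines. The helper name sample_n_per_group and the data frame demo are assumptions for illustration. Capping the request at the group size is a choice made in this sketch and may not match how stratified treats groups that are too small.

```r
## Hypothetical helper: take up to `size` rows from each level of `group`.
sample_n_per_group <- function(df, group, size) {
  # Row indices of df, split by the values of the grouping column
  idx <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(idx, function(i) {
    # Cap at the group size so small groups do not raise an error
    i[sample.int(length(i), min(size, length(i)))]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

## Assumed data: same recipe as the sample data above, with 50 rows
set.seed(1)
demo <- data.frame(a = 1:50,
                   b = round(rnorm(50), 2),
                   group = round(rnorm(50, 4), 0))

two_each <- sample_n_per_group(demo, "group", 2)
table(two_each$group)
```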

      
