R - setting the equiprobability for a certain variable when sampling
I have a dataset with over 2 million records that I am loading into a dataframe.
I am trying to grab a subset of the data. I need about 10,000 records, but I need records to be selected with equal probability for one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
As you can see, I have 3 variables: category, id and text.
I've tried the following:
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
It works, but the categories are not sampled with equal probability. Here's the output of count(sample_data$category):
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the result of count(data$category):
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try to set the probability, I get the following error:
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sampling function randomly selects row numbers, but I cannot figure out how to relate this to a probability per category.
Question: How can I sample my data with equal probability for a category variable?
Thanks in advance.
I think you could do this with a simple base R operation, though keep in mind that you are passing probabilities to sample here, so this method will not give you the exact same count in each category. You can get close enough with a large enough sample.
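For context on the error in the question: sample() expects prob to hold one weight per element being sampled (one per row here), not one per category. A minimal sketch on a toy vector (the names toy, res and w are made up for illustration):

```r
# Toy data: 3 categories of very different sizes (hypothetical example)
toy <- rep(c("a", "b", "c"), times = c(10, 100, 1000))

# One probability per category fails, because prob needs one entry per element
res <- try(sample(length(toy), 30, prob = rep(1/3, 3)), silent = TRUE)
inherits(res, "try-error")  # TRUE

# One weight per element: each element gets 1 / (size of its category)
w <- 1 / as.numeric(table(toy)[toy])
length(w) == length(toy)

# With these weights every category carries the same total weight
tapply(w, toy, sum)
```

Since the weights only shift the odds of each draw, the resulting counts per category are approximately equal, not exactly equal.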
Here's some example data:
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Inverse of each category's share, used as a relative weight
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: A possible data.table equivalent:
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit # 2: Here's a very good solution using the dplyr package, provided by @Khashaa and @docendodiscimus. The best part about this solution is that it returns the exact sample size in each group:
library(dplyr)
train_set <- data %>%
group_by(category) %>%
sample_n(1000)
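As a side note, in dplyr 1.0.0 and later sample_n() is superseded by slice_sample(). A sketch of the same grouped sampling, assuming a recent dplyr is installed (the toy data frame is made up for illustration):

```r
library(dplyr)

set.seed(123)
# Hypothetical toy data: 10 categories with 1000, 2000, ..., 10000 rows
toy <- data.frame(category = rep(letters[1:10], seq(1000, 10000, by = 1000)))

train_set <- toy %>%
  group_by(category) %>%
  slice_sample(n = 1000) %>%  # draws 1000 rows from every category
  ungroup()

table(train_set$category)
```

Each group must contain at least as many rows as requested, unless you sample with replace = TRUE.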
Edit # 3: It seems that the data.table equivalent of dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
which will also return the exact sample size in each group
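If you would rather avoid extra packages, the same per-group idea can probably be done in base R too. This is only a sketch on made-up toy data; n_per_group and idx are illustrative names:

```r
set.seed(123)
# Hypothetical toy data: 10 categories with 1000, 2000, ..., 10000 rows
toy <- data.frame(category = rep(letters[1:10], seq(1000, 10000, by = 1000)))

n_per_group <- 100  # illustrative size, pick whatever you need per category

# Split the row numbers by category, then draw n_per_group rows from each group
idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$category),
  function(rows) rows[sample.int(length(rows), n_per_group)]
))

train_set <- toy[idx, , drop = FALSE]
table(train_set$category)
```

For the data in the question, 16 categories at 625 rows each would give exactly 10,000 records, and the smallest category (6,473 rows) is large enough for that.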