How to randomly select rows of data with unique column values
The ultimate goal is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) from avg_score
for 10 random-sized samples per word from a dataset similar to:
library (data.table)
set.seed(1)
df <- data.frame(
word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
item_score = runif(40),
avg_score = rnorm(40),
line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
validity = sample(c("T", "F"), 40, replace = T)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[df$validity=="F"] <- 0
df <- dt
<df
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
4: 1 cat 234 234 0.93296878 -0.19981748 34 T 4 2
5: 2 house 39 34 0.49471189 0.68924373 25 F 1 0
6: 2 house 39 34 0.64499368 0.03614551 32 T 2 1
7: 2 house 39 34 0.17580259 1.94353631 573 F 3 0
8: 3 sung 876 876 0.60299465 0.73721373 3 T 1 1
9: 3 sung 876 876 0.88775767 2.32133393 673 F 2 0
10: 4 door 4 4 0.49020940 0.34890935 899 T 1 1
11: 4 door 4 4 0.01838357 -1.13391666 904 T 2 2
The data represents each occurrence of a word in the document, so it is possible for a word to appear more than once on the same line. The challenge is for the sample size to represent unique column values ββ( line
), but return all instances where the row number is the same, which means that the actual number of rows returned may be greater than the given sample size. So for a single two-word sample size test for "cat" the shape of the desired result would look like this:
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
My main iteration (found on this site) currently looks like this:
for (i in 1:10) {
a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = T), ])
b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = T), ])}
}
So, I can make standard sizes of random samples, but I am not sure (and could not find something similar or did not look for the right way) how to approach the goal stated above. Is there a direct way to approach this?
Thank,
source to share
Here is a data.table solution that uses a join on a discrete data table.
set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]
The internal data table consists of two columns, a word and a row, and has two rows for each word, each with a unique meaning for the row. The values ββfor the string are returned by a character sample
that gets the unique string values ββand is done separately for each word (using by=word
). You can change the number of unique line values ββby changing 2 to your desired value. This data table connects to the main data table to select the desired rows.
In this case, you get
word_ID word true_count occurrences item_score avg_score line validity
1: 1 cat 234 234 0.26550866 0.91897737 71 F
2: 1 cat 234 234 0.57285336 0.07456498 71 T
3: 1 cat 234 234 0.37212390 0.78213630 234 T
4: 2 house 39 34 0.89838968 -0.05612874 32 T
5: 2 house 39 34 0.94467527 -0.15579551 573 F
6: 3 sung 876 876 0.62911404 -0.47815006 673 T
7: 3 sung 876 876 0.66079779 -1.47075238 3 T
8: 4 door 4 4 0.06178627 0.41794156 899 F
9: 4 door 4 4 0.38410372 -0.05380504 55 F
10: 5 pretty 67 65 0.71761851 -0.39428995 100 F
11: 5 pretty 67 65 0.38003518 1.10002537 100 F
12: 5 pretty 67 65 0.49769924 -0.41499456 1003 F
13: 6 towel 81 81 0.21214252 -0.25336168 326 F
14: 6 towel 81 81 0.93470523 -0.16452360 87 F
15: 7 car 90 90 0.12555510 0.55666320 32 T
16: 7 car 90 90 0.26722067 -0.68875569 54 F
17: 8 island 43 43 0.01339033 0.36458196 87 T
18: 8 island 43 43 0.38238796 0.76853292 988 F
19: 8 island 43 43 0.86969085 -0.11234621 988 T
20: 9 ran 54 54 0.59956583 -0.61202639 754 F
21: 9 ran 54 54 0.82737332 1.43302370 4276 F
22: 10 pizza 53 51 0.79423986 -0.36722148 93 F
23: 10 pizza 53 51 0.41127443 -0.13505460 49 T
word_ID word true_count occurrences item_score avg_score line validity
source to share
If you select from the deduplicated data.frame and do a subsequent left join with the original data, you can provide what you need.
I am not proficient data.table
, so I will use basic functions. ( dplyr
Will work fine too, but since you are using data.table
I won't do that for now.) (Since I'm about to hit submit, @lmo provided a dt
-specific answer ...)
By "de-duplicate" I mean:
subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36 2
head(subdf)
# word_ID line
# 1 1 71
# 2 1 234
# 4 1 34
# 5 2 25
# 6 2 32
# 7 2 573
Note that subdf
it only has three rows for 1, whereas the original data has 4:
df[1:4,]
# word_ID word true_count occurrences item_score avg_score line validity
# 1 1 cat 234 234 0.2655087 0.91897737 71 F
# 2 1 cat 234 234 0.3721239 0.78213630 234 T
# 3 1 cat 234 234 0.5728534 0.07456498 71 T
# 4 1 cat 234 234 0.9082078 -1.98935170 34 T
I'm using by
here instead lapply/split
, but the results should be the same:
out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
# word_ID line word true_count occurrences item_score avg_score validity
# 1 1 34 cat 234 234 0.9082078 -1.98935170 T
# 2 1 71 cat 234 234 0.5728534 0.07456498 T
# 3 1 71 cat 234 234 0.2655087 0.91897737 F
source to share