R data.table and iterating over two columns
I am new to R and am trying to solve the following problem:
There is a table with two columns, books and readers, where books and readers are the book and reader IDs, respectively:
> books = c (1,2,3,1,1,2)
> readers = c(30, 10, 20, 20, 10, 30)
> bt = data.table(books, readers)
> bt
   books readers
1:     1      30
2:     2      10
3:     3      20
4:     1      20
5:     1      10
6:     2      30
For each book pair, I need to count the number of readers who read both of these books, with this algorithm:
for each book
    for each reader of the book
        for each other_book in books of the reader
            increment common_reader_count ((book, other_book), cnt)
To implement the above algorithm, I need to group this data into two lists: 1) a book list containing the readers of each book, and 2) a reader list containing the books each reader has read, such as:
> bookList = list(
+ list(1, list(30, 20, 10)),
+ list(2, list(10, 30)),
+ list(3, list(20))
+ )
>
> readerList = list (
+ list(30, list(1,2)),
+ list(20, list(3,1)),
+ list(10, list(2,1))
+ )
>
Questions:
1) What functions should I use to create these lists from the table of books?
2) From bookList and readerList, how do I pair books with the number of readers who have read both of them? For the log table bt above, the result should be:
((1, 2), 2) ((1,3), 1) ((2,3), 0)
The order of the books in a pair does not matter, so, for example, (1,2) and (2,1) should be reduced to one.
Please advise which functions and data structures I should use to solve this. Thanks!
Update:
Ideally, as a result I need a matrix with the book IDs as both rows and columns. Each cell is the number of readers who read both books of the pair. So, for the above example, the matrix should be:
books | 1 | 2 | 3 |
1 | 1 | 2 | 1 |
2 | 2 | 1 | 0 |
3 | 1 | 0 | 1 |
Which means:
book 1 and 2 are read together by 2 readers
book 1 and 3 are read together by 1 reader
book 2 and 3 are read together by 0 readers
How do I build such a matrix?
Here's another option:
combs <- combn(unique(books), 2)  # Generate combos of books
setkey(bt, books)
both.read <- bt[  # Cartesian join all combos to our data
  data.table(books = c(combs), combo.id = c(col(combs))), allow.cartesian = TRUE
][,
  .(  # For each combo, count readers who show up twice, meaning they've read both books
    read.both = sum(duplicated(readers)),
    book1 = min(books), book2 = max(books)
  ),
  by = combo.id
]
dcast.data.table(  # dcast to desired format
  both.read, book1 ~ book2, value.var = "read.both", fun.aggregate = sum
)
Outputs:
   book1 2 3
1:     1 2 1
2:     2 0 0
Note that by design, this only does non-equivalent combinations (i.e. we do not show books 1-2 and 2-1, only 1-2, since they are the same).
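As a possible alternative to enumerating the combinations first, the same counts can be sketched with a self-join on readers. This is only an illustration under some assumptions: it needs a data.table version that supports the `on=` argument, it assumes each (book, reader) row appears once in bt (apply `unique(bt)` first otherwise), and the names `pairs` and `pair_counts` are made up for this example.

```r
library(data.table)

# Same example data as in the question
bt <- data.table(books   = c(1, 2, 3, 1, 1, 2),
                 readers = c(30, 10, 20, 20, 10, 30))

# Join bt to itself on readers: each result row pairs two books read by the
# same reader. `books` comes from the left table, `i.books` from the right.
pairs <- bt[bt, on = "readers", allow.cartesian = TRUE]

# Keep each unordered pair once (books < i.books) and count shared readers
pair_counts <- pairs[books < i.books, .N, by = .(book1 = books, book2 = i.books)]
setorder(pair_counts, book1, book2)
pair_counts
```

For the example data this should give N = 2 for the pair (1, 2) and N = 1 for (1, 3). Pairs with no common reader, such as (2, 3), simply do not appear in the result.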
Try this:
## gives you a separate list entry for each book (its readers)
list_books <- split(bt$readers, bt$books)
## gives you a separate list entry for each reader (their books)
list_readers <- split(bt$books, bt$readers)
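To connect these lists to the pair counts asked about in the question, here is a rough base R sketch that walks the per-reader list and tallies every unordered book pair. The names `pairs_per_reader` and `common_reader_count` and the "a-b" labelling of a pair are choices made for this example only.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Books read by each reader, as a named list (names are the reader IDs)
list_readers <- split(books, readers)

# For every reader with at least two books, list all unordered pairs as "a-b"
pairs_per_reader <- lapply(list_readers, function(b) {
  b <- sort(unique(b))
  if (length(b) < 2) return(character(0))  # also avoids combn() on a single number
  apply(combn(b, 2), 2, paste, collapse = "-")
})

# Tally how many readers contributed each pair
common_reader_count <- table(unlist(pairs_per_reader))
common_reader_count
```

Sorting the books of each reader before calling `combn()` is what makes (1,2) and (2,1) collapse into the same label. Pairs that no reader has in common do not show up in the tally.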
Another option, with the output as a data.table: the number of books read by each reader and the number of readers of each book:
bt[ , .("N Books" = length(unique(books))), by = readers]
bt[ , .("N Readers" = length(unique(readers))), by = books]
For the second part of your question, I would use the following:
bt2 <- bt[ , .N, by = .(readers, books)]
library(tidyr)
spread(bt2, key = books, value = "N", fill = 0)
The output is a table with one row per reader and one column per book, giving 1 if the reader has read the book and 0 otherwise:
   readers 1 2 3
1:      10 1 1 0
2:      20 1 0 1
3:      30 1 1 0
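A reader-by-book 0/1 table like this is one matrix product away from the book-by-book matrix requested in the update. The following is a minimal base R sketch, not part of the approach above; the names `incidence` and `co_read` are made up for illustration.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Reader-by-book incidence matrix: 1 if the reader read the book, else 0
incidence <- (unclass(table(readers, books)) > 0) * 1L

# crossprod(m) is t(m) %*% m: cell [i, j] counts the readers who read
# both book i and book j
co_read <- crossprod(incidence)
co_read
```

The off-diagonal cells should match the matrix in the question (2 for books 1 and 2, 1 for books 1 and 3, 0 for books 2 and 3). The diagonal holds the number of readers of each book (3, 2, 1 here) rather than the 1s shown in the question, so it would need to be overwritten if that exact layout is required.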
Here's a base R solution for checking whether each pair of books has been read by the same reader. Someone can add one for data.table if you absolutely need to use it:
books <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)
bks <- data.frame(books, readers)
cmb <- t(combn(unique(books), 2))  # one row per book pair
combos <- as.data.frame(cmb)
bktbl <- t(table(bks))             # rows: readers, columns: books
x <- integer(nrow(cmb))
for (i in seq_len(nrow(cmb))) {
  # number of readers who have read both books of pair i
  x[i] <- sum(bktbl[, as.character(cmb[i, 1])] > 0 &
              bktbl[, as.character(cmb[i, 2])] > 0)
}
combos$PairRead <- ifelse(x > 0, "yes", "no")
combos
  V1 V2 PairRead
1  1  2      yes
2  1  3      yes
3  2  3       no