Keeping only the largest groups with data.table

I recently started using the data.table package in R, and I've run into a problem that I don't know how to solve with it.

Sample data:

set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))

      

I can add a group count with the statement

dt[,groupcount := .N ,group]

      

but now I want to keep only the x groups with the highest groupcount. Let's assume x = 1, for example.

I tried chaining like this:

dt[,groupcount := .N ,group][groupcount %in% head(sort(unique(groupcount),decreasing=TRUE),1)]

      

But since groups A and B both have three rows, they both remain in the data.table. With x = 1 I want only one of the largest groups (A or B) to remain. I assume this can be done in one line with data.table. Is that true, and if so, how?


To clarify: x is an arbitrarily chosen number. The function should also work with x = 3, where it will return the 3 largest groups.

+3




3 answers


How about ordering by groupcount with setorder()?



setorder(dt, -groupcount)

x <- 1   
dt[group %in% dt[ , unique(group)][1:x] ]

#   group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3


x <- 3
dt[group %in% dt[ , unique(group)][1:x] ]


#     group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3
# 4:     B 0.9082078          3
# 5:     B 0.2016819          3
# 6:     B 0.8983897          3
# 7:     C 0.9446753          2
# 8:     C 0.6607978          2

## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]

      

+2




Here is a method that uses a join.

x <- 1

dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
#    group     value N
# 1:     A 0.2655087 3
# 2:     A 0.3721239 3
# 3:     A 0.5728534 3

      



The inner data.table is aggregated to count the observations per group, and the positions of the x largest groups are retrieved by subsetting order(-N) with 1:x. The resulting data.table is then joined to the original on group.
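As a rough sketch (not part of the original answer), the join approach could be wrapped in a small function so the same code handles any x. The name keep_largest_groups and the group_col argument are made up for illustration, and head() replaces [1:x] on the assumption that you would rather get every group back than NA rows when x exceeds the number of groups. Ties are broken by order of first appearance.

```r
library(data.table)

# Hypothetical helper: keep the rows belonging to the x largest groups.
# `group_col` is the name of the grouping column, given as a string.
keep_largest_groups <- function(dt, x, group_col = "group") {
  counts <- dt[, .N, by = group_col]   # one row per group with its size N
  top <- head(counts[order(-N)], x)    # x largest groups; fewer if x is too big
  dt[top, on = group_col]              # join back to pull those groups' rows
}

set.seed(1)
dt <- data.table(group = c("A", "A", "A", "B", "B", "B", "C", "C"),
                 value = runif(8))

keep_largest_groups(dt, 1)   # only the rows of group A
keep_largest_groups(dt, 3)   # all three groups, largest first
```

Because the counts are computed inside the function, dt itself is not modified and does not need a groupcount column beforehand.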

+3




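We can do it with table() and sample(), which picks at random among the groups tied for the largest size:

x <- 1
dt[dt[, {
  tbl <- table(group)                  # number of rows per group
  nm <- names(tbl)[tbl == max(tbl)]    # groups tied for the largest size
  if (length(nm) < x) rep(TRUE, .N)    # fewer tied groups than x: keep every row
  else group %in% sample(nm, x)        # otherwise keep x of the tied groups at random
}]]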

      

+2

