R - group_by n_distinct to summarize

My dataset looks like this

library(dplyr)

dta = rbind(c(1,'F', 0), 
  c(1,'F', 0), 
  c(1,'F', 0), 
  c(2,'F', 1), 
  c(2,'F', 1), 
  c(3,'F', 1), 
  c(3,'F', 1), 
  c(3,'F', 1), 
  c(4,'M', 1), 
  c(4,'M', 1), 
  c(5,'M', 1), 
  c(6,'M', 0)
)

colnames(dta) <- c('id', 'sex', 'child')
dta = as.data.frame(dta)

      

Thus, the data is long format with id as a personal identifier.

My problem is that when I try to count by sex, for example, I do not get the correct count because the ids repeat.

There are 3 women and 3 men, but when I run

dta %>% 
  group_by(sex) %>% 
  summarise(n())

      

I get 8 and 4, because it counts rows, not unique ids.

I have a similar issue with the crosstab:

dta %>% 
  group_by(sex, child) %>% 
  summarise(n())

      

How do I specify the unique identifier (n_distinct) in the call?



2 answers


There are several ways to do this, here is one:

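# .keep_all = TRUE keeps the sex column; current dplyr versions drop it otherwise
dta %>% distinct(id, .keep_all = TRUE) %>%
        group_by(sex) %>%
        summarise(n())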
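
Since the question asks about n_distinct specifically, here is a minimal sketch of calling it directly inside summarise, so no de-duplication step is needed first. It assumes dplyr 1.0 or later (for the .groups argument) and rebuilds the question's example data as a regular data frame; the names n, by_sex and by_sex_child are only illustrative.

```r
library(dplyr)

# Example data from the question: one row per observation, ids repeat
dta <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6),
  sex   = c(rep("F", 8), rep("M", 4)),
  child = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0)
)

# Count unique ids per sex instead of counting rows
by_sex <- dta %>%
  group_by(sex) %>%
  summarise(n = n_distinct(id))

# Same idea for the sex-by-child crosstab
by_sex_child <- dta %>%
  group_by(sex, child) %>%
  summarise(n = n_distinct(id), .groups = "drop")

by_sex        # F: 3, M: 3
by_sex_child  # F/0: 1, F/1: 2, M/0: 1, M/1: 2
```

This keeps the counting tied to id explicitly, which may be easier to read than de-duplicating the rows beforehand.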

      

EDIT: After some discussion, let me check how fast the various methods are.

First, some larger data:

dta <- data.frame(id = rep(1:500, 30),
                  sex = rep (c("M", "F"), 750),
                  child = rep(c(1, 0, 0, 1), 375))

      

Now, run our various methods:



library(microbenchmark)

microbenchmark(
    # .keep_all = TRUE keeps sex available after distinct() in current dplyr versions
    distinctcount = dta %>% distinct(id, .keep_all = TRUE) %>% count(sex),
    uniquecount = dta %>% unique %>% count(sex),
    distinctsummarise = dta %>% distinct(id, .keep_all = TRUE) %>% group_by(sex) %>% summarise(n()),
    uniquesummarise = dta %>% unique %>% group_by(sex) %>% summarise(n()),
    distincttally = dta %>% distinct(id, .keep_all = TRUE) %>% group_by(sex) %>% tally
)

      

On my machine:

Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
     distinctcount  1.576307  1.602803  1.664385  1.630643  1.670195  2.233710   100
       uniquecount 32.391659 32.885479 33.194082 33.072485 33.244516 35.734735   100
 distinctsummarise  1.724914  1.760817  1.815123  1.792114  1.830513  2.178798   100
   uniquesummarise 32.757609 33.080933 33.490001 33.253155 33.463010 39.937194   100
     distincttally  1.618547  1.656947  1.715741  1.685554  1.731058  2.383084   100

      

We can see that unique is much slower on larger data (roughly 20 times slower here), so the fastest option is:

dta %>% distinct(id, .keep_all = TRUE) %>% count(sex)
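
Before comparing speed it is worth confirming that the methods return the same counts. The sketch below does that on the larger data, assuming a current dplyr; the object names via_distinct, via_unique and via_n_distinct are only illustrative. Note that unique de-duplicates whole rows, so it only matches the id-based approaches when sex and child are constant within each id, as they are here.

```r
library(dplyr)

# Larger data from the benchmark: 500 ids, each repeated 30 times
dta <- data.frame(id = rep(1:500, 30),
                  sex = rep(c("M", "F"), 750),
                  child = rep(c(1, 0, 0, 1), 375))

# De-duplicate on id, keeping the other columns, then count
via_distinct <- dta %>% distinct(id, .keep_all = TRUE) %>% count(sex)

# De-duplicate on entire rows, then count
via_unique <- dta %>% unique() %>% count(sex)

# Count unique ids directly within each group
via_n_distinct <- dta %>% group_by(sex) %>% summarise(n = n_distinct(id))

# All three should report 250 ids per sex
identical(via_distinct$n, via_unique$n)
identical(via_distinct$n, via_n_distinct$n)
```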

      



Base R:

aggregate(id ~ sex, dta, function(x) length(unique(x))) 

      

Output:

  sex id
1   F  3
2   M  3
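
The same formula interface should extend to the sex-by-child crosstab from the question by adding child to the right-hand side. This is a minimal sketch on the question's example data, rebuilt here as a regular data frame; the names res and wide are only illustrative.

```r
# Example data from the question: one row per observation, ids repeat
dta <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6),
  sex   = c(rep("F", 8), rep("M", 4)),
  child = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0)
)

# Number of unique ids for each sex/child combination (long layout)
res <- aggregate(id ~ sex + child, dta, function(x) length(unique(x)))
res

# Optional: reshape the counts into a sex-by-child table
wide <- xtabs(id ~ sex + child, data = res)
wide
```

For this data each sex has one id with child = 0 and two ids with child = 1.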

      

Another alternative with dplyr:

library(dplyr)
# count_() is deprecated in current dplyr; count() takes the bare column name
count(unique(dta), sex)

      



Output:

Source: local data frame [2 x 2]

  sex n
1   F 3
2   M 3

      

Using sqldf:

library(sqldf)
sqldf("SELECT sex, COUNT(DISTINCT(id)) AS n 
      FROM dta GROUP BY sex")

      

Output:

  sex n
1   F 3
2   M 3

      
