Counting unique values based on two columns with repeated rows, R data frame

Question

Counting unique values based on two columns with repeated rows, R data frame

I have an R data frame with the following format:

column1    column2
NA         NA
1          A
1          A
1          A
NA         NA
NA         NA
2          B
2          B
NA         NA
NA         NA
3          A
3          A
3          A

df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, 
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B", 
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")

If a row in one column has NA

, the other column has NA

. The numerical value in column1

describes a unique group, for example. lines 2-4 have a group 1

. The column column2

describes the identity of this grouping. In this data block identifier is either A

, B

, C

or D

.

My goal is to count the number of IDs by group in the whole data frame: how many groups A exist, how many groups B, etc.

The correct output for this file is (so far) there are 2 groups A and 1 group B.

How would you calculate this?

For now, I would try something like this:

length(df[df$column2 == "B"]) ## outputs 2

but this is not true. If I were to concatenate column1

and column2

, take only the unique values 1A, 2B, 3A, I think I could count how many times each label from column2

occurs?

(If it's easier, I'm happy to use it data.table

for this task.)

+3

r count dataframe data.table

ShanZhengYang 11 Apr 17 at 19:45

source to share

3 answers

The 'dplyr' package has simple functions for this

library(dplyr)

df %>%
  filter(complete.cases(.) & !duplicated(.)) %>% 
  group_by(column2) %>%
  summarize(count = n())

Filter rows with NA
Filter out duplicate lines; they represent people in the same group.
Group by identification variable (column2)
Count the number of unique groups (column1)

+3

rsoren 11 Apr 17 at 19:58

source to share

If you want to use data.table:

library(data.table)
setDT(df)

d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
d <- d[, list(n=.N), by=list(column2)]
d
   column2 n
1:       A 2
2:       B 1

Or more succinctly as a one-liner:

setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]

+3

COLO 11 Apr 17 at 20:03

source to share

Frank · Accepted Answer · 2017-04-11T19:56:40+0000

You can use for feeds rle

and table

tabs:

table(rle(df$column2)$values)

# A B 
# 2 1

See ?rle

and for more details ?table

.

Or, if you want to take advantage of column1

(which is derived from column2

):

table(unique(df)$column2)

Counting unique values ​​based on two columns with repeated rows, R data frame

More articles:

Counting unique values based on two columns with repeated rows, R data frame