Counting unique values based on two columns with repeated rows, R data frame
I have an R data frame with the following format:
column1 column2
NA NA
1 A
1 A
1 A
NA NA
NA NA
2 B
2 B
NA NA
NA NA
3 A
3 A
3 A
df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA,
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B",
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")
If a row in one column has NA
, the other column has NA
. The numerical value in column1
describes a unique group, for example. lines 2-4 have a group 1
. The column column2
describes the identity of this grouping. In this data block identifier is either A
, B
, C
or D
.
My goal is to count the number of IDs by group in the whole data frame: how many groups A exist, how many groups B, etc.
The correct output for this file is (so far) there are 2 groups A and 1 group B.
How would you calculate this?
For now, I would try something like this:
length(df[df$column2 == "B"]) ## outputs 2
but this is not true. If I were to concatenate column1
and column2
, take only the unique values 1A, 2B, 3A, I think I could count how many times each label from column2
occurs?
(If it's easier, I'm happy to use it data.table
for this task.)
source to share
The 'dplyr' package has simple functions for this
library(dplyr)
df %>%
filter(complete.cases(.) & !duplicated(.)) %>%
group_by(column2) %>%
summarize(count = n())
- Filter rows with NA
- Filter out duplicate lines; they represent people in the same group.
- Group by identification variable (column2)
- Count the number of unique groups (column1)
source to share
If you want to use data.table:
library(data.table)
setDT(df)
d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
d <- d[, list(n=.N), by=list(column2)]
d
column2 n
1: A 2
2: B 1
Or more succinctly as a one-liner:
setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]
source to share