Select multiple cases by value
I have a data frame with multiple observations for each ID, for example:
Edit: updated data frame
df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
V2=rnorm(9))
> df
ID V1 V2
1 1 A 1.57707547
2 1 B -0.76022296
3 1 C -0.82693346
4 2 A 1.80888747
5 2 A -0.53173950
6 3 B -1.18705727
7 3 B 0.04325324
8 3 C -0.33361802
9 4 A -0.02358198
Now I want to select rows per ID as follows:
- If the ID has observations with "A", select those rows
- If the ID has observations with "B", select those rows
- If the ID has observations with both "A" and "B", select only the "A" rows
In my example, I want to have this:
ID V1 V2
1 1 A 1.57707547
2 2 A 1.80888747
3 2 A -0.53173950
4 3 B -1.18705727
5 3 B 0.04325324
6 4 A -0.02358198
I would also like to see a dplyr solution, if applicable.
Here is one option with dplyr. We group by 'ID', filter the rows where 'V1' is "A" or "B", then apply a second filter that checks the number of unique elements in 'V1' (n_distinct(V1)): if it is greater than 1, keep only the rows where the element is "A" (n_distinct(V1)>1 & V1=='A'); otherwise, when the number of unique elements is 1, keep the whole group.
library(dplyr)
df %>%
group_by(ID) %>%
filter(V1 %in% c('A', 'B'))%>%
filter(n_distinct(V1)>1 & V1=='A'|n_distinct(V1)==1)
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
A modified version can do it with a single filter. Within each group, we keep a row if: the number of unique elements in 'V1' is greater than 1, none of them is "A" (all(V1!='A')), and the element is "B"; or the number of unique elements is greater than 1 and the element is "A"; or the number of unique elements is 1 and the element is "A" or "B".
df %>%
group_by(ID) %>%
filter(n_distinct(V1)>1 & all(V1 !='A') & V1=='B'|n_distinct(V1)>1 &
V1=='A' |n_distinct(V1)==1 & V1 %in% c('A', 'B') )
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
Or, a little more compactly (inspired by @MichaelChirico's post): we group by 'ID' and filter rows where 'V1' is "A", or where 'V1' is "B" and the group contains no "A".
df %>%
group_by(ID) %>%
filter(V1=='A'|V1=='B'&!any(V1=='A'))
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
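A note on the compact version: it works because & has higher precedence than | in R, so the condition parses as V1=='A' | (V1=='B' & !any(V1=='A')). A minimal sketch with explicit parentheses (same example data as above; the V2 values are random):

```r
library(dplyr)

# Same example data as in the question (V2 values are arbitrary)
df <- data.frame(ID = c(1,1,1,2,2,3,3,3,4),
                 V1 = c("A","B","C","A","A","B","B","C","A"),
                 V2 = rnorm(9))

# Explicit parentheses; equivalent to the compact filter because
# `&` binds tighter than `|` in R
res <- df %>%
  group_by(ID) %>%
  filter(V1 == "A" | (V1 == "B" & !any(V1 == "A"))) %>%
  ungroup()

res$V1  # "A" "A" "A" "B" "B" "A"
```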
With data.table this can be done easily:
library(data.table); setDT(df)
df[df[,.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))],by=ID]$V1]
The inner df call selects the row indices (.I) where either (a) V1 equals "A", or (b) V1 equals "B" and there is no "A" among the V1 values for that ID (the condition is evaluated per ID via by=ID); $V1 extracts those indices and passes them back to the outer df.
(It may be confusing that we extract V1 here, because V1 is a column in the original table, but the V1 being extracted is different: it is the default name data.table gives to the computed index column. To see this, consider this alternative where we name the result column explicitly:
df[df[,.(ind=.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))]),by=ID]$ind]
Here we name the index column ind, so we extract ind instead of V1.)
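To see where that default V1 name comes from, one can run the inner expression on its own. A minimal sketch (same example data, V2 omitted for brevity; data.table auto-names the first unnamed j expression V1):

```r
library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,3,3,4),
                 V1 = c("A","B","C","A","A","B","B","C","A"))

# The inner call alone: the computed vector of row indices becomes a
# column that data.table auto-names V1 (first unnamed j expression)
idx <- dt[, .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))], by = ID]
print(idx)
#    ID V1
# 1:  1  1
# 2:  2  4
# 3:  2  5
# 4:  3  6
# 5:  3  7
# 6:  4  9
idx$V1  # the row numbers to keep: 1 4 5 6 7 9
```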