Select multiple cases by value

I have a data frame with multiple observations for each ID, for example:

Edit: updated the data frame.

df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
             V2=rnorm(9))

> df
  ID V1          V2
1  1  A  1.57707547
2  1  B -0.76022296
3  1  C -0.82693346
4  2  A  1.80888747
5  2  A -0.53173950
6  3  B -1.18705727
7  3  B  0.04325324
8  3  C -0.33361802
9  4  A -0.02358198

      

Now I want to select rows for each ID as follows:

  • If the ID has observations with "A", select those rows
  • If the ID has observations with "B", select those rows
  • If the ID has observations with both "A" and "B", select only the rows with "A"

In my example, I want to have this:

    ID V1          V2
  1  1  A  1.57707547
  2  2  A  1.80888747
  3  2  A -0.53173950
  4  3  B -1.18705727
  5  3  B  0.04325324
  6  4  A -0.02358198

      

I would also like to see a dplyr solution, if applicable.



2 answers


Here is one option with dplyr. We group by the "ID" column and filter the rows that have "A" or "B" in "V1". A second filter then checks the number of unique elements in "V1" ( n_distinct(V1) ). If it is greater than 1, we keep only the rows where the element is "A" ( n_distinct(V1)>1 & V1=='A' ); if the number of unique elements is 1, we keep all rows of the group.

 library(dplyr)
 df %>% 
    group_by(ID) %>% 
    filter(V1 %in% c('A', 'B'))%>%
    filter(n_distinct(V1)>1 & V1=='A'|n_distinct(V1)==1)
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198

      

Alternatively, we can use a modified version with a single filter. A row is selected if any of three conditions holds: the number of unique elements in 'V1' is greater than 1, none of the elements is "A" ( all(V1!='A') ), and the row is "B"; or the number of unique elements is greater than 1 and the row is "A"; or the number of unique elements is 1 and the row is "A" or "B".



 df %>% 
   group_by(ID) %>% 
   filter(n_distinct(V1)>1 & all(V1 !='A') & V1=='B'|n_distinct(V1)>1 & 
            V1=='A' |n_distinct(V1)==1 & V1 %in% c('A', 'B') )
 #   ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
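The filters above mix & and | without parentheses, so they rely on R evaluating & before |. Here is a small base R sketch of that rule; the vectors is_a, is_b and group_has_a are made-up stand-ins for the conditions, not part of the original data.

```r
# Hypothetical per-row conditions for three rows
is_a        <- c(TRUE, FALSE, FALSE)  # row has V1 == "A"
is_b        <- c(FALSE, TRUE, TRUE)   # row has V1 == "B"
group_has_a <- c(TRUE, TRUE, FALSE)   # the row's ID group contains an "A"

# & binds more tightly than |, so these two are the same expression
unparenthesized <- is_a | is_b & !group_has_a
explicit        <- is_a | (is_b & !group_has_a)

# Grouping the other way gives a different (unintended) result
wrong <- (is_a | is_b) & !group_has_a

print(identical(unparenthesized, explicit))  # TRUE
print(identical(unparenthesized, wrong))     # FALSE
```

Adding the parentheses explicitly does not change the result, but it can make a long filter condition easier to read.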

      

A more compact version (inspired by @MichaelChirico's post): we group by 'ID' and keep the rows where V1 is 'A', or where V1 is 'B' and the group has no 'A' rows.

 df %>%
     group_by(ID) %>%
     filter(V1=='A'|V1=='B'&!any(V1=='A'))
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
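For comparison, the same rule (keep "A" rows, or "B" rows when the ID has no "A") can be sketched in base R without any packages. This is only an illustration, not part of the answer above: the V2 values are fixed, rounded stand-ins for the rnorm() draws, and the names has_a, keep and result are made up for this example.

```r
# Example data from the question, with fixed V2 values for reproducibility
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                 V1 = c("A", "B", "C", "A", "A", "B", "B", "C", "A"),
                 V2 = c(1.58, -0.76, -0.83, 1.81, -0.53, -1.19, 0.04, -0.33, -0.02))

# For every row: does its ID group contain at least one "A"?
# ave() applies any() within each ID and recycles the result over the group.
has_a <- as.logical(ave(df$V1 == "A", df$ID, FUN = any))

# Keep "A" rows, or "B" rows from groups that have no "A"
keep <- df$V1 == "A" | (df$V1 == "B" & !has_a)

result <- df[keep, ]
rownames(result) <- NULL  # renumber rows 1..n
print(result)
```

The keep vector mirrors the condition inside filter(); ave() plays the role of group_by() by evaluating any() separately for each ID.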

      



With data.table, this can be done easily:

library(data.table); setDT(df)
df[df[,.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))],by=ID]$V1]

      

The inner call to df selects the indices ( .I ) where either a) V1 is equal to A, or b) V1 is equal to B and there is no A among the V1 values for that ID (i.e. it is done by ID); $V1 extracts these indices and passes them back to the outer df.



(It may be confusing that we extract V1, because V1 is also the name of a column in the original table. The V1 we extract here is different: it is the default name data.table gives to the unnamed result column. To see this, consider this alternative where we name the result variable:

df[df[,.(ind=.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))]),by=ID]$ind]

      

Here we name the index variable ind, so we need to extract ind instead of V1.)
