Select multiple cases by value
I have a data frame with multiple observations for each ID, for example:
Edit: updated data frame
df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
V2=rnorm(9))
> df
ID V1 V2
1 1 A 1.57707547
2 1 B -0.76022296
3 1 C -0.82693346
4 2 A 1.80888747
5 2 A -0.53173950
6 3 B -1.18705727
7 3 B 0.04325324
8 3 C -0.33361802
9 4 A -0.02358198
Now I want to select rows per ID as follows:
- If the ID has observations with "A", select those rows
- If the ID has observations with "B", select those rows
- If the ID has observations with both "A" and "B", select only the "A" rows
In my example, I want to have this:
ID V1 V2
1 1 A 1.57707547
2 2 A 1.80888747
3 2 A -0.53173950
4 3 B -1.18705727
5 3 B 0.04325324
6 4 A -0.02358198
I would also like to see a dplyr solution, if applicable.
Here is one option with dplyr. We group by 'ID', filter the rows where 'V1' is "A" or "B", then apply a second filter that checks the number of unique elements in 'V1' (n_distinct(V1)): if it is greater than 1, keep only the rows where the element is "A" (n_distinct(V1)>1 & V1=='A'); otherwise, when the number of unique elements is 1, keep the whole group.
library(dplyr)
df %>%
group_by(ID) %>%
filter(V1 %in% c('A', 'B'))%>%
filter(n_distinct(V1)>1 & V1=='A'|n_distinct(V1)==1)
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
A modified version can do it with a single filter. Within each group, we keep a row if: the number of unique elements in 'V1' is greater than 1, none of them is "A" (all(V1!='A')), and the element is "B"; or the number of unique elements is greater than 1 and the element is "A"; or the number of unique elements is 1 and the element is "A" or "B".
df %>%
group_by(ID) %>%
filter(n_distinct(V1)>1 & all(V1 !='A') & V1=='B'|n_distinct(V1)>1 &
V1=='A' |n_distinct(V1)==1 & V1 %in% c('A', 'B') )
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
Or, a little more compactly (inspired by @MichaelChirico's post): we group by 'ID' and filter rows where 'V1' is "A", or where 'V1' is "B" and the group contains no "A".
df %>%
group_by(ID) %>%
filter(V1=='A'|V1=='B'&!any(V1=='A'))
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
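A note on the compact version: it works because & has higher precedence than | in R, so the condition parses as V1=='A' | (V1=='B' & !any(V1=='A')). A minimal sketch with explicit parentheses (same example data as above; the V2 values are random):

```r
library(dplyr)

# Same example data as in the question (V2 values are arbitrary)
df <- data.frame(ID = c(1,1,1,2,2,3,3,3,4),
                 V1 = c("A","B","C","A","A","B","B","C","A"),
                 V2 = rnorm(9))

# Explicit parentheses; equivalent to the compact filter because
# `&` binds tighter than `|` in R
res <- df %>%
  group_by(ID) %>%
  filter(V1 == "A" | (V1 == "B" & !any(V1 == "A"))) %>%
  ungroup()

res$V1  # "A" "A" "A" "B" "B" "A"
```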
With data.table this can be done easily:
library(data.table); setDT(df)
df[df[,.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))],by=ID]$V1]
The inner df call selects the row indices (.I) where either (a) V1 equals "A", or (b) V1 equals "B" and there is no "A" among the V1 values for that ID (the condition is evaluated per ID via by=ID); $V1 extracts those indices and passes them back to the outer df.
(It may be confusing that we extract V1 here, because V1 is a column in the original table, but the V1 being extracted is different: it is the default name data.table gives to the computed index column. To see this, consider this alternative where we name the result column explicitly:
df[df[,.(ind=.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))]),by=ID]$ind]
Here we name the index column ind, so we extract ind instead of V1.)
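To see where that default V1 name comes from, one can run the inner expression on its own. A minimal sketch (same example data, V2 omitted for brevity; data.table auto-names the first unnamed j expression V1):

```r
library(data.table)

dt <- data.table(ID = c(1,1,1,2,2,3,3,3,4),
                 V1 = c("A","B","C","A","A","B","B","C","A"))

# The inner call alone: the computed vector of row indices becomes a
# column that data.table auto-names V1 (first unnamed j expression)
idx <- dt[, .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))], by = ID]
print(idx)
#    ID V1
# 1:  1  1
# 2:  2  4
# 3:  2  5
# 4:  3  6
# 5:  3  7
# 6:  4  9
idx$V1  # the row numbers to keep: 1 4 5 6 7 9
```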