Subset based on duplicate values in a row and a conditional column in R?

Question

Subset based on duplicate values in a row and a conditional column in R?

I have a dataset of campaign contributions. I want to extract the contributors who contributed in 2008 and 2012 while maintaining the type of election. I don't want to delete the letters before the years - P and G. For example, notice how AARONSON PAUL participated in P2008, G2008, and G2012:

cand_nm party_id    contbr_nm       contb_receipt_amt   election_tp
Obama Barack    D   AARONSON PAUL       250 P2008
Obama Barack    D   AARONSON PAUL       100 G2008
Obama Barack    D   AARONSON PAUL       500 G2008
Obama Barack    D   AARONSON PAUL       500 G2008
Obama Barack    D   AARONSON PAUL       500 G2012
Obama Barack    D   AARONSON PAUL       281 G2012
Obama Barack    D   AARONSON PAUL       219 G2012

I want to extract such authors. My dataset originally consisted of two, eight and twelve, before I combined them into 1. So I know that 21.6% contributed in 2008 contributed in 2012 because:

length(unique(data$contbr_nm)) = 192022

length(unique(eight$contbr_nm)) = 123792

length(unique(twelve$contbr_nm)) = 95005

((123792+95005)-192022)/(123792) = 0.2162902

But how do I actually multiply these authors. I feel like there must be a way to do this using ddply or sqldf.

+2

r

Sara May 27 '15 at 22:43

source to share

3 answers

Sam firke · Answer 1 · 2015-05-28T02:37:07+0000

Here's an approach with dplyr and tidyr :

library(dplyr)
library(tidyr)

data %>%
  mutate(year = extract_numeric(election_tp)) %>%
  group_by(contbr_nm) %>%
  filter(n_distinct(year) == 2)

It creates a new variable containing only the year value, and then filters only for members who have two different values for this new variable.

Chris · Answer 2 · 2015-05-28T02:36:13+0000

I created a quick example showing include / exclude

library(data.table)

data <- data.table(year = c("P2008","G2008","G2012", "G2008", "G2012"), name = c("x","x","x","yy","zz"), rnd = runif(5))
# You should use data <- data.table(data) here

setkey(data,name)

data[name == unique(merge(data[substr(year,2,5) == "2008"],data[substr(year,2,5) == "2012"])[["name"]])]

Aborting this, I am joining the data table of all 2008 values with all 2012 values that return all the names that are present in each year. Then I remove duplicates and multiply my original data tables with names in this unique list.

G. Grothendieck · Answer 3 · 2015-05-28T17:25:31+0000

It would be clearer if you provided a sample output, but on condition that you want to indicate which contributors contributed in 2008 and 2012:

sqldf SQL supports like

, where %

is wildcard:

library(sqldf)
sqldf("select distinct contbr_nm 
       from (select contbr_nm from DF where election_tp like '%2008')
       join (select contbr_nm from DF where election_tp like '%2012')
       using (contbr_nm)")

giving:

      contbr_nm
1 AARONSON PAUL

grepl / intersect / subset This can also be used with intersect

, grepl

and subset

. Packages not used:

intersect(subset(DF, grepl('2008', election_tp))$contbr_nm,
          subset(DF, grepl('2008', election_tp))$contbr_nm)

Note In the above example, we used this as DF

:

DF <-
structure(list(cand_nm = c("Obama Barack", "Obama Barack", "Obama Barack", 
"Obama Barack", "Obama Barack", "Obama Barack", "Obama Barack"
), party_id = c("D", "D", "D", "D", "D", "D", "D"), contbr_nm = c("AARONSON PAUL", 
"AARONSON PAUL", "AARONSON PAUL", "AARONSON PAUL", "AARONSON PAUL", 
"AARONSON PAUL", "AARONSON PAUL"), contb_receipt_amt = c(250L, 
100L, 500L, 500L, 500L, 281L, 219L), election_tp = c("P2008", 
"G2008", "G2008", "G2008", "G2012", "G2012", "G2012")), .Names = c("cand_nm", 
"party_id", "contbr_nm", "contb_receipt_amt", "election_tp"), class = "data.frame", row.names = c(NA, 
-7L))

Subset based on duplicate values ​​in a row and a conditional column in R?

More articles:

Subset based on duplicate values in a row and a conditional column in R?