Conditionally remove rows that match column A based on column B in R using data.table

I'm trying to solve a deduplication problem using data.table in R.

Column A is a list of names, some of which appear multiple times. Column B is a list of dates. There are a few more columns I want to copy, too (things that happened with the name on the date.)

However, in the new data.table I only want one record per name: the one with the most recent date for that name.

Sample data

    name.last       date
 1:     Adams 2014-10-20
 2:     Adams 2014-07-07
 3:   Barnett 2014-11-06
 4:   Barnett 2014-09-22
 5:      Bell 2014-10-22
 6:      Bell 2014-07-29
 7:     Burns 2014-09-08
 8:     Burns 2014-09-03
 9:   Camacho 2014-08-12
10:   Camacho 2014-07-08
11:  Casillas 2014-10-07
12:  Casillas 2014-07-17
13:    Chavez 2014-09-23
14:    Chavez 2014-09-17
15:   Chavira 2014-07-15
16:   Chavira 2014-07-07
17:    Claren 2014-10-30
18:    Claren 2014-10-23
19:  Colleary 2014-11-11
20:  Colleary 2014-11-07

      

The result should contain only the first row for each name (here, the rows are sorted so that the most recent date for each name comes first). However, if I set the key with setkey(dt, name.last) in order to use unique() to remove duplicates, the table is reordered in key order (alphabetically by name). Calling unique(dt) then returns the first occurrence of each name, which is not necessarily the most recent date.

If I set the key on both columns with setkeyv(dt, c("name.last", "date")), I cannot remove duplicates with unique(), since all key combinations are unique.

The problem is similar to one posted here: Dropping a data frame by selecting one row for each group. However, I cannot assume that the row I want will be the first or the last in its group, unless you can suggest a way to arrange my data so that it is once the key has been set.


3 answers


There are several ways to do this without reordering the data table (although reordering is preferable, because duplicated is then very efficient and you also avoid using by; I'll get to that).

First of all, make sure that date is of class Date, which makes things easier:

dt[, date := as.Date(date)]

      

First, a simple method (although not the most efficient). Note that it returns only the name and its latest date, not any other columns:

dt[, max(date), name.last]
#     name.last         V1
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11
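If the other columns are needed and the table should not be reordered, one option is to look up the row number of the latest date within each group and subset with it. This is only a minimal sketch: the four-row table and the activity column are invented here to stand in for the "few more columns" mentioned in the question.

```r
library(data.table)

# Hypothetical sample: two names from the question plus an invented
# `activity` column standing in for the extra columns.
dt <- data.table(
  name.last = c("Adams", "Adams", "Barnett", "Barnett"),
  date      = as.Date(c("2014-07-07", "2014-10-20", "2014-11-06", "2014-09-22")),
  activity  = c("call", "visit", "email", "call")
)

# .I holds the row numbers of dt. Within each group, take the row number
# of the latest date, then use those row numbers to subset the full table.
idx <- dt[, .I[which.max(date)], by = name.last]$V1
latest <- dt[idx]
print(latest)
```

The original dt keeps its row order, and latest carries every column of the selected rows.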

      

The second (suggested) method is similar to yours, but uses data.table's setorder (for data.table version >= 1.9.4) and should be the most efficient:

setorder(dt, name.last, -date)[!duplicated(name.last)]
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

      



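You can achieve the same with setkey (as you have already done) by specifying fromLast = TRUE in duplicated, so that the last row of each name (the most recent date, since the key sorts in ascending order) is the one that is kept:

setkey(dt, name.last, date)[!duplicated(name.last, fromLast = TRUE)]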

#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11
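To see what base R's fromLast argument changes, it may help to try duplicated on a plain vector first. The vector below is made up for illustration and has three rows for one name, which is where the difference shows.

```r
# Made-up vector of names, already grouped as it would be after sorting.
x <- c("Adams", "Adams", "Adams", "Bell", "Bell")

# Default: every occurrence after the first is marked as a duplicate.
dup_first <- duplicated(x)

# fromLast = TRUE: the scan starts at the end, so only the last
# occurrence of each value is left unmarked.
dup_last <- duplicated(x, fromLast = TRUE)

# Positions that survive when the last row per name is wanted.
keep_last <- which(!dup_last)
print(keep_last)
```

With the data sorted ascending by date within each name, those surviving positions are the most recent rows.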

      

The third method uses data.table's unique method (which should also be very efficient):

unique(setorder(dt, name.last, -date), by = "name.last")
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11
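A small sketch of why passing by explicitly is worth the habit. As far as I know, from data.table 1.9.8 onwards unique() compares all columns by default instead of the key columns, so relying on the key as in the question may no longer drop anything. The four-row table is made up for illustration.

```r
library(data.table)

# Made-up, deliberately unsorted sample.
dt <- data.table(
  name.last = c("Bell", "Bell", "Adams", "Adams"),
  date      = as.Date(c("2014-07-29", "2014-10-22", "2014-10-20", "2014-07-07"))
)

# Sort by name, then by date descending, so the latest row leads each group.
setorder(dt, name.last, -date)

# Keep the first row per name; `by` names the columns that define a duplicate.
latest <- unique(dt, by = "name.last")
print(latest)
```

Naming the column in by makes the result independent of whether a key is set.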

      

The last method uses .SD. This is the least efficient, but it is useful in cases where you want all of the columns back and cannot use functions such as duplicated:

setorder(dt, name.last, -date)[, .SD[1], name.last]
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

      



If I understand your question, I think you can do this more cleanly with the sqldf package; the downside is that you have to know SQL.

# install.packages("sqldf")  # run once if sqldf is not installed yet
library("sqldf")
dt <-data.frame(read.table(header = TRUE, text = " name.last       date
1:     Adams 2014-10-20
2:     Adams 2014-07-07
3:   Barnett 2014-11-06
4:   Barnett 2014-09-22
5:      Bell 2014-10-22
6:      Bell 2014-07-29
7:     Burns 2014-09-08
8:     Burns 2014-09-03
9:   Camacho 2014-08-12
10:   Camacho 2014-07-08
11:  Casillas 2014-10-07
12:  Casillas 2014-07-17
13:    Chavez 2014-09-23
14:    Chavez 2014-09-17
15:   Chavira 2014-07-15
16:   Chavira 2014-07-07
17:    Claren 2014-10-30
18:    Claren 2014-10-23
19:  Colleary 2014-11-11
20:  Colleary 2014-11-07")
)
head(dt)
# Rename the columns for use in the query
colnames(dt) <- c('names', 'date')
# One row per name, with its earliest and latest date
sqldf("select names, min(date), max(date) from dt group by names")

      



Hope this was helpful.



While writing this up, I figured it out. For posterity:

Order the table by name and date, so that you can rely on the date you want being first or last in each group. For example: dt <- dt[order(names, -date)] (the result has to be assigned, because this form returns a reordered copy and leaves dt itself unchanged).

Then, instead of setting the key and using unique(), just:

dt[!duplicated(names)]

where names is the column containing the duplicates.

This should output the desired table. If there are more elegant or reliable ways to do this, I'd be interested to hear them.
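For completeness, a minimal sketch of those two steps on a made-up, unsorted table. It uses the question's column name name.last in place of names, and keeps the sorted result in a separate variable so it is clear the original table is not touched.

```r
library(data.table)

# Made-up, unsorted sample with three rows for one of the names.
dt <- data.table(
  name.last = c("Burns", "Adams", "Burns", "Adams", "Adams"),
  date      = as.Date(c("2014-09-03", "2014-07-07", "2014-09-08",
                        "2014-10-20", "2014-08-01"))
)

# Step 1: reordered copy, newest date first within each name.
sorted <- dt[order(name.last, -date)]

# Step 2: keep the first row of each name, which is now the most recent.
latest <- sorted[!duplicated(name.last)]
print(latest)
```

Because dt[order(...)] returns a new table, dt is still in its original order afterwards; setorder would be the in-place alternative.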
