How do I find and remove duplicates in data frames?
I have a data frame containing NBA draft data:
    draft_year draft_round teamid playerid draft_from
          1961           1    Bos     Pol1        Nan
          2001           1     LA     Ben2        Cal
          1967           2    Min     Mac2        Nan
          2001           1     LA     Ben2        Cal
          2000           1      C     Sio1        Bud
          2000           1      C     Gio1        Bud
I would like to find and remove only those rows that are duplicated on playerid. The remaining duplicate values in other columns are meaningful and should be preserved.
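For anyone who wants to run the answers below, the sample data frame can be rebuilt like this (the column types are my assumption; the Nan entries are kept as plain strings, exactly as they appear in the question):

```r
# Reconstruction of the sample data from the question
df <- data.frame(
  draft_year  = c(1961, 2001, 1967, 2001, 2000, 2000),
  draft_round = c(1, 1, 2, 1, 1, 1),
  teamid      = c("Bos", "LA", "Min", "LA", "C", "C"),
  playerid    = c("Pol1", "Ben2", "Mac2", "Ben2", "Sio1", "Gio1"),
  draft_from  = c("Nan", "Cal", "Nan", "Cal", "Bud", "Bud"),
  stringsAsFactors = FALSE
)
```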
user3833190

3 answers
The unique() function in the data.table package takes a by argument:
library(data.table)
unique(setDT(df), by = "playerid")
#    draft_year draft_round teamid playerid draft_from
# 1:       1961           1    Bos     Pol1        Nan
# 2:       2001           1     LA     Ben2        Cal
# 3:       1967           2    Min     Mac2        Nan
# 4:       2000           1      C     Sio1        Bud
# 5:       2000           1      C     Gio1        Bud
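If you would rather avoid extra packages, base R's duplicated() gives the same result; a minimal sketch, using an abbreviated version of the question's data:

```r
df <- data.frame(
  playerid = c("Pol1", "Ben2", "Mac2", "Ben2", "Sio1", "Gio1"),
  teamid   = c("Bos", "LA", "Min", "LA", "C", "C"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of each playerid;
# negating it keeps the first occurrence and drops the rest
dedup <- df[!duplicated(df$playerid), ]
```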
You can also use dplyr:
library(dplyr)
distinct(df, playerid, .keep_all = TRUE)
#   draft_year draft_round teamid playerid draft_from
# 1       1961           1    Bos     Pol1        Nan
# 2       2001           1     LA     Ben2        Cal
# 3       1967           2    Min     Mac2        Nan
# 4       2000           1      C     Sio1        Bud
# 5       2000           1      C     Gio1        Bud

(.keep_all = TRUE keeps the other columns; without it, distinct() returns only the playerid column.)
or
df %>%
  group_by(playerid) %>%
  filter(row_number() == 1)
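One advantage of the group_by()/filter() form is that you can control which of the duplicate rows survives by sorting first. A sketch using a cut-down version of the question's data; keeping the most recent draft year is just an illustrative ordering choice:

```r
library(dplyr)

df <- data.frame(
  draft_year = c(1961, 2001, 1967, 2001),
  playerid   = c("Pol1", "Ben2", "Mac2", "Ben2"),
  stringsAsFactors = FALSE
)

latest <- df %>%
  arrange(desc(draft_year)) %>%   # newest draft first within each player
  group_by(playerid) %>%
  filter(row_number() == 1) %>%   # keep only the first row per playerid
  ungroup()
```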