Removing duplicates from a data frame very quickly

Question

Removing duplicates from a data frame very quickly

I have a data.frame with 50,000 rows, with some duplicates that I would like to remove.

    SYMBOL          INTENSITY CALL          VALUE
1     DDR1          2464.3023    P  0.00016023613
2     RFC2           496.5190    P   0.0193034606
3    HSPA6           733.4763    P   0.0008046637
4     PAX8          2138.2882    P   0.0005617505
5   GUCA1A           176.3272    A   0.1896873022
6     UBA7           729.6157    P   0.0170004527
7     DDR1          2464.3023    P   0.0016023613
8     RFC2           496.5190    P   0.0593034606
9    HSPA9           733.4763    P   0.0008046637
10    PAX8          2138.2882    P     0.15617505
11 GUCA1A2           176.3272    A  0.01896873022
12    UBA8           729.6157    P   0.0170004527

I used the following one to remove duplicates. In the fourth column, I saved the one with the minimum "VALUE".

dt <- data.table(df)   
WoDuplic <- dt[,.SD[which.min(VALUE)],by=list(SYMBOL)]

It serves a purpose but is very slow, it takes about 10 seconds to remove duplicates to remove duplicates from the dataframe above. Is there a way to speed up the process?

Edited: Result looks like

SYMBOL          INTENSITY CALL          VALUE
1     DDR1          2464.3023    P  0.00016023613
2     RFC2           496.5190    P   0.0193034606
3    HSPA6           733.4763    P   0.0008046637
4     PAX8          2138.2882    P   0.0005617505
5   GUCA1A           176.3272    A   0.1896873022
6     UBA7           729.6157    P   0.0170004527


9    HSPA9           733.4763    P   0.0008046637

11 GUCA1A2           176.3272    A  0.01896873022
12    UBA8           729.6157    P   0.0170004527

+3

r duplicates

Aaghaz hussain May 07 '15 at 6:42

source to share

2 answers

You can use aggregate and merge to solve the problem. It should be very fast.

create example data.frame

set.seed(123)
df <- data.frame(gene=rep(letters[1:20],2500),INTENSITY=1:50000,value=runif(50000))

get the minimum value for each gene

mins <- aggregate(value ~ gene, data = df, FUN = min)

and concatenate additional columns

df.min <- merge(mins, df)

0

Jimbou May 07 '15 at 7:48

source to share

akrun · Accepted Answer · 2015-05-07T08:12:14+0000

We could get the index of the rows ( .I[which.min(..)

) with the minimum "VALUE" for each "SYMBOL" and use that column ("V1") for a subset of the dataset.

library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]

Or, as @DavidArenburg mentioned, using setkey

would be more efficient (although I'm not sure why you are getting an error with the original data)

 setkey(dt, VALUE) 
 indx <- dt[,.I[1L], by = SYMBOL]$V1 
 dt[indx]

Removing duplicates from a data frame very quickly

More articles: