Removing duplicates from a data frame very quickly
I have a data.frame with 50,000 rows, with some duplicates that I would like to remove.
SYMBOL INTENSITY CALL VALUE
1 DDR1 2464.3023 P 0.00016023613
2 RFC2 496.5190 P 0.0193034606
3 HSPA6 733.4763 P 0.0008046637
4 PAX8 2138.2882 P 0.0005617505
5 GUCA1A 176.3272 A 0.1896873022
6 UBA7 729.6157 P 0.0170004527
7 DDR1 2464.3023 P 0.0016023613
8 RFC2 496.5190 P 0.0593034606
9 HSPA9 733.4763 P 0.0008046637
10 PAX8 2138.2882 P 0.15617505
11 GUCA1A2 176.3272 A 0.01896873022
12 UBA8 729.6157 P 0.0170004527
I used the following one to remove duplicates. In the fourth column, I saved the one with the minimum "VALUE".
dt <- data.table(df)
WoDuplic <- dt[,.SD[which.min(VALUE)],by=list(SYMBOL)]
It serves a purpose but is very slow, it takes about 10 seconds to remove duplicates to remove duplicates from the dataframe above. Is there a way to speed up the process?
Edited: Result looks like
SYMBOL INTENSITY CALL VALUE
1 DDR1 2464.3023 P 0.00016023613
2 RFC2 496.5190 P 0.0193034606
3 HSPA6 733.4763 P 0.0008046637
4 PAX8 2138.2882 P 0.0005617505
5 GUCA1A 176.3272 A 0.1896873022
6 UBA7 729.6157 P 0.0170004527
9 HSPA9 733.4763 P 0.0008046637
11 GUCA1A2 176.3272 A 0.01896873022
12 UBA8 729.6157 P 0.0170004527
source to share
We could get the index of the rows ( .I[which.min(..)
) with the minimum "VALUE" for each "SYMBOL" and use that column ("V1") for a subset of the dataset.
library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]
Or, as @DavidArenburg mentioned, using setkey
would be more efficient (although I'm not sure why you are getting an error with the original data)
setkey(dt, VALUE)
indx <- dt[,.I[1L], by = SYMBOL]$V1
dt[indx]
source to share
You can use aggregate and merge to solve the problem. It should be very fast.
create example data.frame
set.seed(123)
df <- data.frame(gene=rep(letters[1:20],2500),INTENSITY=1:50000,value=runif(50000))
get the minimum value for each gene
mins <- aggregate(value ~ gene, data = df, FUN = min)
and concatenate additional columns
df.min <- merge(mins, df)
source to share