How to remove duplicate lines by group?

Question

How can I remove duplicate lines within each group, with the option to choose how many of the duplicate lines to keep?

For example, in the sample data below, for each run of consecutive 1s in V1, remove the lines whose Volume is duplicated: for df[2:5,], line 5 will be deleted; for df[9:10,], line 9 will be deleted; for df[15:17,], lines 15 and 16 will be deleted; for df[19:20,], line 19 will be deleted.

Also, is it possible to choose the number of duplicate rows to keep? For example, if I want to keep two duplicate lines, the result for df[15:17,] would be df[15:16,], where only line 17 is removed.

How do I achieve this without loops, in a vectorized way, so that the computation stays fast when dealing with millions of lines?

Sample data

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 5: 0.8573 0.8568  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
 9: 1.4850 1.3412  1  0
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
15: 1.1859 1.1748  1  0
16: 1.1859 1.1735  1  0
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
19: 1.1749 1.1731  1  0
20: 1.1749 1.1552  1  0

Expected Result

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
20: 1.1749 1.1552  1  0


vectorization r rstudio

Jimmy May 29 '17 at 3:31


1 answer

akrun · Accepted Answer · 2017-05-29T03:39:35+0000

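We can use duplicated() within each run of V1, grouping by rleid(V1). This keeps every row where V1 is 0 and, where V1 is 1, only the first occurrence of each Volume within the run: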

setDT(df1)[df1[, (!duplicated(Volume) & V1==1)|V1==0, rleid(V1)]$V1]
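To make the one-liner easier to follow, here is a minimal step-by-step sketch on the first ten rows of the question's sample data. The helper columns run and keep are names introduced here for illustration only; the logic is meant to mirror the expression above, not replace it.

```r
library(data.table)

# First ten rows of the question's sample data (Volume and V1 only)
df1 <- data.table(
  Volume = c(0.5367, 0.8645, 0.8573, 1.1457, 0.8573,
             0.5694, 1.2368, 0.9662, 1.4850, 1.4850),
  V1     = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
)

# rleid() gives every run of identical consecutive V1 values its own id
df1[, run := rleid(V1)]

# Within each run, flag the rows to keep: every row where V1 == 0,
# and the first occurrence of each Volume where V1 == 1
df1[, keep := V1 == 0 | !duplicated(Volume), by = run]

kept_rows <- which(df1$keep)
result <- df1[keep == TRUE, .(Volume, V1)]
```

On this subset, rows 5 and 10 are the ones dropped, because duplicated() marks the second and later occurrences within each run.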

If we instead need to keep the last occurrence of each duplicate and remove the earlier ones, use fromLast = TRUE:

setDT(df1)[df1[, (!duplicated(Volume, fromLast = TRUE) & V1==1)|V1==0, rleid(V1)]$V1]
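The question also asks about keeping a chosen number of duplicates, which the expressions above do not cover. One possible vectorized sketch, assuming data.table's rowid() is acceptable, numbers the occurrences of each Volume within each run of V1 and keeps those up to a limit. The names n_keep and occurrence are hypothetical, introduced here for illustration.

```r
library(data.table)

# Rows 14 to 20 of the question's sample data (Volume and V1 only)
df1 <- data.table(
  Volume = c(1.1475, 1.1859, 1.1859, 1.1859, 1.1557, 1.1749, 1.1749),
  V1     = c(0, 1, 1, 1, 0, 1, 1)
)

n_keep <- 2L  # hypothetical parameter: max rows to keep per repeated Volume

# rowid() numbers rows 1, 2, 3, ... within each (run of V1, Volume) combination
df1[, occurrence := rowid(rleid(V1), Volume)]

# Keep every V1 == 0 row, plus the first n_keep occurrences elsewhere
result <- df1[V1 == 0 | occurrence <= n_keep]
```

With n_keep set to 2, only the third 1.1859 row (line 17 of the full data) is dropped, which matches the df[15:16,] outcome described in the question. Setting n_keep to 1 keeps only the first occurrence, the same rows as the !duplicated(Volume) version.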
