How to remove duplicate lines by group?

How can I remove duplicate rows within each group, with the option to choose how many of the duplicates to keep?

E.g., in the example below, for each continuous run of 1 in V1, remove the rows where Volume is duplicated: for df[2:5,] row 5 will be deleted, for df[9:10,] row 9 will be deleted, for df[15:17,] rows 15 and 16 will be deleted, and for df[19:20,] row 19 will be deleted.

Also, is it possible to choose the number of duplicate rows to keep? E.g., if I want to keep two duplicate rows, the result for df[15:17,] would be df[15:16,], where only row 17 is removed.

How do I achieve this without loops, in a vectorized way, so that it runs fast on millions of rows?

Sample data

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 5: 0.8573 0.8568  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
 9: 1.4850 1.3412  1  0
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
15: 1.1859 1.1748  1  0
16: 1.1859 1.1735  1  0
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
19: 1.1749 1.1731  1  0
20: 1.1749 1.1552  1  0

      

Expected Result

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
20: 1.1749 1.1552  1  0

      



1 answer


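We can use duplicated() within each run of V1 (identified with rleid()), keeping every V1 == 0 row and only the first occurrence of each Volume in the V1 == 1 runs: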

setDT(df1)[df1[, (!duplicated(Volume) & V1==1)|V1==0, rleid(V1)]$V1]
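The one-liner above packs several steps together. As a rough sketch of what it does, here are the same steps unpacked on a small made-up table (dt, run, keep and res are illustrative names, and the values are not the question's data):

```r
library(data.table)

# Small made-up table: rows 2-4 form one run of V1 == 1 with a repeated Volume
dt <- data.table(
  Volume = c(0.5, 1.2, 1.2, 0.9, 0.5, 1.2),
  V1     = c(0,   1,   1,   1,   0,   1)
)

# Step 1: label each run of consecutive identical V1 values
dt[, run := rleid(V1)]

# Step 2: within each run, flag the rows to keep:
# every V1 == 0 row, plus the first occurrence of each Volume where V1 == 1
dt[, keep := (!duplicated(Volume) & V1 == 1) | V1 == 0, by = run]

# Step 3: subset on the flag and drop the helper columns
res <- dt[keep == TRUE, .(Volume, V1)]
print(res)
```

Only row 3 is dropped here: it repeats the Volume of row 2 inside the same run. Row 6 has the same Volume but sits in a different run, so it stays.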

      




If we need to keep the last occurrence of each duplicate instead (dropping the earlier ones), set fromLast = TRUE:

setDT(df1)[df1[, (!duplicated(Volume, fromLast = TRUE) & V1==1)|V1==0, rleid(V1)]$V1]
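The question also asks about keeping a chosen number of duplicates, which duplicated() alone does not cover. One possible sketch, assuming the first occurrences are the ones to keep: number the occurrences of each Volume within each run and filter on that count. Here n_keep, run, occ and res are names made up for illustration, and only the Volume and V1 columns of the sample data are used.

```r
library(data.table)

# Volume and V1 columns from the question's sample data
df1 <- data.table(
  Volume = c(0.5367, 0.8645, 0.8573, 1.1457, 0.8573, 0.5694, 1.2368,
             0.9662, 1.4850, 1.4850, 1.1132, 1.4535, 1.0437, 1.1475,
             1.1859, 1.1859, 1.1859, 1.1557, 1.1749, 1.1749),
  V1     = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1)
)

n_keep <- 2L  # hypothetical parameter: max copies of a Volume to keep per run

# Label each run of consecutive identical V1 values
df1[, run := rleid(V1)]

# Number the occurrences of each Volume within its run (1, 2, 3, ...)
df1[, occ := seq_len(.N), by = .(run, Volume)]

# Keep every V1 == 0 row, and at most n_keep copies in the V1 == 1 runs
res <- df1[V1 == 0 | occ <= n_keep]
print(res)
```

With n_keep <- 2L only row 17 is removed, since it is the third copy of 1.1859 in its run. Setting n_keep <- 1L keeps the first occurrence only, like the first one-liner. To keep the last n_keep copies instead, the occurrences can be numbered from the end with rev(seq_len(.N)).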

      
