How to remove duplicate lines by group?
How can I remove duplicate rows by group, with the option to choose how many of the duplicate rows to keep?
For example (see the sample data below): within each run of consecutive 1s in V1, remove the rows whose Volume is duplicated. For df[2:5,], row 5 will be deleted; for df[9:10,], row 9 will be deleted; for df[15:17,], rows 15 and 16 will be deleted; and for df[19:20,], row 19 will be deleted.
Also, is it possible to choose the number of duplicate rows to keep? For example, if I want to keep two duplicate rows, the result for df[15:17,] would be df[15:16,], where only row 17 is removed.
How do I achieve this without loops, in a vectorized way, so that the computation stays fast when dealing with millions of rows?
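To make the loop-free idea concrete, here is a minimal base R sketch for the simplest case (keep a single row per duplicated Volume). It assumes that "duplicate" means exactly equal Volume values within one run of consecutive 1s in V1, and that the last occurrence is the one to keep. The vectors v1 and volume are hypothetical toy values, not the data from this question.

```r
# Toy vectors (hypothetical values, not the question's data).
v1     <- c(0, 1, 1, 1, 0, 1, 1)
volume <- c(5, 7, 7, 8, 7, 9, 9)

# Run id: increments each time v1 changes, so every stretch of
# consecutive equal v1 values gets its own group number.
run <- cumsum(c(TRUE, diff(v1) != 0))

# TRUE for every row that is followed by a later row with the same
# (run, volume) pair, i.e. every occurrence except the last one.
dup <- duplicated(data.frame(run, volume), fromLast = TRUE)

# Drop duplicates only inside runs where v1 == 1.
keep <- !(v1 == 1 & dup)
which(keep)
# 1 3 4 5 7
```

Everything here operates on whole vectors, so there is no explicit loop over rows or groups. Setting fromLast = FALSE would keep the first occurrence instead of the last.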
Sample data
Volume Weight V1 V2
1: 0.5367 0.5367 0 1
2: 0.8645 0.8508 1 0
3: 0.8573 0.8585 1 0
4: 1.1457 1.1413 1 0
5: 0.8573 0.8568 1 0
6: 0.5694 0.5633 0 1
7: 1.2368 1.2343 1 0
8: 0.9662 0.9593 0 1
9: 1.4850 1.3412 1 0
10: 1.4850 1.3995 1 0
11: 1.1132 1.1069 0 1
12: 1.4535 1.3923 1 0
13: 1.0437 1.0344 0 1
14: 1.1475 1.1447 0 1
15: 1.1859 1.1748 1 0
16: 1.1859 1.1735 1 0
17: 1.1859 1.1731 1 0
18: 1.1557 1.1552 0 1
19: 1.1749 1.1731 1 0
20: 1.1749 1.1552 1 0
Expected Result
Volume Weight V1 V2
1: 0.5367 0.5367 0 1
2: 0.8645 0.8508 1 0
3: 0.8573 0.8585 1 0
4: 1.1457 1.1413 1 0
6: 0.5694 0.5633 0 1
7: 1.2368 1.2343 1 0
8: 0.9662 0.9593 0 1
10: 1.4850 1.3995 1 0
11: 1.1132 1.1069 0 1
12: 1.4535 1.3923 1 0
13: 1.0437 1.0344 0 1
14: 1.1475 1.1447 0 1
17: 1.1859 1.1731 1 0
18: 1.1557 1.1552 0 1
20: 1.1749 1.1552 1 0
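For the variant where the number of kept rows is configurable, one possible sketch in base R is shown here. The function name keep_n_per_run and its arguments n and from_last are hypothetical, introduced only for illustration. It assumes duplicates are exactly equal Volume values within one run of consecutive 1s in V1. Note that the examples in the question mix both directions: rows 9:10, 15:17 and 19:20 keep the last occurrence, while rows 2:5 and the "keep two" example keep the first, so the direction is left as a switch.

```r
# Sample data copied from the question.
df <- data.frame(
  Volume = c(0.5367, 0.8645, 0.8573, 1.1457, 0.8573, 0.5694, 1.2368,
             0.9662, 1.4850, 1.4850, 1.1132, 1.4535, 1.0437, 1.1475,
             1.1859, 1.1859, 1.1859, 1.1557, 1.1749, 1.1749),
  Weight = c(0.5367, 0.8508, 0.8585, 1.1413, 0.8568, 0.5633, 1.2343,
             0.9593, 1.3412, 1.3995, 1.1069, 1.3923, 1.0344, 1.1447,
             1.1748, 1.1735, 1.1731, 1.1552, 1.1731, 1.1552),
  V1 = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1),
  V2 = c(1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0)
)

# Hypothetical helper: keep at most n rows per duplicated Volume inside
# each run of consecutive V1 == 1; rows with V1 != 1 are always kept.
keep_n_per_run <- function(df, n = 1, from_last = TRUE) {
  # Run id: increments whenever V1 changes.
  run <- cumsum(c(TRUE, diff(df$V1) != 0))
  # One grouping key per (run, Volume) pair.
  key <- paste(run, df$Volume)
  # Position of each row inside its group, counted from the end
  # (from_last = TRUE) or from the start (from_last = FALSE).
  pos <- ave(seq_len(nrow(df)), key, FUN = function(i) {
    if (from_last) rev(seq_along(i)) else seq_along(i)
  })
  df[df$V1 != 1 | pos <= n, , drop = FALSE]
}

# Keep only the last occurrence: drops rows 3, 9, 15, 16 and 19.
res_last <- keep_n_per_run(df, n = 1, from_last = TRUE)

# Keep the first two occurrences: drops only row 17.
res_two <- keep_n_per_run(df, n = 2, from_last = FALSE)
```

ave() still applies a small function per group internally, so this is loop-free at the user level rather than fully vectorized; on millions of rows a grouped operation in data.table may well be faster, but that is not measured here.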