Removing duplicate lines in C++
I have a huge text file with the following structure:
001 002 3
001 003 4
002 001 3
002 005 2
...
The first and second columns indicate the object IDs, and the last column indicates the frequency of the pair. In the above example, the pair of objects 001 and 002 occurs twice:
001 002 3
002 001 3
My question is: What is the most suitable (and efficient) way to remove duplicate rows? What programming structure should I use?
This is how I would approach the problem:
- On input, sort the first two fields of each line so that 001 002 and 002 001 produce the same line.
- Store these normalized lines in a container suited to unique items, such as std::set or std::unordered_set. A line is a duplicate if the insert operation fails.
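The two steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name unique_pairs is hypothetical, the IDs are kept as strings (so 001 and 1 would count as different IDs), and the first occurrence of each pair is the one that is kept.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the lines of `in` whose unordered ID pair
// has not been seen before, in their original order.
// Assumption: each line has the form "<id> <id> <frequency>".
std::vector<std::string> unique_pairs(std::istream& in) {
    std::set<std::pair<std::string, std::string>> seen;
    std::vector<std::string> kept;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string a, b;
        if (!(fields >> a >> b)) {
            continue;  // skip blank or malformed lines
        }
        if (b < a) {
            std::swap(a, b);  // normalize: "002 001" becomes ("001", "002")
        }
        // insert() reports through .second whether the pair was new;
        // a failed insert means the line is a duplicate.
        if (seen.insert(std::make_pair(a, b)).second) {
            kept.push_back(line);
        }
    }
    return kept;
}
```

To process a file, pass an std::ifstream opened on it and write the returned lines back out. For a very large input, std::unordered_set with a custom hash for the pair may be faster than std::set, at the cost of a little more code.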
You can, of course, parse the IDs before processing and treat them as numbers, but that decision can only be made once more is known about how the data will be used.
An unordered_map, hashing the numbers into something that makes the order unimportant, I suppose.
- Read both IDs
- Sort them
- Combine them into a hashed key
The map value is your third column, of course.
If you need to keep the original order (I think you won't), then using a regular map and concatenating the IDs without hashing is still an option.
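A minimal sketch of the unordered_map approach might look like this. It rests on assumptions that are not in the question: the IDs are non-negative integers that fit in 32 bits, and the names pair_key and load_pairs are hypothetical. Rather than hashing by hand, it packs the two sorted IDs into one 64-bit key and leaves the hashing to std::unordered_map.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: packs two 32-bit IDs into one 64-bit key with the
// smaller ID in the high bits, so (a, b) and (b, a) give the same key.
std::uint64_t pair_key(std::uint32_t a, std::uint32_t b) {
    if (b < a) {
        std::swap(a, b);
    }
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

// Reads "<id> <id> <frequency>" lines and maps each unordered pair to the
// frequency found on its first occurrence.
std::unordered_map<std::uint64_t, int> load_pairs(std::istream& in) {
    std::unordered_map<std::uint64_t, int> freq;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::uint32_t a = 0, b = 0;
        int count = 0;
        if (!(fields >> a >> b >> count)) {
            continue;  // skip blank or malformed lines
        }
        // emplace() leaves the map unchanged if the key already exists,
        // so later duplicates of a pair are dropped.
        freq.emplace(pair_key(a, b), count);
    }
    return freq;
}
```

Parsing the IDs as numbers means 001 is read as 1, so any leading zeros would have to be restored when the result is written back out.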