Removing duplicate lines in C++

I have a huge text file with the following structure:

001 002 3
001 003 4
002 001 3
002 005 2
...

      

The first and second columns are object IDs, and the last column is the frequency of the pair. In the example above, the pair of objects 001 and 002 occurs twice:

001 002 3
002 001 3

      

My question is: What is the most suitable (and efficient) way to remove duplicate rows? What programming structure should I use?



2 answers


This is how I would approach the problem:

  • On input, sort the first two columns of each line so that 001 002 and 002 001 produce the same line.
  • Store these normalized lines in a container suited to unique items, such as std::set or std::unordered_set. You have a duplicate if the insert operation fails.
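
The two steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function name dedupe_pairs is hypothetical, IDs are kept as strings, fields are whitespace-separated, and the first occurrence of each pair is the one that is kept.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the input lines with duplicate ID pairs removed.
// The two IDs are put into a canonical order, so "002 001" becomes "001 002".
std::vector<std::string> dedupe_pairs(std::istream& in) {
    std::set<std::pair<std::string, std::string>> seen;
    std::vector<std::string> kept;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string a, b, freq;
        if (!(fields >> a >> b >> freq)) continue;  // skip malformed lines
        if (b < a) std::swap(a, b);                 // normalize the pair order
        // insert() returns {iterator, bool}; the bool is false for a duplicate.
        if (seen.insert(std::make_pair(a, b)).second) {
            kept.push_back(a + " " + b + " " + freq);
        }
    }
    return kept;
}
```

Passing an std::ifstream opened on the data file and writing the returned lines to the output would complete the job. For a really huge file, std::unordered_set with a suitable hash would probably be faster than std::set, at the cost of having to supply that hash for the pair type.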


You can, of course, parse the IDs and treat them as numbers rather than strings. But that decision depends on more information about how the data will be used.



Use an unordered_map, hashing the two IDs into a key that makes their order unimportant, I suppose.

  • Read both IDs
  • Sort them
  • Combine them into a hashed key


The map value is your third column, of course.
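
A minimal sketch of this idea, under assumptions the answer does not state: IDs fit into 32 bits and are parsed as decimal numbers (so leading zeros are lost), and the names pair_key and load_pairs are made up for illustration. Here the two sorted IDs are packed into one 64-bit key, and std::unordered_map does the hashing of that key itself.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <unordered_map>
#include <utility>

// Hypothetical helper: builds an order-independent key from two 32-bit IDs
// by placing the smaller ID in the high half and the larger in the low half.
std::uint64_t pair_key(std::uint32_t a, std::uint32_t b) {
    if (b < a) std::swap(a, b);
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

// Reads "id id frequency" triples; the map value is the third column.
std::unordered_map<std::uint64_t, int> load_pairs(std::istream& in) {
    std::unordered_map<std::uint64_t, int> freq_by_pair;
    std::uint32_t a = 0, b = 0;
    int freq = 0;
    while (in >> a >> b >> freq) {
        // emplace() leaves the map unchanged if the key is already present,
        // so the first occurrence of a pair wins.
        freq_by_pair.emplace(pair_key(a, b), freq);
    }
    return freq_by_pair;
}
```

The original IDs can be recovered from a key as key >> 32 and key & 0xFFFFFFFF, which is one reason to pack them rather than mix them with a lossy hash.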

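If you need to keep the output ordered (I think you won't), a regular std::map keyed on the concatenated IDs, without hashing, is still an option. Note that std::map iterates in sorted key order, which matches the input order only if the input is already sorted that way.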
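
The regular-map variant might look roughly like this. It is a sketch with assumed details: the name load_sorted is hypothetical, IDs stay as strings, and a single space is used as the separator when concatenating them. std::map iterates in sorted key order.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: key is "smallerId largerId", value is the frequency.
std::map<std::string, int> load_sorted(std::istream& in) {
    std::map<std::string, int> freq_by_pair;
    std::string a, b;
    int freq = 0;
    while (in >> a >> b >> freq) {
        if (b < a) std::swap(a, b);               // make the order unimportant
        freq_by_pair.emplace(a + " " + b, freq);  // duplicates are ignored
    }
    return freq_by_pair;
}
```

The separator matters: without it, IDs of different lengths could concatenate to the same key.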
