Removing duplicate lines in C++

Question

I have a huge text file with the following structure:

The first and second columns are object IDs, and the last column is the frequency of the pair. In the above example, the pair of objects 001 and 002 occurs twice:

001 002 3
002 001 3

My question is: What is the most suitable (and efficient) way to remove duplicate rows? What programming structure should I use?

+3

c++ algorithm

Andrej 01 dec. 14 at 12:12

2 answers

An unordered_map, hashing the two IDs into a key that makes their order unimportant, I suppose.

Read both IDs
Sort them
Combine them into a hashed key

The map value is your third column, of course.

If you need to keep the entries ordered (I think you won't), then using a regular std::map, which keeps its keys sorted, and concatenating the IDs without hashing is still an option.
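The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name collapse_pairs is made up, the IDs are assumed to contain no whitespace, the frequency is assumed to fit in an int, and the "hashed key" is simply the two sorted IDs joined into one string, which std::unordered_map then hashes on its own.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper, not from the answer: reads "idA idB frequency" lines
// and keeps one entry per unordered pair of IDs.
std::unordered_map<std::string, int> collapse_pairs(std::istream& in) {
    std::unordered_map<std::string, int> pairs;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string a, b;
        int freq = 0;
        if (!(fields >> a >> b >> freq)) continue;  // skip blank or malformed lines
        if (b < a) std::swap(a, b);                 // sort the two IDs
        assert(a <= b);                             // the key is always built smaller ID first
        // The sorted IDs form the key; std::unordered_map hashes the string itself.
        // emplace leaves an existing entry untouched, so the first frequency seen wins.
        pairs.emplace(a + ' ' + b, freq);
    }
    return pairs;
}
```

Passing a std::ifstream opened on the data file should work the same way as any other std::istream. Replacing std::unordered_map with std::map in this sketch would give the ordered variant, at the cost of logarithmic rather than average constant-time insertion.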

+3

Bartek Banachewicz 01 dec. 14 at 12:15

rubenvb · Accepted Answer · 2014-12-01T12:17:00+0000

This is how I would approach the problem:

On input, sort the first two columns of each line so that 001 002 and 002 001 produce the same line.
Save all of these modified lines in a container suited to unique items, such as std::set or std::unordered_set. You have a duplicate if the insert operation fails.
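A minimal sketch of these two steps, assuming whitespace-separated columns and using std::unordered_set; the function name unique_lines and the choice to return the surviving lines in a std::vector are illustrative, not part of the answer. Because the whole modified line is stored, two lines with the same pair but different frequencies would both be kept.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper: returns the input lines with the two ID columns sorted
// and every repeated line dropped, in order of first appearance.
std::vector<std::string> unique_lines(std::istream& in) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> kept;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string a, b, freq;
        if (!(fields >> a >> b >> freq)) continue;  // skip blank or malformed lines
        if (b < a) std::swap(a, b);                 // "002 001" becomes "001 002"
        assert(a <= b);                             // columns are now in sorted order
        std::string normalized = a + ' ' + b + ' ' + freq;
        // insert() returns {iterator, bool}; the bool is false for a duplicate.
        if (seen.insert(normalized).second) kept.push_back(std::move(normalized));
    }
    return kept;
}
```

For a file too large to hold the result in memory twice, one variation would be to write each line to the output stream at the point where insert succeeds instead of collecting it in a vector.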

You can, of course, parse the input strings before processing and treat the IDs as numbers. But that decision can only be made once more is known about how the data will be used.
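If the IDs were treated as numbers, one possible key is a single 64-bit integer built from the two sorted IDs. This sketch assumes each ID fits in 32 bits and that leading zeros such as in 001 carry no meaning once parsed; neither is stated in the question, and pair_key is a made-up name.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: packs two 32-bit numeric IDs into one 64-bit key that
// is identical for (a, b) and (b, a).
std::uint64_t pair_key(std::uint32_t a, std::uint32_t b) {
    if (b < a) std::swap(a, b);                        // smaller ID goes first
    assert(a <= b);
    return (static_cast<std::uint64_t>(a) << 32) | b;  // high half: a, low half: b
}
```

Such a key could then be stored in a std::unordered_set<std::uint64_t>, or used as the key of a map to the frequency, which avoids keeping a string per line.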
