Remove unique values from the list and keep only duplicates

Question

Remove unique values from the list and keep only duplicates

I want to run a list of ids and return a list of all ids that have occurred more than once. This is what I created that works:

singles = list(ids)
duplicates = []
while len(singles) > 0:
    elem = singles.pop()
    if elem in singles:
        duplicates.append(elem)

But the list of ids is likely to be quite long, and I realistically don't want a while loop to rely on an expensive len call if I can avoid it. (I could go the inelegant route and call len one time and then just decrease it each iteration, but I'd rather avoid that if I could).

+3

python

SuperBiasedMan Apr 24 15 at 10:26

source to share

4 answers

You can do it,

>>> ids = [1,2,3,2,3,5]
>>> set(i for i in ids if ids.count(i) > 1)
{2, 3}

+5

Avinash Raj Apr 24 15 at 10:29

source to share

I guess this will run faster:

occasions = {}
for id in ids:
    try:
        occasions[id] += 1
    except KeyError:
        occasions[id] = 0
result = [id for id in ids if occasions[id] > 1]

+1

ka2m Apr 24 15 at 10:32

source to share

If you don't need the order in which these ids are retrieved, an efficient approach would be to sort a step (which is O (N log (N))) followed by storing the ids followed by themselves (which is O (N)) ... So this approach is generic O (N log (N)).

-1

Arkanosis Apr 24 15 at 10:32

source to share

abarnert · Accepted Answer · 2015-04-24T10:32:34+0000

A smart way to do this is to use a data structure that makes it simple and efficient, for example Counter

:

>>> ids = [random.randrange(100) for _ in range(200)]
>>> from collections import Counter
>>> counts = Counter(ids)
>>> dupids = [id for id in ids if counts[id] > 1]

The construction Counter

takes O (N) time, as opposed to O (N log N) time to sort, or O (N ^ 2) to count each element from zero each time.

As a side note:

But the list of ids is likely to be quite long, and I really don't want the while loop to rely on an expensive call to len if I can avoid it.

len

not expensive. This is constant time and (at least for a list of built-in types list

), it's about as fast as a function can get into Python without doing anything at all.

The part of your code that is more expensive calls elem in singles

inside the loop - this means that for each element, you must compare it to potentially any other element that means square time.

Remove unique values ​​from the list and keep only duplicates

More articles:

Remove unique values from the list and keep only duplicates