Remove unique values ​​from the list and keep only duplicates

I want to run a list of ids and return a list of all ids that have occurred more than once. This is what I created that works:

singles = list(ids)
duplicates = []
while len(singles) > 0:
    elem = singles.pop()
    if elem in singles:
        duplicates.append(elem)

      

But the list of ids is likely to be quite long, and I realistically don't want a while loop to rely on an expensive len call if I can avoid it. (I could go the inelegant route and call len one time and then just decrease it each iteration, but I'd rather avoid that if I could).

+3


source to share


4 answers


A smart way to do this is to use a data structure that makes it simple and efficient, for example Counter

:

>>> ids = [random.randrange(100) for _ in range(200)]
>>> from collections import Counter
>>> counts = Counter(ids)
>>> dupids = [id for id in ids if counts[id] > 1]

      

The construction Counter

takes O (N) time, as opposed to O (N log N) time to sort, or O (N ^ 2) to count each element from zero each time.




As a side note:

But the list of ids is likely to be quite long, and I really don't want the while loop to rely on an expensive call to len if I can avoid it.

len

not expensive. This is constant time and (at least for a list of built-in types list

), it's about as fast as a function can get into Python without doing anything at all.

The part of your code that is more expensive calls elem in singles

inside the loop - this means that for each element, you must compare it to potentially any other element that means square time.

+12


source


You can do it,



>>> ids = [1,2,3,2,3,5]
>>> set(i for i in ids if ids.count(i) > 1)
{2, 3}

      

+5


source


I guess this will run faster:

occasions = {}
for id in ids:
    try:
        occasions[id] += 1
    except KeyError:
        occasions[id] = 0
result = [id for id in ids if occasions[id] > 1]

      

+1


source


If you don't need the order in which these ids are retrieved, an efficient approach would be to sort a step (which is O (N log (N))) followed by storing the ids followed by themselves (which is O (N)) ... So this approach is generic O (N log (N)).

-1


source







All Articles