Optimize dictionary creation

I have a list of IDs, ids. Each item in ids is a string, and the same id may appear multiple times in the list.

My goal is to create a dictionary whose keys are occurrence counts and whose values are lists of the identifiers that occur that many times. My current approach looks like this:

from collections import defaultdict
import numpy as np
ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for id in np.unique(ids):
    counts[ids.count(id)].append(id)

Output:

print counts
--> defaultdict(<type 'list'>, {1: ['bar'], 2: ['foo', 'hi']})


This works well if the list of identifiers is not too long. However, for longer lists, the performance is pretty poor.

How can I make it faster?

1 answer


Instead of calling count for every item in the list, build a collections.Counter for the entire list:

from collections import Counter, defaultdict

ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for i, c in Counter(ids).items():
    counts[c].append(i)
# counts: defaultdict(<class 'list'>, {1: ['bar'], 2: ['foo', 'hi']})

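For reference, here is the same idea as a self-contained function (the name group_ids_by_count is my own). Counter makes a single pass over the list, so this is linear in len(ids), whereas calling list.count inside a loop over the unique ids is quadratic:

```python
from collections import Counter, defaultdict

def group_ids_by_count(ids):
    # One O(n) pass builds the id -> occurrence-count mapping...
    tally = Counter(ids)
    # ...then one pass over the unique ids inverts it into
    # count -> list of ids seen that many times.
    counts = defaultdict(list)
    for identifier, n in tally.items():
        counts[n].append(identifier)
    return dict(counts)

print(group_ids_by_count(["foo", "foo", "bar", "hi", "hi"]))
# {2: ['foo', 'hi'], 1: ['bar']}
```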
If you prefer a one-liner, you can also combine Counter.most_common (which returns items sorted by count) with itertools.groupby (but I wouldn't):

>>> from collections import Counter
>>> from itertools import groupby
>>> ids = ["foo", "foo", "bar", "hi", "hi"]
>>> {k: [v[0] for v in g] for k, g in groupby(Counter(ids).most_common(), lambda x: x[1])}
{1: ['bar'], 2: ['foo', 'hi']}

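A note on why the one-liner works: itertools.groupby only merges consecutive items with equal keys, and Counter.most_common happens to return its items already sorted by count, which is exactly what makes the grouping correct. A minimal sketch of the same expression, spelled out (variable names are mine):

```python
from collections import Counter
from itertools import groupby

ids = ["foo", "foo", "bar", "hi", "hi"]
pairs = Counter(ids).most_common()  # [('foo', 2), ('hi', 2), ('bar', 1)]
# groupby only groups adjacent equal keys, so the count-sorted
# order produced by most_common() is essential here.
grouped = {count: [item for item, _ in grp]
           for count, grp in groupby(pairs, key=lambda p: p[1])}
print(grouped)  # {2: ['foo', 'hi'], 1: ['bar']}
```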