Optimize dictionary creation
I have a list of IDs, ids. Each item in ids is a string, and the same id may appear multiple times in the list. My goal is to build a dictionary whose keys are occurrence counts and whose values are lists of the identifiers that occur that many times. My current approach looks like this:
from collections import defaultdict
import numpy as np

ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for id in np.unique(ids):
    counts[ids.count(id)].append(id)
Output:
print(counts)
--> defaultdict(<class 'list'>, {1: ['bar'], 2: ['foo', 'hi']})
This works well when the list of identifiers is short. However, for longer lists performance is poor: each call to ids.count scans the entire list, so the loop is quadratic overall.
How can I make it faster?
Instead of calling count for every unique item in the list, build a collections.Counter over the entire list in a single pass:
from collections import Counter, defaultdict

ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for i, c in Counter(ids).items():
    counts[c].append(i)

# counts: defaultdict(<class 'list'>, {1: ['bar'], 2: ['foo', 'hi']})
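The speedup comes from Counter tallying everything in one pass, while ids.count rescans the whole list for each unique id. A rough benchmark sketch to see the difference (the list size and id distribution here are arbitrary assumptions, and set() stands in for np.unique):

```python
from collections import Counter, defaultdict
import timeit

ids = [str(i % 1000) for i in range(10_000)]  # 10k ids, 1k distinct (arbitrary)

def with_count():
    counts = defaultdict(list)
    for item in set(ids):                     # set() instead of np.unique
        counts[ids.count(item)].append(item)  # rescans the list each time
    return counts

def with_counter():
    counts = defaultdict(list)
    for item, c in Counter(ids).items():      # one pass over the list
        counts[c].append(item)
    return counts

print(timeit.timeit(with_count, number=1))    # quadratic approach
print(timeit.timeit(with_counter, number=1))  # linear approach
```

Both functions produce the same mapping; only the time to build it differs.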
If you prefer a one-liner, you can also combine Counter.most_common (which yields items sorted by count) with itertools.groupby (but I wouldn't):
>>> {k: [v[0] for v in g] for k, g in groupby(Counter(ids).most_common(), lambda x: x[1])}
{2: ['foo', 'hi'], 1: ['bar']}
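Spelled out over a few lines, the same groupby idea is easier to follow. Note that groupby only merges consecutive equal keys; it works here only because most_common returns pairs already sorted by count:

```python
from collections import Counter
from itertools import groupby

ids = ["foo", "foo", "bar", "hi", "hi"]
pairs = Counter(ids).most_common()  # pairs sorted by count, descending
grouped = {
    count: [item for item, _ in group]
    for count, group in groupby(pairs, key=lambda pair: pair[1])
}
print(grouped)  # {2: ['foo', 'hi'], 1: ['bar']}
```

The Counter-plus-loop version above is still the clearer choice; this form mainly shows why the one-liner is correct.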