Python - group duplicates in a list of lists by index
I have seen many questions about removing duplicates from a list and counting them, but I'm trying to find the best way to group them into a list of lists.
In this example, I want to group by the third field:
[[1, "text", "name1", "text"],
[2, "text", "name2", "text"],
[3, "text", "name2", "text"],
[4, "text", "name1", "text"]]
I would like to receive the following:
[[[1, "text", "name1", "text"],
[4, "text", "name1", "text"]],
[[2, "text", "name2", "text"],
[3, "text", "name2", "text"]]]
I can think of a naive way of wading through the data and tracking what I've found so far (O(n^2)), but I suspect there's a better way.
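For reference, the naive approach I have in mind looks roughly like this (for each row, scan the groups found so far; the function name and structure are just my illustration):

```python
# Naive O(n^2) grouping: for each row, scan the existing groups for one
# whose third field matches; append to it, or start a new group.
def group_naive(rows):
    groups = []
    for row in rows:
        for group in groups:
            if group[0][2] == row[2]:  # compare the third field
                group.append(row)
                break
        else:  # no existing group matched
            groups.append([row])
    return groups

rows = [[1, "text", "name1", "text"],
        [2, "text", "name2", "text"],
        [3, "text", "name2", "text"],
        [4, "text", "name1", "text"]]
```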
You can sort and use groupby, but this is O(n log n):
from operator import itemgetter
from itertools import groupby

print([list(v) for _, v in groupby(sorted(l, key=itemgetter(2)), itemgetter(2))])
Or use an OrderedDict to group by the third item and solve it in O(n), using the third item as the key and appending each sublist to the value. setdefault handles duplicate keys:
from collections import OrderedDict
from pprint import pprint as pp

od = OrderedDict()
for sub in l:
    od.setdefault(sub[2], []).append(sub)

pp(list(od.values()))
[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]
If order doesn't matter, you can use a defaultdict instead of an OrderedDict; it is also the most efficient option:
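A standalone version of the defaultdict approach, using the question's example data as `l`:

```python
from collections import defaultdict

l = [[1, "text", "name1", "text"],
     [2, "text", "name2", "text"],
     [3, "text", "name2", "text"],
     [4, "text", "name1", "text"]]

groups = defaultdict(list)
for sub in l:
    groups[sub[2]].append(sub)  # missing keys get a fresh empty list

result = list(groups.values())
```

Note that on CPython 3.7+ plain dicts (and defaultdict) preserve insertion order anyway, so the ordering caveat mainly matters on older interpreters.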
In [7]: from itertools import groupby
In [8]: from collections import OrderedDict, defaultdict
In [9]: from random import choice
In [10]: l = [[1, "text", "name{}".format(choice(range(2000))), "text"] for _ in xrange(40000)]
In [13]: from operator import itemgetter
In [14]: timeit [list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))]
10 loops, best of 3: 42.5 ms per loop
In [15]: %%timeit
od = defaultdict(list)
for sub in l:
    od[sub[2]].append(sub)
....:
100 loops, best of 3: 9.42 ms per loop
In [16]: %%timeit
od = OrderedDict()
for sub in l:
    od.setdefault(sub[2], []).append(sub)
....:
10 loops, best of 3: 25.5 ms per loop
In [17]: lists = l
In [18]: %%timeit
....: groupers = set(l[2] for l in lists)
....: [filter(lambda x: x[2] == y, lists) for y in groupers]
....:
1 loops, best of 3: 8.48 s per loop
In [19]: timeit l = [filter(lambda x: x[2] == y, lists) for y in set(l[2] for l in lists)]
1 loops, best of 3: 8.29 s per loop
So if order doesn't matter, defaultdict wins. groupby still performs well, since sorting is cheap compared to the quadratic approach. As you can see, the quadratic complexity of the filter approach does not hold up as the data grows.
Here you go:
>>> lists = [[1, "text", "name1", "text"],
... [2, "text", "name2", "text"],
... [3, "text", "name2", "text"],
... [4, "text", "name1", "text"]]
>>> groupers = set(l[2] for l in lists)
>>> groupers
set(['name2', 'name1'])
>>> l = [filter(lambda x: x[2] == y, lists) for y in groupers]
>>> import pprint
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
Of course, you can write all the grouping logic in one line:
>>> l = [filter(lambda x: x[2] == y, lists) for y in set(l[2] for l in lists)]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
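One caveat if you are on Python 3: filter() returns a lazy iterator there, not a list, so the groups would print as filter objects. A list-comprehension version avoids that (sorting the keys is my addition, purely to make the output order deterministic):

```python
lists = [[1, "text", "name1", "text"],
         [2, "text", "name2", "text"],
         [3, "text", "name2", "text"],
         [4, "text", "name1", "text"]]

groupers = {row[2] for row in lists}           # unique third-field values
grouped = [[row for row in lists if row[2] == key]
           for key in sorted(groupers)]        # sorted only for stable output
```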
The easiest way to do this is with the key function argument of sorted(). In your example:
>>> a = [[1, "text", "name1", "text"],
[2, "text", "name2", "text"],
[3, "text", "name2", "text"],
[4, "text", "name1", "text"]]
>>> sorted(a, key=lambda item: item[2])
[[1, 'text', 'name1', 'text'],
[4, 'text', 'name1', 'text'],
[2, 'text', 'name2', 'text'],
[3, 'text', 'name2', 'text']]
More information about this argument can be found in the Python documentation for sorted().
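As a side note, operator.itemgetter(2) is a common, slightly faster alternative to the lambda as the key (same data as above):

```python
from operator import itemgetter

a = [[1, "text", "name1", "text"],
     [2, "text", "name2", "text"],
     [3, "text", "name2", "text"],
     [4, "text", "name1", "text"]]

# itemgetter(2) behaves like lambda item: item[2]
by_name = sorted(a, key=itemgetter(2))
```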
Use sorted() with the element you want to sort on as the key, and itertools.groupby to group them:
>>> from itertools import groupby
>>> sl = sorted(your_list, key=lambda row: row[2])
>>> [list(v) for k, v in groupby(sl, key=lambda row: row[2])]
[[[1, 'text', 'name1', 'text'],
[4, 'text', 'name1', 'text']],
[[2, 'text', 'name2', 'text'],
[3, 'text', 'name2', 'text']]]
The following function will quickly group (without sorting) subsequences of any length, using the element at the specified index as the key:
# given a sequence of sequences like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping sequences by idx-th element - with idx=1 we have:
# if merge is True {'c':(3,6,88,4), 'a':(7,2,45,0)}
# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by_idx(seqs, idx=0, merge=True):
    d = dict()
    for seq in seqs:
        seq_kind = tuple if isinstance(seq, tuple) else list
        k = seq[idx]
        rest = seq[:idx] + seq[idx+1:]
        v = d.get(k, seq_kind()) + (rest if merge else seq_kind((rest,)))
        d[k] = v
    return d
In the case of your question, the key is the item at index 2, so
group_by_idx(your_list,2,False)
gives
{'name1': [[1, 'text', 'text'], [4, 'text', 'text']],
'name2': [[2, 'text', 'text'], [3, 'text', 'text']]}
which is not exactly the result you asked for, but may also suit your needs.
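For completeness, here is the function run on the tuple data from its comment, in both modes (the compacted tuple/list detection is my cleanup; behaviour for tuple and list inputs is unchanged):

```python
def group_by_idx(seqs, idx=0, merge=True):
    d = {}
    for seq in seqs:
        seq_kind = tuple if isinstance(seq, tuple) else list
        k = seq[idx]
        rest = seq[:idx] + seq[idx + 1:]  # the sequence minus the key element
        d[k] = d.get(k, seq_kind()) + (rest if merge else seq_kind((rest,)))
    return d

data = [(3, 'c', 6), (7, 'a', 2), (88, 'c', 4), (45, 'a', 0)]
merged = group_by_idx(data, 1)          # {'c': (3, 6, 88, 4), 'a': (7, 2, 45, 0)}
unmerged = group_by_idx(data, 1, False) # {'c': ((3, 6), (88, 4)), 'a': ((7, 2), (45, 0))}
```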