Python - group duplicates in a list of lists by index

I have seen many questions about removing duplicates from a list and counting them, but I'm trying to find the best way to group them instead, as a list of lists.

In this example, I want to group by the third field:

[[1, "text", "name1", "text"],
 [2, "text", "name2", "text"],
 [3, "text", "name2", "text"],
 [4, "text", "name1", "text"]]

      

I would like to receive the following:

[[[1, "text", "name1", "text"],
  [4, "text", "name1", "text"]],
 [[2, "text", "name2", "text"],
  [3, "text", "name2", "text"]]]

      

I can think of a naive way of looping through the list and tracking what I have already found (O(n^2)), but I would appreciate a better way.

+3




5 answers


You can sort and use groupby, but this is O(n log n):

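from operator import itemgetter
from itertools import groupby

# l is the list of lists from the question; groupby only merges adjacent
# items, so the list is sorted by the same key first
print([list(v) for _, v in groupby(sorted(l, key=itemgetter(2)), itemgetter(2))])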
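
The sort step matters because groupby only merges runs of adjacent items with equal keys. The following is a minimal sketch of that behaviour on the question's data; the names rows, key, unsorted_groups and sorted_groups are illustrative and do not come from the answer.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]
key = itemgetter(2)

# Without sorting, "name1" appears in two separate runs, so it yields two groups.
unsorted_groups = [list(g) for _, g in groupby(rows, key)]

# Sorting first makes equal keys adjacent, so each key yields exactly one group.
sorted_groups = [list(g) for _, g in groupby(sorted(rows, key=key), key)]

print(len(unsorted_groups))  # 3
print(len(sorted_groups))    # 2
```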

      

Or, for an O(n) solution, group with an OrderedDict, using the third item as the key and appending each sublist to that key's list of values. setdefault handles repeated keys:

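from collections import OrderedDict
from pprint import pprint as pp

od = OrderedDict()

for sub in l:
    # setdefault creates the empty list the first time a key is seen
    od.setdefault(sub[2], []).append(sub)

# list() makes the result print as a plain list on Python 3 as well
pp(list(od.values()))
[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
 [[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]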
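
On Python 3.7 and later, a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. This is only an illustration under that version assumption; the names rows, groups and result are not from the answer.

```python
rows = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]

groups = {}
for row in rows:
    # One pass: append each row to the list stored under its third field.
    groups.setdefault(row[2], []).append(row)

# Groups come out in order of first appearance of each key (Python 3.7+).
result = list(groups.values())
print(result)
```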

      



If order doesn't matter, you can use a defaultdict instead of the OrderedDict; it is the most efficient of these options, as the following timings show:

In [6]: from random import choice

In [7]: from itertools import groupby

In [8]: from collections import OrderedDict, defaultdict                               

In [9]: l = [[1, "text", "name{}".format(choice(list(range(2000)))), "text"] for _ in xrange(40000)]

In [13]: from operator import  itemgetter

In [14]: timeit [list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))]
10 loops, best of 3: 42.5 ms per loop

In [15]: %%timeit                                                                       
od = defaultdict(list)
for sub in l:
    od[sub[2]].append(sub)
   ....: 
100 loops, best of 3: 9.42 ms per loop

In [16]: %%timeit                                                                       
od = OrderedDict()
for sub in l:
     od.setdefault(sub[2],[]).append(sub)
   ....: 
10 loops, best of 3: 25.5 ms per loop

In [17]: lists = l

In [18]: %%timeit
   ....: groupers = set(l[2] for l in lists)
   ....: [filter(lambda x: x[2] == y, lists) for y in groupers]
   ....: 

1 loops, best of 3: 8.48 s per loop

In [19]: timeit l = [filter(lambda x: x[2] == y, lists) for y in   set(l[2] for l in lists)]
1 loops, best of 3: 8.29 s per loop

      

So if order doesn't matter, defaultdict wins (9.42 ms, against 25.5 ms for OrderedDict). groupby still performs reasonably well (42.5 ms), since sorting is cheap compared to the quadratic approach. The filter approach rescans the whole list once per distinct key, so it scales badly as the data grows: it takes more than 8 seconds on the same input.
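
For reuse, the defaultdict loop could be wrapped in a small helper. This is a sketch rather than part of the answer: group_by_index and its parameters are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def group_by_index(rows, index):
    """Group rows by the value at position `index` in a single pass."""
    groups = defaultdict(list)
    for row in rows:
        # Missing keys start as an empty list, so no setdefault is needed.
        groups[row[index]].append(row)
    return list(groups.values())


rows = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]
grouped = group_by_index(rows, 2)
print(grouped)
```

Each row is visited once and each dictionary lookup is constant time on average, which is where the O(n) behaviour comes from.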

+4




Here you go:

>>> import pprint
>>> lists = [[1, "text", "name1", "text"],
...  [2, "text", "name2", "text"],
...  [3, "text", "name2", "text"],
...  [4, "text", "name1", "text"]]
>>> groupers = set(l[2] for l in lists)
>>> groupers
set(['name2', 'name1'])
>>> l = [filter(lambda x: x[2] == y, lists) for y in groupers]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
 [[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]

      



Of course, you can write all the grouping logic in one line:

>>> l = [filter(lambda x: x[2] == y, lists) for y in set(l[2] for l in lists)]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
 [[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
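
Note that the session above is Python 2. In Python 3, filter returns a lazy iterator rather than a list, so printing the result would show filter objects. A rough Python 3 equivalent, sketched with a list comprehension, could look like this (sorting the keys is an addition here, made only so the group order is deterministic):

```python
lists = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]

# Collect the distinct keys; set iteration order is arbitrary.
groupers = {row[2] for row in lists}

# One full scan of `lists` per distinct key, as in the filter version.
groups = [[row for row in lists if row[2] == name] for name in sorted(groupers)]
print(groups)
```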

      

+1




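The easiest way to do this is with the key function argument of sorted(). This places rows with the same name next to each other, although it does not split them into separate sublists. In your example: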

>>> a = [[1, "text", "name1", "text"], [2, "text", "name2", "text"], [3, "text", "name2", "text"], [4, "text", "name1", "text"]]
>>> sorted(a, key=lambda item: item[2])
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text'], [2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]

More information about this argument can be found at this link.
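
Python's sort is stable, so rows that share a key keep their original relative order after sorting. A small sketch of that property, using operator.itemgetter in place of the lambda (the names by_name and ids are illustrative):

```python
from operator import itemgetter

a = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]

# itemgetter(2) is equivalent to lambda item: item[2]
by_name = sorted(a, key=itemgetter(2))

# Within each name, the ids stay in their original order.
ids = [row[0] for row in by_name]
print(ids)  # [1, 4, 2, 3]
```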

0




Use sorted with the element you want to sort by as the key, and itertools.groupby to group them:

>>> from itertools import groupby
>>> sl = sorted(your_list, key=lambda row: row[2])
>>> [list(v) for k, v in groupby(sl, key=lambda row: row[2])]
[[[1, 'text', 'name1', 'text'], 
  [4, 'text', 'name1', 'text']], 
 [[2, 'text', 'name2', 'text'], 
  [3, 'text', 'name2', 'text']]]
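
groupby also yields the key of each group, which the list comprehension above discards. If the keys are useful, one possible variation is to collect the groups into a dict instead; by_name is an illustrative name, and your_list is filled in with the question's data.

```python
from itertools import groupby
from operator import itemgetter

your_list = [
    [1, "text", "name1", "text"],
    [2, "text", "name2", "text"],
    [3, "text", "name2", "text"],
    [4, "text", "name1", "text"],
]
key = itemgetter(2)

# Keep each group's key by building a dict instead of a list of lists.
by_name = {k: list(v) for k, v in groupby(sorted(your_list, key=key), key)}
print(by_name["name2"])
```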

      

0




The following function quickly groups (without sorting) sequences of any length, using the element at the specified index as the key:

# given a sequence of sequences like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping sequences by idx-th element - with idx=1 we have:
# if merge is True {'c':(3,6,88,4),     'a':(7,2,45,0)}
# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by_idx(seqs, idx=0, merge=True):
    d = {}
    for seq in seqs:
        # build results with the same sequence type as the input (tuple or list)
        seq_kind = tuple if isinstance(seq, tuple) else list
        k = seq[idx]
        rest = seq[:idx] + seq[idx+1:]  # the sequence without its key element
        # merge=True concatenates the remainders; merge=False keeps each one separate
        d[k] = d.get(k, seq_kind()) + (rest if merge else seq_kind((rest,)))
    return d

      

In the case of your question, the key is the item at index 2, so

group_by_idx(your_list,2,False)

      

gives

{'name1': [[1, 'text', 'text'], [4, 'text', 'text']],
 'name2': [[2, 'text', 'text'], [3, 'text', 'text']]}

      

which is not exactly the result you asked for, but may also suit your needs.

0

