Python groupby behaves strangely
from itertools import groupby

source = [ [1,2], [1,3], [2, 1] ]

gby = groupby(source, lambda x: x[0])
print 'as list'
for key, vals in list(gby):
    print 'key {}'.format(key)
    for val in vals:
        print ' val {}'.format(val)
print

print 'as iter'
gby = groupby(source, lambda x: x[0])
for key, vals in gby:
    print 'key {}'.format(key)
    for val in vals:
        print ' val {}'.format(val)
Results:

as list
key 1
key 2
 val [2, 1]

as iter
key 1
 val [1, 2]
 val [1, 3]
key 2
 val [2, 1]
What's up with list(gby)? I would expect list to be a pure function, so how can it damage the internal state?
The documentation includes a note about this:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)
You exhaust the groupby object (by turning it into a list) before you try to iterate over the group iterators it returned, so every group except the last is lost.
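As a rough illustration of the difference, here is a minimal sketch in Python 3 syntax (unlike the Python 2 code in the question). The names first_item, stale and fresh are only illustrative. It inspects just the first group, since what the final group yields after the groupby object is exhausted may differ between Python versions.

```python
from itertools import groupby

source = [[1, 2], [1, 3], [2, 1]]


def first_item(pair):
    """Key function: group rows by their first element."""
    return pair[0]


# Exhaust the groupby object first, then read the groups afterwards.
# By then the shared iterator has moved past the first group's values.
stale = [(key, list(vals)) for key, vals in list(groupby(source, first_item))]

# Copy each group into a list while it is still the current group.
fresh = [(key, list(vals)) for key, vals in groupby(source, first_item)]

print(stale[0])  # (1, [])
print(fresh)     # [(1, [[1, 2], [1, 3]]), (2, [[2, 1]])]
```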
The reason is easier to see from a pure-Python equivalent of the function:
class groupby(object):
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def next(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey): # This is the "group" iterator
        while self.currkey == tgtkey: # self.currkey != tgtkey if you advance groupby and then try to use this object.
            yield self.currvalue
            self.currvalue = next(self.it)
            self.currkey = self.keyfunc(self.currvalue)
Calling next() on the groupby object advances the internal pointer into the underlying iterable (self.currvalue) to the next key, then returns the current key (self.currkey) and a _grouper iterator. _grouper takes the current key as an argument (called tgtkey) and yields values (recomputing self.currkey as it goes) until self.currkey differs from tgtkey, which means it has yielded every value matching that key. So if you advance groupby before using a _grouper object, self.currkey no longer equals that grouper's tgtkey, and the _grouper iterator yields nothing. The last group is the exception in this implementation: once the source is exhausted, self.currkey stays at the final key, so that group's _grouper still yields the value left in self.currvalue, which is why key 2 still shows [2, 1] in the 'as list' output.
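The same effect can be seen by stepping the real itertools.groupby by hand. This is a small sketch in Python 3 syntax. The variable names are only illustrative, and it demonstrates the observable behaviour rather than the internal attributes of the class above.

```python
from itertools import groupby

source = [[1, 2], [1, 3], [2, 1]]
gby = groupby(source, key=lambda x: x[0])

# Take the first group and read one value from it.
key1, group1 = next(gby)
first = next(group1)           # [1, 2]

# Advancing groupby moves the shared iterator past the rest of group 1.
key2, group2 = next(gby)

# group1 is no longer the current group, so it has nothing left to yield.
rest_of_group1 = list(group1)  # []

# group2 is the current group and still yields its values.
group2_vals = list(group2)     # [[2, 1]]

print(first, rest_of_group1, key2, group2_vals)
```

Note that [1, 3] is never seen by either group here: it was skipped when groupby advanced to the next key.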
If for some reason you need to store the results of groupby in a list, you should do it like this:
gby_list = []
for key, vals in gby:
    gby_list.append((key, list(vals)))
Or:
gby_list = [(key, list(vals)) for key, vals in gby]