Python groupby behaves strangely

from itertools import groupby

source = [ [1,2], [1,3], [2, 1] ]
gby = groupby(source, lambda x: x[0])

print 'as list'
for key, vals in list(gby):
    print 'key {}'.format(key)
    for val in vals:
        print '  val {}'.format(val)

print

print 'as iter'
gby = groupby(source, lambda x: x[0])
for key, vals in gby:
    print 'key {}'.format(key)
    for val in vals:
        print '  val {}'.format(val)

      

Results:

as list
key 1
key 2
  val [2, 1]

as iter
key 1
  val [1, 2]
  val [1, 3]
key 2
  val [2, 1]

      

What's up with list(gby)? I would expect list to be a pure function; how can it damage the internal state?



1 answer


The documentation makes a note about this:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:

groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)

      

You exhaust the groupby object (by converting it to a list) before you iterate over the group iterators it returned, so every group except the last is lost.
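
As for list being a pure function: list() simply calls next() on its argument until it is exhausted, and that consumes any iterator, not just groupby. A minimal illustration with a plain iterator (the variable names are only for this sketch):

```python
# list() builds a new list, but it does so by draining the iterator it is
# given; the iterator's position is state that gets used up along the way.
numbers = iter([1, 2, 3])

first_pass = list(numbers)   # consumes the iterator: [1, 2, 3]
second_pass = list(numbers)  # nothing left to consume: []

print(first_pass)
print(second_pass)
```

With groupby the same consumption has a second effect, because every group iterator reads from the very iterator that list() has just drained.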

The reason for this is easier to understand by looking at the pure-Python equivalent of groupby given in the documentation:

class groupby(object):
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def next(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey):  # This is the "group" iterator
        while self.currkey == tgtkey:  # self.currkey != tgtkey if you advance groupby and then try to use this object.
            yield self.currvalue
            self.currvalue = next(self.it)
            self.currkey = self.keyfunc(self.currvalue)

      



Calling next() on the groupby object advances the underlying iterable (tracked through self.currvalue) to the first item of the next key, then returns the current key (self.currkey) and a _grouper iterator. _grouper takes the current key as an argument (called tgtkey) and yields values (recalculating self.currkey as it goes) until self.currkey differs from tgtkey, which means it has returned every value that matches that key. So if you advance groupby before using a _grouper object, self.currkey no longer equals that grouper's tgtkey, and the _grouper iterator returns nothing.
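
The effect can be reproduced by hand with two calls to next(), using the data from the question. This is only a sketch of the behaviour described above; the names group1, group2, stale and fresh are chosen for the example:

```python
from itertools import groupby

source = [[1, 2], [1, 3], [2, 1]]
gby = groupby(source, key=lambda x: x[0])

key1, group1 = next(gby)   # group for key 1, not consumed yet
key2, group2 = next(gby)   # advancing skips past the remaining key-1 items

stale = list(group1)       # group1 is no longer current, so it yields nothing
fresh = list(group2)       # group2 is still the current group

print('{} {}'.format(key1, stale))
print('{} {}'.format(key2, fresh))
```

Here stale is an empty list while fresh still contains [2, 1], which matches what the question saw for key 1 after list(gby) had advanced past it.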

If for some reason you need to store the results of groupby in a list, you should do it like this:

gby_list = []
for key, vals in gby:
    gby_list.append((key, list(vals)))  # append takes one argument, so store a (key, values) tuple

      

Or:

gby_list = [(key, list(vals)) for key, vals in gby]
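
Applied to the data from the question, this gives a list that can be iterated as often as needed. A small self-contained sketch; the helper name materialize_groups is made up for this example and is not part of itertools:

```python
from itertools import groupby

def materialize_groups(iterable, key=None):
    """Return [(key, [items...]), ...], copying each group before groupby advances."""
    return [(k, list(vals)) for k, vals in groupby(iterable, key)]

source = [[1, 2], [1, 3], [2, 1]]
gby_list = materialize_groups(source, key=lambda x: x[0])

# Each group is now a plain list, so it no longer depends on the shared iterator.
for k, vals in gby_list:
    print('key {}'.format(k))
    for val in vals:
        print('  val {}'.format(val))
```

Because each group is copied inside the loop, before groupby moves on to the next key, both groups keep their values.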

      
