Efficient algorithm for expanding grouped tabular data

I am looking for an optimized tool in Python to accomplish an array manipulation task that I find myself performing over and over again. If the tool already exists, for example in NumPy or pandas, I would rather use it than continue relying on my own Cythonized for loop.

I have two arrays of the same length, A and B, that store data about grouped records. The i-th entry of array A gives some property of group i; the j-th entry of array B tells me how many members group j has. A stores floats, B stores ints. For definiteness: if A[5] = 100.4 and B[5] = 7, then group 5 has a mass of 100.4 and contains 7 members.

My goal is to create a new float array, C, of length B.sum(), that expands the dataset above: C[0:B[0]] = A[0], C[B[0]:B[0]+B[1]] = A[1], and so on. Is there an optimized solution for this in an existing library such as pandas?

My existing solution is to initialize an empty array C and then run a for loop over the elements of A, assigning into the corresponding slices of C as above. I write and compile the for loop in Cython for speed. But this particular operation is the biggest bottleneck in my code, and it seems like a terribly common manipulation when working with tabular data, so I'm wondering whether a highly optimized algorithm for it already exists.
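For reference, here is a plain-Python sketch of the loop I am describing (the real version is Cythonized; the values are illustrative):

import numpy as np

A = np.array([100.4, 98.3, 88.5])   # per-group property
B = np.array([7, 3, 10])            # per-group member counts

C = np.empty(B.sum())               # preallocate the expanded array
first = 0
for i in range(len(A)):
    C[first:first + B[i]] = A[i]    # fill group i's slice with its property
    first += B[i]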

+3


4 answers


NumPy has repeat() for exactly this kind of thing.

For two arrays

import numpy as np

A = np.array([100.4, 98.3, 88.5])
B = np.array([7, 3, 10])
np.repeat(A, B)

will give you:

array([ 100.4,  100.4,  100.4,  100.4,  100.4,  100.4,  100.4,   98.3,
         98.3,   98.3,   88.5,   88.5,   88.5,   88.5,   88.5,   88.5,
         88.5,   88.5,   88.5,   88.5])
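Since the question also asks about pandas: the same expansion exists there as repeat on a Series (or on an index, for whole DataFrames). A minimal sketch, not part of the original answer, assuming the data lives in a Series:

import pandas as pd

A = pd.Series([100.4, 98.3, 88.5])
B = [7, 3, 10]
expanded = A.repeat(B).reset_index(drop=True)   # Series of length sum(B)
# for a whole DataFrame: df.loc[df.index.repeat(B)].reset_index(drop=True)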

+5



In [58]: A = [100.4, 50.0]

In [59]: B = [7, 5]

In [60]: [A[i] for i in range(len(B)) for _ in range(B[i])]
Out[60]: [100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 50.0, 50.0, 50.0, 50.0, 50.0]
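The same idea reads a little more cleanly with zip, which avoids the index arithmetic (an equivalent variant, not part of the original answer):

In [61]: [a for a, b in zip(A, B) for _ in range(b)]
Out[61]: [100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 50.0, 50.0, 50.0, 50.0, 50.0]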

+2



One possible way to do this is to build an iterator with functions from itertools:

>>> A = np.array([100.4,98.3,88.5])
>>> B = np.array([7,3,10])
>>>
>>> from itertools import chain, izip, repeat
>>> res = chain(*(repeat(*x) for x in izip(A,B)))
>>> list(res)
[100.4, 100.4, 100.4, 100.4, 100.4, 100.4, 100.4,
 98.3, 98.3, 98.3,
 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5, 88.5]

Update

>>> A1 = ['A', 3, [1,2]]
>>> A2 = [len, lambda x: x * 3, sum]
>>> B = [2, 3, 4]
>>>
>>> c = chain(*(repeat((a1, a2(a1)), b) for a1, a2, b in izip(A1, A2, B)))
>>> list(c)
[('A', 1), ('A', 1),
 (3, 9), (3, 9), (3, 9),
 ([1, 2], 3), ([1, 2], 3), ([1, 2], 3), ([1, 2], 3)]


The nice thing about this solution is that you don't have to actually store all of these elements; you can consume them lazily from the iterator.
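For example, you can pull just the first few expanded elements without ever building the rest (a small illustration along the same lines, not part of the original answer):

>>> from itertools import islice
>>> res = chain(*(repeat(*x) for x in izip(A, B)))
>>> list(islice(res, 4))
[100.4, 100.4, 100.4, 100.4]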

You can also use imap instead of a generator expression:

>>> from itertools import chain, izip, repeat, imap
>>> A1 = ['A', 3, [1,2]]
>>> A2 = ['C', 4, 12]
>>> B = [2, 3, 4]
>>> for x in chain(*imap(repeat, izip(A1, A2), B)):
...     print x
... 
('A', 'C')
('A', 'C')
(3, 4)
(3, 4)
(3, 4)
([1, 2], 12)
([1, 2], 12)
([1, 2], 12)
([1, 2], 12)

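Side note, not part of the original answer: on Python 3, izip and imap are gone from itertools because the built-in zip and map are already lazy, so an equivalent is:

>>> from itertools import chain, repeat
>>> list(chain.from_iterable(map(repeat, [100.4, 98.3], [2, 3])))
[100.4, 100.4, 98.3, 98.3, 98.3]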

+1




OK, thanks again everyone; this has been an extremely useful and instructive thread for my work. I am back from vacation and will now post my timing results, as requested by senderle. Please call me out if I have not optimally coded any of the suggested solutions.

First, here is my fake data, trading brevity for clarity (advice on cleaner multi-line formatting is appreciated):

import numpy as np

Ngrps = int(1.e6)
grp_prop1 = np.random.random(Ngrps)
grp_prop2 = np.random.random(Ngrps)
grp_prop3 = np.random.random(Ngrps)
grp_prop4 = np.random.random(Ngrps)
grp_prop5 = np.random.random(Ngrps)
grp_prop6 = np.random.random(Ngrps)
# group sizes drawn uniformly from 0..5 inclusive
grp_occupation = np.random.random_integers(0, 5, size=Ngrps)
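For anyone who wants to reproduce these numbers, here is one way to time a candidate (a sketch using the standard timeit module, not part of my original measurements):

import timeit
print timeit.timeit(lambda: np.repeat(grp_prop1, grp_occupation), number=10) / 10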


Let's start with what I think is the fastest algorithm, the NumPy solution suggested by Bob Haffner, which takes 0.15 seconds on my laptop:

mmbr_prop1=np.repeat(grp_prop1, grp_occupation)
mmbr_prop2=np.repeat(grp_prop2, grp_occupation)
mmbr_prop3=np.repeat(grp_prop3, grp_occupation)
mmbr_prop4=np.repeat(grp_prop4, grp_occupation)
mmbr_prop5=np.repeat(grp_prop5, grp_occupation)
mmbr_prop6=np.repeat(grp_prop6, grp_occupation)
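A side variation not shown in the thread: np.repeat accepts an axis argument, so the six calls can be collapsed into one by stacking the properties into a 2-D array first:

grps = np.vstack([grp_prop1, grp_prop2, grp_prop3,
                  grp_prop4, grp_prop5, grp_prop6])   # shape (6, Ngrps)
mmbr_props = np.repeat(grps, grp_occupation, axis=1)  # shape (6, total members)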


The next fastest, at 1.21 seconds, is the zipped list comprehension, a generalization of the one suggested by inspectorG4dget:

# zip returns a list here (Python 2), so it can be indexed directly
zipped_grps = zip(grp_prop1, grp_prop2, grp_prop3, grp_prop4, grp_prop5, grp_prop6)
zipped_mmbr_props = [zipped_grps[i] for i in range(len(grp_occupation)) for _ in range(grp_occupation[i])]


Zipping the groups first buys roughly a factor of 2 in speed: when I do not zip the group data, the list comprehension takes 2.71 seconds:

z=[(grp_prop1[i], grp_prop2[i], grp_prop3[i], grp_prop4[i], grp_prop5[i], grp_prop6[i]) for i in range(len(grp_occupation)) for _ in range(grp_occupation[i])]


The itertools solution suggested by Roman Pekar takes 2.4 seconds:

from itertools import chain, izip, repeat

zipped_grps = izip(grp_prop1, grp_prop2, grp_prop3, grp_prop4, grp_prop5, grp_prop6, grp_occupation)
c = chain(*(repeat((p1, p2, p3, p4, p5, p6), n) for p1, p2, p3, p4, p5, p6, n in zipped_grps))
# c is a lazy iterator; it must be consumed (e.g. list(c)) to actually build the rows


Finally, the for loop I originally wrote takes 4.8 seconds:

Ntot_mbrs = grp_occupation.sum()
# preallocate a (6, Ntot_mbrs) array, then fill each group's slice in turn
data = np.zeros(Ntot_mbrs*6).reshape(6, Ntot_mbrs)
first_index = 0
for i in range(len(grp_occupation)):
    data[0][first_index:first_index+grp_occupation[i]] = grp_prop1[i]
    data[1][first_index:first_index+grp_occupation[i]] = grp_prop2[i]
    data[2][first_index:first_index+grp_occupation[i]] = grp_prop3[i]
    data[3][first_index:first_index+grp_occupation[i]] = grp_prop4[i]
    data[4][first_index:first_index+grp_occupation[i]] = grp_prop5[i]
    data[5][first_index:first_index+grp_occupation[i]] = grp_prop6[i]
    first_index += grp_occupation[i]


So, thanks to the suggestions made in this thread, I have sped up my code by more than 30x. Thank you all very much!

0


