Fast indexing and partitioning of boolean data with Python

Using Python, I am running a simulation in which a community of species goes through a sequential set of time steps ("scenes"), in each of which an extinction occurs. From an initial set of N species, each extinction selects some number of survivors, which then form the pool to be sub-sampled at the next extinction. The number of survivors at each step is drawn at random from a binomial distribution, given the current community size and a per-species survival probability.

The examples below show a single chain of steps, but in practice the solution also needs to handle branching, where a community surviving at one time step splits into two separate trajectories, each of which then undergoes its own independent extinctions (a rough sketch of how that might be represented follows the diagram below).

As a sketch of the process:

1111111111111111  (Initial 16 species, choose 11 survivors)
0110110101101111  (11 species, choose 9 survivors)
0110110101100011  (9 species, choose 5 survivors)
0100100000100011  (End of simulation)
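
As a rough sketch of how branching might be represented (this assumes the index-array style used in option B below; branch_extinction and the two-way split are purely illustrative, not part of my current code):

import numpy as np

def branch_extinction(parent_idx, prob=0.99, n_branches=2):
    # from one surviving community (an array of species indices), draw
    # n_branches independent child communities, each with its own
    # binomially sampled number of survivors
    children = []
    for _ in range(n_branches):
        n_surv = np.random.binomial(len(parent_idx), prob)
        children.append(np.random.choice(parent_idx, size=n_surv, replace=False))
    return children

# e.g. split a community into two independent trajectories:
# left, right = branch_extinction(np.arange(1000))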

      

This process is run a lot and the communities can get quite large, so I'm trying to make it as fast as possible while keeping memory usage down. I currently have three competing implementations.

A) Using a numpy boolean matrix to store which species are alive at each time step. The original motivation was that this would have a low memory profile, storing just the presence / absence of each species, but numpy uses a full byte to store each boolean, so that's eight times more memory than I had hoped!
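
A quick way to confirm the one-byte-per-boolean storage (nothing specific to the simulation, just numpy's behaviour):

import numpy as np

a = np.zeros(1000, dtype=bool)
print(a.itemsize)  # 1 -> a full byte per boolean, not one bit
print(a.nbytes)    # 1000 bytes for 1000 presence/absence flags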

import numpy as np

def using_2D_matrix(nspp=1000, nscene=250):

    # define a matrix to hold the communities and 
    # set the initial community 
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1

    # loop over each extinction scene, looking up the indices
    # of live species and then selecting survivors
    for i in range(0, nscene - 1):
        candidates = np.where(m[i,])[0]
        n_surv = np.random.binomial(len(candidates), 0.99)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1

    return m

      

B) Storing a dictionary of 1D arrays containing the unique indices of the surviving species, which removes the need for np.where. It may use more memory because the species IDs probably need to be stored as uint32 or similar, but where extinction is heavy you only need to store a short list of indices rather than a whole row of a boolean array, so it depends on the case.
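
A rough back-of-the-envelope comparison of the two layouts for a single scene (assuming uint32 IDs; the numbers are arbitrary):

import numpy as np

nspp, n_surv = 1000, 50                        # heavy extinction: 50 survivors left
bool_row = np.zeros(nspp, dtype=bool)          # option A: one byte per species
idx_row = np.arange(n_surv, dtype=np.uint32)   # option B: four bytes per survivor
print(bool_row.nbytes, idx_row.nbytes)         # 1000 vs 200 bytes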

def using_dict_of_arrays(nspp=1000, nscene=250):

    # initialise a dictionary holding an array giving a 
    # unique integer to each species
    m = {0: np.arange(nspp)}

    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        n_surv = np.random.binomial(len(m[i]), 0.99)
        surv = np.random.choice(m[i], size=n_surv, replace=False)
        m[i + 1] = surv

    return m

      

Of these, B is faster by about 10-15%.

import timeit
A = timeit.Timer(using_2D_matrix)
A.timeit(100)
# 1.6549
B = timeit.Timer(using_dict_of_arrays)
B.timeit(100)
# 1.3580

      

C) Then I thought about using bitarray to keep the presence or absence of species in the communities compact, as actual bits. Bitwise operations could also make it efficient to compare overlap between communities (a small sketch of that follows the function below). So:

from bitarray import bitarray
import numpy as np

def using_bitarray(nspp=1000, nscene=250):
    # initialise the starting community
    m = {0: bitarray('1' * nspp)}

    for i in range(0, nscene - 1):
        # pick how many die and which they are (fewer bits to swap)
        n_die = np.random.binomial(m[i].count(True), 0.01)
        unlucky = np.random.choice(m[i].search(bitarray('1')), size=n_die, replace=False)
        # clone the source community and kill some off
        m[i + 1] = bitarray(m[i])
        for s in unlucky:
            m[i + 1][s] = False

    return m
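
The bitwise comparison mentioned above would look something like this (a small sketch; scenes 0 and 5 are arbitrary choices):

m = using_bitarray()

# species present in both scene 0 and scene 5, and how many there are
shared = m[0] & m[5]
print(shared.count(True))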

      

This is all good, but quite a bit slower.

C = timeit.Timer(using_bitarray)
C.timeit(100)
# 2.54035

      

Am I missing an approach that will run faster?



3 answers


You can speed this up by not finding and counting the survivors at every step.

Let p be the probability that a currently surviving species survives a given step. Instead of finding each survivor and deciding its fate individually, we can simply apply the survival test to every species with probability p, regardless of whether it is currently a survivor or not; species that are already extinct stay extinct, thanks to the AND with the previous row. Here's a short proof of concept.

import numpy as np

np.random.seed(42)

def test(nspp, nscene):
    # uint8 rows: 1 = alive, 0 = extinct
    m = np.zeros((nscene, nspp), dtype=np.uint8)
    m[0,] = 1
    for i in range(1, nscene):
        # survival test for every species; AND keeps extinct species extinct
        m[i] = m[i - 1] & (np.random.random(nspp) < 0.9)
    return m

m = test(10000, 10)
print(np.sum(m, axis=1))

      



Output

[10000  9039  8112  7298  6558  5912  5339  4829  4388  3939]

      

Of course, this approach means that you cannot specify the exact number of survivors at each step, but hopefully this is not necessary for your simulation.



Here's an alternative that's pretty fast:

def using_shuffled_array(nspp=1000, nscene=250):
    a = np.arange(nspp)
    np.random.shuffle(a)

    m = np.zeros(nscene, dtype=int)
    m[0] = nspp

    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        m[i + 1] = np.random.binomial(m[i], 0.99)

    return a, m

      

Instead of generating a separate array for each generation, it shuffles the initial sequence of species numbers once and then only records how many species survive at each generation. After calling a, m = using_shuffled_array(), a[:m[k]] gives the survivors at generation k.
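
If you do need a presence/absence row in the same layout as the 2D matrix, it can be rebuilt from the returned pair (just an illustration; k = 10 is an arbitrary scene index):

import numpy as np

a, m = using_shuffled_array()

k = 10                               # any scene index
survivors = a[:m[k]]                 # species IDs alive at scene k
mask = np.zeros(len(a), dtype=bool)  # same layout as one row of the 2D matrix
mask[survivors] = True
print(m[k], mask.sum())              # both give the number of survivors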



Here's a time comparison:

In [487]: %timeit using_dict_of_arrays()
100 loops, best of 3: 7.93 ms per loop

In [488]: %timeit using_shuffled_array()
1000 loops, best of 3: 607 Β΅s per loop

      



Proposed approach

The long version of your function, with the probability as a parameter and handling the case where candidates may be an empty array (in which case we need to break out of the loop), would look something like this:

def using_2D_matrix(nspp=1000, nscene=250, prob=0.99):
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1
    for i in range(0, nscene - 1):
        candidates = np.where(m[i,])[0]
        if len(candidates)==0:
            break    
        n_surv = np.random.binomial(len(candidates), prob)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1
    return m

      

Now, on closer inspection, we can see that the code basically picks random unique elements for each row and, for subsequent rows, keeps picking unique elements, but only from among those already selected for the previous row. The number of unique elements to select is governed by the probability parameter prob. So, with a high probability like 0.99, it will pick 99% of the elements for the second row, since for the first row we have already selected everything with m[0, ] = 1. Then for the third row 99% of the second row is selected, which is 0.99 * 0.99 = 98.01% of the original, and so on. So the pattern is that we select a fraction 0.99**[0, 1, 2, 3, ...] of the elements for each row, starting from the first row.
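
That expected pattern is easy to check directly; it is exactly what lims computes in the implementation further down:

import numpy as np

nspp, nscene, prob = 1000, 250, 0.99
expected = np.rint(nspp * prob ** np.arange(nscene)).astype(int)
print(expected[:8])   # [1000  990  980  970  961  951  941  932]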

The idea we can use here is this: if we create a 2D array with all possible column indices in each row, randomly shuffled, and then take the first 100% of the elements for the first row, the first 99% for the second row, the first 98.01% for the third row, and so on, those will be the column indices to set in the output mask array. That's the whole idea, and it gives us a vectorized solution!

Implementation -

def vectorized_app(nspp=1000, nscene=250, prob=0.99):
    r = np.arange(nscene)
    lims = np.rint(nspp*(prob**(r))).astype(int)

    rands = np.random.rand(nscene, nspp).argpartition(0,axis=1)

    mask = lims[:,None] > np.arange(nspp)
    row_idx = np.repeat(r,lims)
    col_idx = rands[mask]

    out = np.zeros((nscene, nspp), dtype='bool')
    out[row_idx, col_idx] = 1
    return out

      

Example run -

In [159]: out = vectorized_app(nspp=1000, nscene=250, prob=0.99)

In [160]: s = out.sum(1)

In [161]: s
Out[161]: 
array([1000,  990,  980,  970,  961,  951,  941,  932,  923,  914,  904,
        895,  886,  878,  869,  860,  851,  843,  835,  826,  818,  810,
        ...........................................
         88,   87,   86,   85,   84,   84,   83,   82])

      

Timing and related discussion

Let's check the performance -

In [119]: %timeit using_2D_matrix(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 8 ms per loop

In [120]: %timeit vectorized_app(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 3.76 ms per loop

In [121]: 8/3.76
Out[121]: 2.127659574468085

      

Now, the bottleneck for the proposed approach will be generating the random numbers, specifically the sheer quantity of random numbers needed. So the vectorized approach will be at a disadvantage when you are working with a large nspp and a relatively small nscene, which is what the iterative version loops over (a rough count of the random draws follows the timings below) -

In [143]: %timeit using_2D_matrix(nspp=10000, nscene=2500, prob=0.99)
10 loops, best of 3: 53.8 ms per loop

In [144]: %timeit vectorized_app(nspp=10000, nscene=2500, prob=0.99)
1 loops, best of 3: 309 ms per loop
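
One rough way to see that bottleneck is to count the random draws each version needs (a back-of-the-envelope sketch, assuming the loop's cost per scene scales with the number of live candidates):

import numpy as np

nspp, nscene, prob = 10000, 2500, 0.99
# vectorized: one uniform number per cell of the (nscene, nspp) grid
vec_draws = nspp * nscene
# loop: work per scene roughly proportional to the surviving candidates
loop_draws = int(np.sum(nspp * prob ** np.arange(nscene)))
print(vec_draws, loop_draws)  # 25000000 vs roughly 1000000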

      

With nscene being a large number, the results swing in favour of the vectorized approach -

In [145]: %timeit using_2D_matrix(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 10.6 ms per loop

In [146]: %timeit vectorized_app(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 3.24 ms per loop

In [147]: %timeit using_2D_matrix(nspp=10, nscene=2500, prob=0.99)
100 loops, best of 3: 5.72 ms per loop

In [148]: %timeit vectorized_app(nspp=10, nscene=2500, prob=0.99)
1000 loops, best of 3: 589 Β΅s per loop

      

Lessons learned

Working through the ideas that led to the proposed solution, the trick I learned in the process is that we can create unique random numbers per row by generating random numbers and then using np.argpartition. Here's a sample showing the unique elements in each row -

In [149]: np.random.rand(3, 4).argpartition(0,axis=1)
Out[149]: 
array([[3, 1, 2, 0],
       [0, 1, 2, 3],
       [1, 0, 2, 3]])

      
