Boolean vector based column selection in numpy

I have two NumPy arrays a, b sized m by n. I have a Boolean vector b of length n, and I want to create a new array c that selects the n columns from a, b, so that if b[i] is true, I take the column from b, otherwise from a.

How can I do this in the most efficient way? I looked at select, where and choose.

+3




3 answers


First of all, let's set up some example data:

import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
b = np.ones((m, n))

boolvec = np.random.randint(0, 2, m).astype(bool)

      

Just to show what this data looks like:

In [2]: a
Out[2]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [3]: b
Out[3]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [4]: boolvec
Out[4]: array([ True,  True, False, False, False], dtype=bool)

      



In this case, the most efficient approach is np.where. However, boolvec (which has length m here, one flag per row) needs a shape that can be broadcast against a and b. So we can make it a column vector by indexing it with np.newaxis or None (they are the same):

In [5]: boolvec[:,None]
Out[5]: 
array([[ True],
       [ True],
       [False],
       [False],
       [False]], dtype=bool)
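To make the broadcasting step concrete, here is a minimal sketch that checks the shapes involved. It reuses the names from the example but fixes boolvec to the values shown in the session, and np.broadcast_to is used only as a way to inspect the result; it is not part of the original answer.

```python
import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
# Fixed values matching the session above (one flag per row).
boolvec = np.array([True, True, False, False, False])

# Shape (m, 1): the single column is stretched across all n columns.
col = boolvec[:, None]
stretched = np.broadcast_to(col, a.shape)
print(col.shape, stretched.shape)

# The flat length-m vector is aligned with the last axis (length n),
# so broadcasting against (m, n) fails when m != n.
try:
    np.broadcast_to(boolvec, a.shape)
    flat_ok = True
except ValueError:
    flat_ok = False
print(flat_ok)
```

Each row of stretched repeats that row's flag, which is what lets np.where pick whole rows at once.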

      

And then we can build the final result with np.where:

In [6]: c = np.where(boolvec[:, None], a, b)

In [7]: c
Out[7]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
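The example above picks whole rows and takes a where the flag is true. The question asks for columns, with b taken where the flag is true. A minimal sketch of that variant follows, assuming a hypothetical length-n mask named colmask (not in the original). A length-n vector already lines up with the last axis, so no reshaping should be needed.

```python
import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
b = np.ones((m, n))

# Hypothetical column mask of length n: True means "take this column from b".
colmask = np.array([True, False, True])

# np.where(cond, x, y) takes x where cond is true, so b goes first here.
c = np.where(colmask, b, a)
print(c)
```

With this mask, columns 0 and 2 come from b and column 1 comes from a.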

      

+4




You can use np.choose for this.

For example, take these a and b arrays:

>>> import numpy as np
>>> a = np.arange(12).reshape(3,4)
>>> b = np.arange(12).reshape(3,4) + 100
>>> a_and_b = np.array([a, b])

      

To use np.choose, we need a 3D array that stacks the two arrays; a_and_b looks as follows:



array([[[  0,   1,   2,   3],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[100, 101, 102, 103],
        [104, 105, 106, 107],
        [108, 109, 110, 111]]])

      

Now let the selector array be bl = np.array([0, 1, 1, 0]), where 0 picks from a and 1 picks from b. Then:

>>> np.choose(bl, a_and_b)
array([[  0, 101, 102,   3],
       [  4, 105, 106,   7],
       [  8, 109, 110,  11]])
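If the selector starts out as a true Boolean vector, as in the question, one option is to convert it to integers before handing it to np.choose. The sketch below does that and checks the result against np.where. The name mask is an assumption for illustration, and the explicit astype(int) is there only to make the index semantics obvious.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = a + 100

# Boolean counterpart of bl = [0, 1, 1, 0]; the name is illustrative only.
mask = np.array([False, True, True, False])

# False -> index 0 (a), True -> index 1 (b).
via_choose = np.choose(mask.astype(int), [a, b])

# Same selection expressed with np.where (b where mask is true).
via_where = np.where(mask, b, a)

same = np.array_equal(via_choose, via_where)
print(via_choose)
print(same)
```

Passing the plain list [a, b] works as well as the stacked a_and_b array, since np.choose accepts a sequence of arrays.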

      

+4




Timing for (5000,3000) arrays:

In [107]: timeit np.where(boolvec[:,None],b,a)
1 loops, best of 3: 993 ms per loop

In [108]: timeit np.choose(boolvec[:,None],[a,b])
1 loops, best of 3: 929 ms per loop

In [109]: timeit c=a[:];c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 786 ms per loop

      

np.where and np.choose are essentially the same; boolean indexing is slightly faster. np.select uses np.choose, so I didn't time it.


My timings for selecting columns are similar, except that indexing is slower:

In [119]: timeit np.where(cols,b,a)
1 loops, best of 3: 878 ms per loop

In [120]: timeit np.choose(cols,[a,b])
1 loops, best of 3: 915 ms per loop

In [121]: timeit c=a[:];c[:,cols]=b[:,cols]
1 loops, best of 3: 1.25 s per loop

      

Correction: for the indexing approach I have to use a.copy(), because a[:] is only a view and assigning into it would modify a itself.

In [32]: timeit c=a.copy();c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 783 ms per loop
In [33]: timeit c=a.copy();c[:,cols]=b[:,cols]
1 loops, best of 3: 1.44 s per loop
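The reason for the correction can be checked directly. This is a small sketch with toy arrays (sizes and values are assumptions, not the arrays that were timed) showing that a[:] shares memory with a while a.copy() does not.

```python
import numpy as np

a = np.zeros((5, 3))
b = np.ones((5, 3))
boolvec = np.array([True, True, False, False, False])

view = a[:]       # basic slicing returns a view onto the same buffer
copy = a.copy()   # independent buffer

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copy)

# Writing into the copy leaves a alone.
copy[boolvec, :] = b[boolvec, :]
a_untouched = not a.any()

# Writing into the view changes a as well.
view[boolvec, :] = b[boolvec, :]
a_modified = bool(a.any())

print(shares_view, shares_copy, a_untouched, a_modified)
```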

      

I get the same timings with Python 2.7 and 3, and with NumPy 1.8.2 and 1.9.0 dev.
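For anyone who wants to rerun the column comparison outside IPython, here is a rough sketch using the standard timeit module. The array size is reduced so it finishes quickly. The definition of cols is an assumption, since the original does not show it. The helper function names are made up for this example. Absolute numbers will depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

m, n = 500, 300  # smaller than the (5000, 3000) arrays timed above
rng = np.random.default_rng(0)
a = rng.random((m, n))
b = rng.random((m, n))
# Assumed definition: a random length-n Boolean column mask.
cols = rng.integers(0, 2, n).astype(bool)

def with_where():
    return np.where(cols, b, a)

def with_choose():
    # Explicit integer indices: 0 selects a, 1 selects b.
    return np.choose(cols.astype(int), [a, b])

def with_indexing():
    c = a.copy()  # copy first so a is not modified
    c[:, cols] = b[:, cols]
    return c

for fn in (with_where, with_choose, with_indexing):
    best = min(timeit.repeat(fn, number=20, repeat=3))
    print(f"{fn.__name__}: {best / 20 * 1e3:.3f} ms per call")
```

All three helpers should return the same array, so the comparison is only about speed.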

+3

