Finding sorted items in a sorted sequence
I want to find a sequence of elements in a sorted array of values. I know that with numpy I can do:
l = np.searchsorted(values, items)
This has O(len(items) * log(len(values))) complexity. However, my items are sorted as well, so I can do this in O(len(items) + len(values)):
l = np.zeros(items.size, dtype=np.int32)
k, K = 0, len(values)
for i in range(len(items)):
    # advance the shared cursor until values[k] >= items[i]
    while k < K and values[k] < items[i]:
        k += 1
    l[i] = k
The problem is that this pure Python version is slower than searchsorted because of the Python loop, even for large len(items) and len(values) (~10^6).
Any idea how to "vectorize" this loop with numpy?
Some sample data:
# some example data
np.random.seed(0)
n_values = 1000000
n_items = 100000
values = np.random.rand(n_values)
items = np.random.rand(n_items)
values.sort()
items.sort()
The original code snippet, along with an implementation of @PeterE's suggestion:
def original(values, items):
    l = np.empty(items.size, dtype=np.int32)
    k, K = 0, len(values)
    for i, item in enumerate(items):
        while k < K and values[k] < item:
            k += 1
        l[i] = k
    return l
def peter_e(values, items):
    l = np.empty(items.size, dtype=np.int32)
    last_idx = 0
    for i, item in enumerate(items):
        # restrict each binary search to the part of values after the
        # previous match
        last_idx += values[last_idx:].searchsorted(item)
        l[i] = last_idx
    return l
Validate against the naive np.searchsorted:
ss = values.searchsorted(items)
print(all(original(values, items) == ss))
# True
print(all(peter_e(values, items) == ss))
# True
Timings:
In [1]: %timeit original(values, items)
10 loops, best of 3: 115 ms per loop
In [2]: %timeit peter_e(values, items)
10 loops, best of 3: 79.8 ms per loop
In [3]: %timeit values.searchsorted(items)
100 loops, best of 3: 4.09 ms per loop
So, for inputs of this size, naive use of np.searchsorted comfortably beats both your original code and PeterE's suggestion.
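As an aside, the two-pointer scan can also be expressed in pure numpy via a stable argsort of the concatenation of the two arrays. This is only a sketch (the name merge_based is made up here, and the argsort is O((m + n) log(m + n)) in general, so it is not guaranteed to win):
def merge_based(values, items):
    # A stable argsort of the concatenation interleaves the two sorted runs.
    # Putting items first makes an item sort before any equal value, which
    # matches searchsorted's default side='left' tie handling.
    order = np.argsort(np.concatenate([items, values]), kind='stable')
    # Positions in the merged order that are occupied by elements of items.
    positions = np.nonzero(order < len(items))[0]
    # Each item's insertion index is its merged position minus the number
    # of items preceding it in the merged order.
    return positions - np.arange(len(items))
numpy's stable sort for floats is a timsort variant that detects pre-sorted runs, but whether this actually beats searchsorted would need measuring.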
Update
To avoid any caching effects that could skew the timings, we can create a new set of random input arrays for each iteration of the benchmark:
In [1]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
original(values, items)
.....:
10 loops, best of 3: 115 ms per loop
In [2]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
peter_e(values, items)
.....:
10 loops, best of 3: 79.9 ms per loop
In [3]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
values.searchsorted(items)
.....:
100 loops, best of 3: 4.08 ms per loop
Update 2
It's not that hard to write a Cython function that will beat np.searchsorted for the case where both values and items are sorted.
search_doubly_sorted.pyx:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def search_doubly_sorted(values, items):
    cdef:
        double[:] _values = values.astype(np.double)
        double[:] _items = items.astype(np.double)
        long n_items = items.shape[0]
        long n_values = values.shape[0]
        long[:] out = np.empty(n_items, dtype=np.int64)
        long ii, last_idx

    last_idx = 0
    for ii in range(n_items):
        # advance the shared cursor until values[last_idx] >= items[ii];
        # the cursor never moves backwards, so the total work is
        # O(n_items + n_values). If an item exceeds every value this
        # yields n_values, matching searchsorted.
        while last_idx < n_values and _values[last_idx] < _items[ii]:
            last_idx += 1
        out[ii] = last_idx

    return out.base
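The extension has to be compiled before it can be imported. A minimal build script, as a sketch (it assumes Cython and the numpy headers are installed; run with python setup.py build_ext --inplace):
# setup.py
import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("search_doubly_sorted.pyx"),
    include_dirs=[np.get_include()],
)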
Check correctness:
In [1]: from search_doubly_sorted import search_doubly_sorted
In [2]: print(all(search_doubly_sorted(values, items) == values.searchsorted(items)))
# True
Benchmark:
In [3]: %timeit values.searchsorted(items)
100 loops, best of 3: 4.07 ms per loop
In [4]: %timeit search_doubly_sorted(values, items)
1000 loops, best of 3: 1.44 ms per loop
However, the performance improvement is fairly marginal in absolute terms. Unless this is a serious bottleneck in your code, you should probably stick with np.searchsorted.
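If compiling a Cython extension is a hassle, a similar speed-up can often be had by JIT-compiling the original loop; a sketch assuming Numba is installed (the function name is made up here, and it is not benchmarked):
import numpy as np
from numba import njit

@njit
def search_doubly_sorted_nb(values, items):
    out = np.empty(items.size, dtype=np.int64)
    last_idx = 0
    for i in range(items.size):
        # same shared-cursor scan as the Cython version above
        while last_idx < values.size and values[last_idx] < items[i]:
            last_idx += 1
        out[i] = last_idx
    return out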