Finding sorted items in a sorted sequence
I want to find a sequence of elements in a sorted array of values. I know that with numpy I can do:
l = np.searchsorted(values, items)
This has O(len(items) * log(len(values))) complexity. However, my items are sorted as well, so I can do this in O(len(items) + len(values)):
l = np.zeros(items.size, dtype=np.int32)
k, K = 0, len(values)
for i in range(len(items)):
    # advance the shared cursor until values[k] >= items[i]
    while k < K and values[k] < items[i]:
        k += 1
    l[i] = k
The problem is that this pure Python version is slower than searchsorted because of the Python loop, even for large len(items) and len(values) (~10^6).
Any idea how to "vectorize" this loop with numpy?
Some sample data:
# some example data
np.random.seed(0)
n_values = 1000000
n_items = 100000
values = np.random.rand(n_values)
items = np.random.rand(n_items)
values.sort()
items.sort()
The original code snippet, along with an implementation of @PeterE's suggestion:
def original(values, items):
    l = np.empty(items.size, dtype=np.int32)
    k, K = 0, len(values)
    for i, item in enumerate(items):
        while k < K and values[k] < item:
            k += 1
        l[i] = k
    return l
def peter_e(values, items):
    l = np.empty(items.size, dtype=np.int32)
    last_idx = 0
    for i, item in enumerate(items):
        # restrict each binary search to the part of values after the
        # previous match
        last_idx += values[last_idx:].searchsorted(item)
        l[i] = last_idx
    return l
Validate against the naive np.searchsorted:
ss = values.searchsorted(items)
print(all(original(values, items) == ss))
# True
print(all(peter_e(values, items) == ss))
# True
Timings:
In [1]: %timeit original(values, items)
10 loops, best of 3: 115 ms per loop
In [2]: %timeit peter_e(values, items)
10 loops, best of 3: 79.8 ms per loop
In [3]: %timeit values.searchsorted(items)
100 loops, best of 3: 4.09 ms per loop
So, for inputs of this size, naive use of np.searchsorted comfortably beats both your original code and PeterE's suggestion.
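As an aside, the two-pointer scan can also be expressed in pure numpy via a stable argsort of the concatenation of the two arrays. This is only a sketch (the name merge_based is made up here, and the argsort is O((m + n) log(m + n)) in general, so it is not guaranteed to win):
def merge_based(values, items):
    # A stable argsort of the concatenation interleaves the two sorted runs.
    # Putting items first makes an item sort before any equal value, which
    # matches searchsorted's default side='left' tie handling.
    order = np.argsort(np.concatenate([items, values]), kind='stable')
    # Positions in the merged order that are occupied by elements of items.
    positions = np.nonzero(order < len(items))[0]
    # Each item's insertion index is its merged position minus the number
    # of items preceding it in the merged order.
    return positions - np.arange(len(items))
numpy's stable sort for floats is a timsort variant that detects pre-sorted runs, but whether this actually beats searchsorted would need measuring.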
Update
To avoid any caching effects that could skew the timings, we can create a new set of random input arrays for each iteration of the benchmark:
In [1]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
original(values, items)
.....:
10 loops, best of 3: 115 ms per loop
In [2]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
peter_e(values, items)
.....:
10 loops, best of 3: 79.9 ms per loop
In [3]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
values.searchsorted(items)
.....:
100 loops, best of 3: 4.08 ms per loop
Update 2
It's not that hard to write a Cython function that will beat np.searchsorted for the case where both values and items are sorted.
search_doubly_sorted.pyx:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def search_doubly_sorted(values, items):
    cdef:
        double[:] _values = values.astype(np.double)
        double[:] _items = items.astype(np.double)
        long n_items = items.shape[0]
        long n_values = values.shape[0]
        long[:] out = np.empty(n_items, dtype=np.int64)
        long ii, last_idx

    last_idx = 0
    for ii in range(n_items):
        # advance the shared cursor until values[last_idx] >= items[ii];
        # the cursor never moves backwards, so the total work is
        # O(n_items + n_values). If an item exceeds every value this
        # yields n_values, matching searchsorted.
        while last_idx < n_values and _values[last_idx] < _items[ii]:
            last_idx += 1
        out[ii] = last_idx

    return out.base
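The extension has to be compiled before it can be imported. A minimal build script, as a sketch (it assumes Cython and the numpy headers are installed; run with python setup.py build_ext --inplace):
# setup.py
import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("search_doubly_sorted.pyx"),
    include_dirs=[np.get_include()],
)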
Check correctness:
In [1]: from search_doubly_sorted import search_doubly_sorted
In [2]: print(all(search_doubly_sorted(values, items) == values.searchsorted(items)))
# True
Benchmark:
In [3]: %timeit values.searchsorted(items)
100 loops, best of 3: 4.07 ms per loop
In [4]: %timeit search_doubly_sorted(values, items)
1000 loops, best of 3: 1.44 ms per loop
However, the performance improvement is fairly marginal in absolute terms. Unless this is a serious bottleneck in your code, you should probably stick with np.searchsorted.
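If compiling a Cython extension is a hassle, a similar speed-up can often be had by JIT-compiling the original loop; a sketch assuming Numba is installed (the function name is made up here, and it is not benchmarked):
import numpy as np
from numba import njit

@njit
def search_doubly_sorted_nb(values, items):
    out = np.empty(items.size, dtype=np.int64)
    last_idx = 0
    for i in range(items.size):
        # same shared-cursor scan as the Cython version above
        while last_idx < values.size and values[last_idx] < items[i]:
            last_idx += 1
        out[i] = last_idx
    return out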