For each element in X, find the index of the largest without going over to Y
I am looking for a way to improve the performance of the following algorithm. For two arrays X and Y.
For each element of X, find the index of the largest value in Y that does not exceed the value of the element in X. It is safe to assume that X and Y are monotonically increasing (sorted) and that Y (1) is less than each value in X. Also , X is usually much larger than Y.
The following is given as an example.
X = [0.2, 1.5, 2.2, 2.5, 3.5, 4.5, 5.5, 5.8, 6.5];
Y = [0.0, 1.0, 3.0, 4.0, 6.0];
I expect the output to be
idx = [1, 2, 2, 2, 3, 4, 4, 4, 5]
The fastest way I have come across is the function below which fails to take advantage of the fact that the lists are sorted and uses a for loop to step through one of the arrays. This gives the correct solution, but in the experiments in which I use this feature, almost 27 minutes are spent here out of the total 30 minute analysis that needs to be done.
function idx = matchintervals(X,Y)
idx = zeros(size(X));
for i = 1:length(Y)-1
idx(X >= Y(i) & X < Y(i+1)) = i;
end
idx(X >= Y(end)) = length(Y);
end
Any help is greatly appreciated.
source to share
If you're looking for the fastest solution, it might turn out to be a simple so-called while loop (which takes advantage of the fact that the arrays are sorted):
X = [0.2, 1.5, 2.2, 2.5, 3.5, 4.5, 5.5, 5.8, 6.5];
Y = [0.0, 1.0, 3.0, 4.0, 6.0];
xIndex = 1;
nX = numel(X);
yIndex = 1;
nY = numel(Y);
index = zeros(size(X))+nY; % Prefill index with the largest index in Y
while (yIndex < nY) && (xIndex <= nX)
if X(xIndex) < Y(yIndex+1)
index(xIndex) = yIndex;
xIndex = xIndex+1;
else
yIndex = yIndex+1;
end
end
>> index
index =
1 2 2 2 3 4 4 4 5
This loop will repeat a maximum of numel(X)+numel(Y)-1
times, which is potentially less if in X
there are many values that are greater than the largest value in Y
.
TIME: I ran a few timings with the sample data from the comment. Here are the results sorted from fastest to slowest:
X = 1:3:(4e5); Y = 0:20:(4e5-1); % My solution from above: tElapsed = 0.003005977477718 seconds % knedlsepp solution: tElapsed = 0.006939387719075 seconds % Divakar solution: tElapsed = 0.011801273498343 seconds % H.Muster solution: tElapsed = 4.081793325423575 seconds
source to share
Using sort
and several masks
-
%// Concatenate X and Y and find the sorted indices
[sXY,sorted_id] = sort([X Y]);
%// Take care of sorted_id for identical values between X and Y
dup_id = find(diff(sXY)==0);
tmp = sorted_id(dup_id);
sorted_id(dup_id) = sorted_id(dup_id+1);
sorted_id(dup_id+1) = tmp;
%// Mask of Y elements in XY array
maskY = sorted_id>numel(X);
%// Find island lengths of Y elements in concatenated XY array
diff_maskY = diff([false maskY false]);
island_lens = find(diff_maskY ==-1) - find(diff_maskY ==1);
%// Create a mask of double datatype with 1s where Y intervals change
mask_Ys = [ false maskY(1:end-1)];
mask_Ysd = double(mask_Ys(~maskY));
%// Incorporate island lengths to change the 1s by offsetted island lengths
valid = mask_Ysd==1;
mask_Ysd(valid) = mask_Ysd(valid) + island_lens(1:sum(valid)) - 1;
%// Finally perform cumsum to get the output indices
idx = cumsum(mask_Ysd);
source to share
I had the same idea as Divakar. This basically finds the insertion points of values X
after values Y
using stable sort
. To do this, you need to sort both X
and Y
.
%// Calculate the entry points [~,I] = sort([Y,X]); whereAreXs = I>numel(Y); idx = find(whereAreXs)-(1:numel(X));
You can view the values X
and corresponding values Y
that do not exceed the values X
with:
%%// Output:
disp([X;Y(idx)]);
source to share