For each element in X, find the index of the largest without going over to Y

I am looking for a way to improve the performance of the following algorithm. For two arrays X and Y.

For each element of X, find the index of the largest value in Y that does not exceed the value of the element in X. It is safe to assume that X and Y are monotonically increasing (sorted) and that Y (1) is less than each value in X. Also , X is usually much larger than Y.

The following is given as an example.

X = [0.2, 1.5, 2.2, 2.5, 3.5, 4.5, 5.5, 5.8, 6.5];
Y = [0.0, 1.0, 3.0, 4.0, 6.0];

      

I expect the output to be

idx = [1, 2, 2, 2, 3, 4, 4, 4, 5]

      

The fastest way I have come across is the function below which fails to take advantage of the fact that the lists are sorted and uses a for loop to step through one of the arrays. This gives the correct solution, but in the experiments in which I use this feature, almost 27 minutes are spent here out of the total 30 minute analysis that needs to be done.

function idx = matchintervals(X,Y)
  idx = zeros(size(X));
  for i = 1:length(Y)-1
    idx(X >= Y(i) & X < Y(i+1)) = i;
  end
  idx(X >= Y(end)) = length(Y);
end

      

Any help is greatly appreciated.

+3


source to share


4 answers


If you're looking for the fastest solution, it might turn out to be a simple so-called while loop (which takes advantage of the fact that the arrays are sorted):

X = [0.2, 1.5, 2.2, 2.5, 3.5, 4.5, 5.5, 5.8, 6.5];
Y = [0.0, 1.0, 3.0, 4.0, 6.0];

xIndex = 1;
nX = numel(X);
yIndex = 1;
nY = numel(Y);
index = zeros(size(X))+nY;  % Prefill index with the largest index in Y

while (yIndex < nY) && (xIndex <= nX)
  if X(xIndex) < Y(yIndex+1)
    index(xIndex) = yIndex;
    xIndex = xIndex+1;
  else
    yIndex = yIndex+1;
  end
end

>> index

index =

     1     2     2     2     3     4     4     4     5

      

This loop will repeat a maximum of numel(X)+numel(Y)-1

times, which is potentially less if in X

there are many values ​​that are greater than the largest value in Y

.



TIME: I ran a few timings with the sample data from the comment. Here are the results sorted from fastest to slowest:

X = 1:3:(4e5);
Y = 0:20:(4e5-1);

% My solution from above:
tElapsed =
   0.003005977477718 seconds

% knedlsepp solution:
tElapsed =
   0.006939387719075 seconds

% Divakar solution:
tElapsed =
   0.011801273498343 seconds

% H.Muster solution:
tElapsed =
   4.081793325423575 seconds

      

+4


source


One-liner, but probably slower than gnovice's solution:



idx = sum(bsxfun(@ge, X, Y'));

      

+4


source


Using sort

and several masks

-

%// Concatenate X and Y and find the sorted indices
[sXY,sorted_id] = sort([X Y]);

%// Take care of sorted_id for identical values between X and Y
dup_id = find(diff(sXY)==0);
tmp = sorted_id(dup_id);
sorted_id(dup_id) = sorted_id(dup_id+1);
sorted_id(dup_id+1) = tmp;

%// Mask of Y elements in XY array
maskY = sorted_id>numel(X);

%// Find island lengths of Y elements in concatenated XY array
diff_maskY = diff([false maskY false]);
island_lens = find(diff_maskY ==-1) - find(diff_maskY ==1);

%// Create a mask of double datatype with 1s where Y intervals change
mask_Ys = [ false maskY(1:end-1)];
mask_Ysd = double(mask_Ys(~maskY));

%// Incorporate island lengths to change the 1s by offsetted island lengths
valid = mask_Ysd==1;
mask_Ysd(valid) = mask_Ysd(valid) + island_lens(1:sum(valid)) - 1;

%// Finally perform cumsum to get the output indices
idx = cumsum(mask_Ysd);

      

+2


source


I had the same idea as Divakar. This basically finds the insertion points of values X

after values Y

using stable sort

. To do this, you need to sort both X

and Y

.

%// Calculate the entry points
[~,I] = sort([Y,X]);
whereAreXs = I>numel(Y);
idx = find(whereAreXs)-(1:numel(X));

      

You can view the values X

and corresponding values Y

that do not exceed the values X

with:

%%// Output:
disp([X;Y(idx)]);

      

+2


source







All Articles