Alternative to using square shape (Matlab)

I am currently using a function pdist

in Matlab to compute the Euclidean distances between different points in a 3D Cartesian system. I do this because I want to know which point has the smallest average distance to all other points (medoid). The syntax for pdist

looks like this:

% calculate distances between all points
distances = pdist(m);

      

But since pdist returns a one-dimensional array of distances, there is no easy way to figure out which point has the smallest average distance (directly). This is why I use squareform

and then calculates the smallest average distance, for example:

% convert found distances to matrix of distances
distanceMatrix = squareform(distances);

% find index of point with smallest average distance
[~,j] = min(mean(distanceMatrix,2));

      

The distances are averaged for each column, and the variable j

is the index for the column (and point) with the smallest average distance.

This works, but squareform takes a long time (this piece of code is repeated thousands of times), so I'm looking for a way to optimize it. Does anyone know of a faster way to output the point with the smallest average distance from the results pdist

?

+3


source to share


4 answers


I think for your task using the SQUAREFORM function this is the best way from a vectorization point of view. If you look at the content of this function at

edit squareform

      

You will see that it does a lot of checks that take time, of course. Since you know your square input and can be sure that it will work, you can create your custom function with just the square kernel.



[r, c] = size(m);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r),-1)) = distances;
distanceMatrix = distanceMatrix + distanceMatrix';

      

Then run the same code as you to find medioid.

+3


source


Here's an implementation that doesn't require accessing the form:

N1 = 10;
dim = 5;

% generate points
X = randn(N1, dim);

% find mean distance
for iter=N1:-1:1
    d_mean(iter) = mean(pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean'));
    % D(iter,:) = pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean');
end

[val ind] = min(d_mean);

      

But without knowing more about your problem, I have no idea if it will be faster.



If this is a feature for your program, you may need to consider other speedup options such as mex.

Good luck.

+1


source


When pdist calculates the distances between pairs of observations (1,2, ..., n), the distances are in the following order:

(2,1), (3,1), ..., (m, 1), (3,2), ..., (m, 2), ..., (m, m- 1))

To demonstrate this, try the following:

> X = [.2 .1 .7 .5]';
> D = pdist(X)
.1  .5  .3   .6  .4  .2

      

In this example, X stores n = 4 cases. Result D is a vector of distances between observations (2,1), (3,1), (4,1), (3,2), (4,2), (5,4). This layout corresponds to the entries of the lower triangular portion of the following n-by-n matrix:

M =
  0 0 0 0
.1 0 0 0
.5.6 0 0
.3.4.2 0

Note that D ( 1 ) = M ( 2,1 ), D ( 2 ) = ( 3,1 ), etc. So one way to get a pair of indices in M ​​that correspond to D (k) would be to compute the linear index of D (k) in M. This can be done like this:

% matrix size
n = 4;
% r(j) is the no. of elements in cols 1..j, belonging to the upper triangular part 
r = cumsum(1:n-1);       
% p(j) is the no. elements in cols 1..j, belonging to the lower triangular part 
p = cumsum(n-1:-1:1);
% The linear index of value D(k)
q = find(p >= k, 1);
% The subscript indices of value D(k)
[i j] = ind2sub([n n], k + r(q));

      

Note that n, r and p only need to be set once. From now on, you can find the index for any given k using the last two lines. Let's check it out:

for k = 1:6
   q = find(p >= k, 1);
   [i, j] = ind2sub([n n], k + r(q));
   fprintf('D(%d) is the distance between observations (%d %d)\n', k, i, j);
end

      

Here the conclusion:
D (1) - distance between observations (2 1)
D (2) - distance between observations (3 1)
D (3) - distance between observations (4 1)
D (4) - distance between observations (3 2 )
D (5) - distance between observations (4 2)
D (6) - distance between observations (4 3)

+1


source


No need to use squareform

:

distances = pdist(m);
l=length(distances);
n=(1+sqrt(1+4*l))/2;
m=[];
for i=1:n
  idx=[1+i:n:length(distances)];
  m(i)=mean(distances(idx));
end

j=min(m);

      

I'm not sure, but maybe it could be vectorized as well, but now it's too late.

0


source







All Articles