Poor performance of MATLAB submatrix multiplication

Multiplying submatrices in MATLAB seems to be significantly slower than multiplying the full matrix from which they are extracted.

>> m = rand(1e7, 40);
>> tic; m' * m; toc % (1)
Elapsed time is 0.245803 seconds.
>> tic; m(1:2.5e6, :)' * m(1:2.5e6, :); toc % (2)
Elapsed time is 1.810981 seconds.
>> tic; t = m(1:2.5e6, :); t' * t; toc % (3)
Elapsed time is 0.885764 seconds.


I was hoping there would be some quick in-place way to do this, since the data is already in memory, but there seems to be no way to prevent MATLAB from making an intermediate copy. (3) is faster, but it still forces us to make that copy.

Is there any way in MATLAB to make operations (such as slicing and transposing) execute quickly on a subset of a matrix, given that the indices are plain contiguous ranges?

Is a MEX file the only fast way to achieve this?

Edit: Laying the data out so that the submatrix is contiguous in memory (column-major order) makes things faster across the board (as expected), but multiplying submatrices is still much slower.

>> m = rand(40,1e7);
>> tic; m * m'; toc
Elapsed time is 0.251958 seconds.
>> tic; m(:, 1:2.5e6) * m(:, 1:2.5e6)'; toc
Elapsed time is 1.461321 seconds.
>> tic; s=m(:, 1:2.5e6); s * s'; toc
Elapsed time is 0.555667 seconds.


Obviously MATLAB makes copies when indexing. Is there a way to prevent this? It is clear that such a multiplication algorithm can exist without copying the data; I just want to know whether it can be expressed in MATLAB.



2 answers

MATLAB is indeed copy-on-write, which means that copying a matrix doesn't actually copy the data, at least until you try to change one of the two matrices that point to the same data. However, when you copy a part of a matrix, the data is always copied. That is, a matrix never points to a subset of another matrix's values. MATLAB always stores data contiguously in memory (in column-major order), so it is rare for the portion of a matrix you are indexing to be contiguous in memory itself. One way to see this is to use format debug and look at the data pointers:

>> format debug
>> q=rand(10,100)
q =
Structure address = 127e64560
m = 10
n = 100
pr = 7f9f5c835420
pi = 0
  Columns 1 through 9
    0.8147    0.1576    0.6557    0.7060    0.4387    0.2760    0.7513    0.8407    0.3517

>> q(:,1:2)
ans =
Structure address = 11dadd750
m = 10
n = 2
pr = 7f9f5b792700
pi = 0
  0.8147    0.1576


Notice how the value of pr (the pointer to the real data block) changes. This indicates that the data has been copied.

If you cannot use vectorized code (i.e. operate on the whole matrix at once), your best option for speeding up the computation is a MEX file.



Some more test results follow.

Five methods of performing the matrix multiplication were tested, 3 of which involve submatrices. As @Cris pointed out, using rand(1e7, 40) versus its transpose could make a difference, so a second data set with the other orientation was tested as well.

To improve accuracy, a 10-iteration loop and the Profiler were used. Tests were run on an i7 with ample RAM.

Test code

% clear;clc;close all

A = rand(1e7, 40);
for ii = 1:10
    m = A;
    m(end) = m(end-1);

    m' * m; % 1

    n = m';
    n * m; % 2

    m(1:2.5e6, :)' * m(1:2.5e6, :); % 3

    t = m(1:2.5e6, :);
    t' * t; % 4

    t2 = t';
    t2 * t; % 5

    clearvars -except A
end

clearvars A
B = rand(40, 1e7);
for ii = 1:10
    m = B;
    m(end) = m(end-1);

    m * m'; % 1

    n = m';
    m * n; % 2

    m(:,1:2.5e6) * m(:,1:2.5e6)'; % 3

    t = m(:,1:2.5e6);
    t * t'; % 4

    t2 = t';
    t * t2; % 5

    clearvars -except B
end


Profiling result

[Profiler screenshot: total time for each method]


  • Any time difference under 10% is considered identical.
  • At the beginning of each loop, the matrix is forced to be copied by a trivial assignment. This copy is rather slow (11.5 sec).
  • For the full matrix, there is no difference between the MxN and NxM layouts. For the submatrix tests, however, the 40x2.5e6 layout is clearly faster.
  • Buffering the transpose in a variable (methods 2 and 5) slows the whole process down by about 100%. I think this is because MATLAB loses the ability to optimize: presumably, when one types m'*m, it recognizes the pattern and skips the explicit transpose operation.
  • Indexing a submatrix and then transposing it inline (method 3) is not optimized.
  • Buffering the subset in a variable (method 4) speeds up the multiplication.

