Fortran matrix multiplication performance at different optimization levels

I am reading the book Science Software with Fortran and there is an exercise in it that I think is very interesting:

"Create a Fortran MatrixMultiplyModule module. Add three subroutines LoopMatrixMultiply, IntrinsicMatrixMultiply and MixMatrixMultiply to it. Each procedure must take two real matrices as an argument, multiply the matrix, and return the result through the third argument. LoopMatrixMultiply must be written entirely with do loops and without operations arrays or internal routines, IntrinsicMatrixMultiply should be written using the matmul internal function, and MixMatrixMultiply should be written using some do loops and the dot_product internal function. Write a small program to test the performance of these three different ways of doing matrix multiplication for different matrix sizes. "

I did some tests multiplying two rank-2 matrices, and here are the results under different optimization flags:

[timing plots for the different optimization flags omitted]

Compiler: ifort version 13.0.0 on Mac

Here is my question:

Why do all three have roughly the same performance at -O0, but matmul gains so much from -O3 while the explicit loop and dot_product versions gain far less? Also, why does dot_product seem to perform about the same as the explicit do loops?

The code I'm using is the following:

module MatrixMultiplyModule

contains
    subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j,k
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=0.

        do i=1,m
            do j=1,n
                do k=1,size(mtx1,dim=2)
                    mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                end do      
            end do
        end do
    end subroutine

    subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=matmul(mtx1,mtx2)

    end subroutine

    subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))

        do i=1,m
            do j=1,n
                    mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
            end do
        end do

    end subroutine

end module




program main
use MatrixMultiplyModule
implicit none

real,allocatable        ::  a(:,:),b(:,:)
real,allocatable    ::  c1(:,:),c2(:,:),c3(:,:)
integer ::  n
integer ::  count, rate
real    ::  timeAtStart, timeAtEnd
real    ::  time(3,10)
do n=100,1000,100
    allocate(a(n,n),b(n,n))

    call random_number(a)
    call random_number(b)

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call LoopMatrixMultiply(a,b,c1)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(1,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call IntrinsicMatrixMultiply(a,b,c2)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(2,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call MixMatrixMultiply(a,b,c3)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(3,n/100)=timeAtEnd-timeAtStart


    deallocate(a,b)

end do

open(1,file="time.txt")
do n=1,10
    write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program

      


1 answer


There are several things to be aware of when looping over array elements:

  • Make sure the innermost loop runs over consecutive elements in memory. In your current algorithm, the innermost loop of LoopMatrixMultiply is over the index k. Since Fortran stores matrices column by column (the first index varies fastest as you move through memory), each new value of k may force new data to be loaded into the cache. In this case, you can optimize the algorithm by reordering the loops like this:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    
          

    Now the innermost loop runs over sequential elements in memory (the value mtx2(k,j) will most likely be fetched only once before the inner loop by the compiler; if not, you can store it in a temporary variable yourself before the loop, as in the first sketch after this list).

  • Make sure the data used inside the loops fits in the cache, to avoid too many cache misses. This can be done by blocking the algorithm. In this case, a solution could be, for example:

    l=size(mtx1,dim=2)
    ichunk=512 ! I have a 3MB cache size (real*4)
    do jj=1,n,ichunk
        do kk=1,l,ichunk

            do j=jj,min(jj+ichunk-1,n)
                do k=kk,min(kk+ichunk-1,l)
                    do i=1,m
                        mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                    end do
                end do
            end do

        end do
    end do

    In this case performance will depend on the size of ichunk, especially for large enough matrices (you could block the innermost loop as well; this is just an example).

  • Make sure that the work needed to manage the loop itself is much less than the work inside the loop. This can be addressed by "loop unrolling", i.e. combining several statements into one iteration of the loop. Usually the compiler can do this when given the flag -funroll-loops; a hand-unrolled sketch follows this list.
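
To make the first point concrete, here is a minimal sketch of the reordered loops with mtx2(k,j) stored in a temporary variable before the innermost loop. The variable tmp is a new local introduced here purely for illustration; a modern optimizing compiler will usually do this hoisting on its own, so treat it as a sketch rather than a required change:

    real :: tmp   ! add to the declarations of LoopMatrixMultiply

    do j=1,n
        do k=1,size(mtx1,dim=2)
            tmp=mtx2(k,j)          ! fixed for the whole inner loop
            do i=1,m               ! innermost loop walks down a column: contiguous access
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*tmp
            end do
        end do
    end do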
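
Likewise, a hand-unrolled version of the same innermost loop illustrates what -funroll-loops asks the compiler to do. The unroll factor 4 is an arbitrary choice for this sketch, and the second loop mops up the iterations left over when m is not a multiple of 4:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            tmp=mtx2(k,j)                  ! tmp declared as in the previous sketch
            do i=1,m-mod(m,4),4            ! 4-way unrolled body
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*tmp
                mtx3(i+1,j)=mtx3(i+1,j)+mtx1(i+1,k)*tmp
                mtx3(i+2,j)=mtx3(i+2,j)+mtx1(i+2,k)*tmp
                mtx3(i+3,j)=mtx3(i+3,j)+mtx1(i+3,k)*tmp
            end do
            do i=m-mod(m,4)+1,m            ! remainder iterations
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*tmp
            end do
        end do
    end do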



If I use the blocked code above and compile with the flags -O3 -funroll-loops, I get slightly better performance than with matmul.
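
For reference, the flags go on the compile line like this (shown with gfortran and a placeholder file name; ifort, which the question uses, also accepts -O3 and has its own loop-unrolling options):

    gfortran -O3 -funroll-loops -o matmul_test matmul_test.f90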

The important thing to remember out of these three points is the first one about loop ordering, since it is something that will affect performance in other use cases as well, and the compiler usually cannot fix it for you. Loop unrolling you can leave to the compiler (but test it, as it does not always improve performance). As for the second point, since it is hardware dependent, you should not (in general) try to implement a very efficient matrix multiplication yourself; instead, consider using a library such as ATLAS, which can tune for your cache size, or a vendor library such as MKL or ACML.
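
As an illustration of that last suggestion, here is a minimal sketch of a fourth subroutine in the same style as the question's module, calling the BLAS routine sgemm. The name BlasMatrixMultiply is made up for this answer; it assumes a BLAS implementation such as MKL, OpenBLAS or ATLAS is linked in (e.g. with -lblas), and sgemm computes C := alpha*A*B + beta*C in single precision:

    subroutine BlasMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n,l
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        l=size(mtx1,dim=2)
        allocate(mtx3(m,n))
        ! C := 1.0*A*B + 0.0*C ; leading dimensions match the allocated shapes
        call sgemm('N','N',m,n,l,1.0,mtx1,m,mtx2,l,0.0,mtx3,m)
    end subroutine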
