Why is the runtime different for these Fortran 95 loop orderings?

I have a sample program for doing matrix operations in Fortran, and Fortran stores matrices column by column. Is this the reason for such a significant difference in execution time between the two array operations below? If so, can someone explain why this happens and what exactly causes such a big difference in runtime?

I am using Ubuntu 14.04 with GNU Fortran 4.8.4.

code:

program main
implicit none

integer :: i,j
real :: start, finish
real,dimension(256,256) :: arr1

!Row-wise traversal (inner loop over the second index): set elements to 0
call cpu_time(start)
do i=1,255,1
    do j=1,255,1
        arr1(i,j)=0
    end do
end do
call cpu_time(finish)

write(*,100) 'Execution time arr(i,j) in seconds= ', (finish-start)
100 format (A,f12.9)

!Column-wise traversal (inner loop over the first index): set elements to 1
call cpu_time(start)
do j=1,255,1
    do i=1,255,1
        arr1(i,j)=1
    end do
end do
call cpu_time(finish)

write(*,100) 'Execution time arr(j,i) in seconds = ', (finish-start)

end program 


Compile:

gfortran main.f95 -o main 


Output:

Execution time arr(i,j) in seconds=  0.000736000
Execution time arr(j,i) in seconds =  0.000164000


The first method takes about 4.5 times the execution time compared to the second method.

Edit: I'm more interested in knowing why there is such a big difference in runtime (what happens at the compiler, CPU, or memory level when the loop ordering changes) rather than in just adding the -O3 flag or otherwise optimizing the code. There is an answer to this do-loop optimization question that says it is best to loop over the columns. Why is that?

2 answers


First of all, your test is highly biased: to see the bias, swap the order of the two blocks you are timing and the results will change. For a fair test, you should:

  • write two different programs: one for each case;
  • run each program several times and take the average time;

You can also replace the second step with a repetition loop inside a single program, depending on what you are interested in; a minimal sketch of that variant follows.
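
A minimal sketch of that variant, timing each ordering many times and averaging (the program name and the repeat count nrep = 100 are my own choices, not taken from the question):

program loop_bench
implicit none

integer, parameter :: n = 256, nrep = 100
integer :: i, j, rep
real :: start, finish, t_rowwise, t_colwise
real, dimension(n,n) :: arr1

t_rowwise = 0.0
t_colwise = 0.0

! Touch the array once, so the first timed block does not pay
! one-off costs (first touch, page faults) that the second avoids.
arr1 = 0.0

do rep = 1, nrep
    ! Row-wise traversal: inner loop over the second (non-contiguous) index.
    call cpu_time(start)
    do i = 1, n
        do j = 1, n
            arr1(i,j) = 0.0
        end do
    end do
    call cpu_time(finish)
    t_rowwise = t_rowwise + (finish - start)

    ! Column-wise traversal: inner loop over the first (contiguous) index.
    call cpu_time(start)
    do j = 1, n
        do i = 1, n
            arr1(i,j) = 1.0
        end do
    end do
    call cpu_time(finish)
    t_colwise = t_colwise + (finish - start)
end do

write(*,'(A,f12.9)') 'Average time arr(i,j) loop in seconds = ', t_rowwise / nrep
write(*,'(A,f12.9)') 'Average time arr(j,i) loop in seconds = ', t_colwise / nrep

end program loop_bench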

Now, returning to your concern: let me first say, as Franssalcus mentioned, that the question is quite broad. To keep the story short, computer memory is organized as a hierarchy:

  • RAM;
  • Cache (there can be many levels, for simplicity we will consider one level);
  • Registers

Each level has its own advantages and disadvantages:

  • RAM can be large (gigabytes) but is very slow (on the order of tens of nanoseconds per access, say 50 to 70);
  • the cache is not very large (from a few kilobytes to several megabytes) but is faster than RAM (a few nanoseconds, roughly 0.5 to 2.5);
  • registers are tiny (only a few tens of bytes in total) and very fast.

For more information see this link. I have ignored disks, which form yet another layer of memory, as well as the network.

Typically, data is transferred only between adjacent memory levels: from RAM to cache and from cache back to RAM, from cache to a register and from a register back to cache. The CPU itself operates only on registers, which have the fastest access. So for each operation, data is brought from RAM up into a register and, once the computation is done, is sent back down to RAM. Well, not quite so fast. To keep things simple, let's say the processor operates on bytes (if you dig deeper, you will meet the concept of words, which are groups of contiguous bytes, and of pages, which are groups of contiguous words).

When you access a byte that is not yet in the cache, a cache miss occurs: the byte first travels from RAM into the cache and then into a register for your operation. But when the system fetches that byte from RAM into the cache, it brings a whole group of contiguous bytes along with it (a cache line). So if the next operation is on a neighbouring byte, there is no need to go back to RAM at all.
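
To put rough numbers on this, assuming 4-byte default reals and a 64-byte cache line (typical values, but stated nowhere in the question):

64 bytes per line / 4 bytes per real = 16 consecutive array elements per cache line

inner loop over i (contiguous):      consecutive accesses are 4 bytes apart,
                                     so roughly one trip to RAM per 16 elements
inner loop over j (non-contiguous):  consecutive accesses are 256*4 = 1024 bytes apart,
                                     so every access lands on a different cache line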



What is happening in your program is this: Fortran stores arrays column by column, which means the elements are laid out in memory in the following order:

a(1,1) a(2,1) a(3,1) ... a(M,1) a(1,2) a(2,2) a(3,2) ... a(M,2) ...
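
A small self-contained illustration of that ordering, using RESHAPE, which fills its result in exactly this array-element (column-major) order (the program name and the 2x3 example shape are my own choices):

program colmajor_demo
implicit none

integer :: k
integer, dimension(6) :: flat
integer, dimension(2,3) :: a

flat = (/ (k, k = 1, 6) /)        ! flat = 1 2 3 4 5 6

! RESHAPE fills a in array-element order, i.e. column by column:
! a(1,1)=1, a(2,1)=2, a(1,2)=3, a(2,2)=4, a(1,3)=5, a(2,3)=6
a = reshape(flat, (/ 2, 3 /))

write(*,*) a(:,1)   ! first column: 1 2   (adjacent in memory)
write(*,*) a(1,:)   ! first row:    1 3 5 (strided in memory)

end program colmajor_demo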


So the loop

do j=1,255,1
    do i=1,255,1
        arr1(i,j)=1
    end do
end do


accesses the elements in the order in which they are stored in memory; the number of trips between RAM and cache is therefore kept to a minimum.

The other loop

do i=1,255,1
    do j=1,255,1
        arr1(i,j)=1
    end do
end do


simply does not access the elements in the order in which they are stored. For example, if your cache cannot hold all the data touched by one pass of the inner loop, the cache has to be refilled again and again. And refilling the cache is not free: before a modified cache line can be evicted, the system must first write it back to RAM, which is exactly what happens here since you are writing to the array. To see this, increase the size of the matrix towards the largest size your RAM can handle and you will see what it means to ignore the storage order: the gap will widen. You can go up gradually, 1000x1000, then 10000x10000, and so on (see the sketch below). Once the cache can hold only one column or less, you will get a ratio close to the ratio between RAM access time and cache access time, which, remember, is more than 10.
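
A sketch of that experiment, with the matrix size read at run time and the array made allocatable so a large matrix does not overflow the stack (the program name and output formats are my own choices):

program scaling_test
implicit none

integer :: i, j, n
real :: start, finish
real, allocatable, dimension(:,:) :: arr

! Read the matrix size at run time, e.g. 1000, then 10000, ...
read(*,*) n
allocate(arr(n,n))

! Non-contiguous inner loop (over the second index).
call cpu_time(start)
do i = 1, n
    do j = 1, n
        arr(i,j) = 0.0
    end do
end do
call cpu_time(finish)
write(*,'(A,f12.6)') 'arr(i,j) order, seconds = ', finish - start

! Contiguous inner loop (over the first index).
call cpu_time(start)
do j = 1, n
    do i = 1, n
        arr(i,j) = 1.0
    end do
end do
call cpu_time(finish)
write(*,'(A,f12.6)') 'arr(j,i) order, seconds = ', finish - start

deallocate(arr)
end program scaling_test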

Memory hierarchies are the subject of entire computer science courses; I have only given you a quick overview here.



For the CPU to work on data, the data must first be read from RAM into its cache. Reading one byte takes about as long as reading quite a lot of consecutive bytes.



If your inner loop runs over a non-contiguous dimension, the CPU must read and write each individual value from and to RAM independently. If your inner loop runs over the contiguous dimension, it can read many values in one go and then work with them in the cache.
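
As an aside that neither answer mentions: if the goal is only to set every element, Fortran's whole-array assignment expresses the operation without fixing a loop order at all, leaving the compiler free to traverse memory contiguously. A minimal sketch:

program whole_array
implicit none

real, dimension(256,256) :: arr1

! Whole-array assignment: the compiler chooses the traversal order,
! which will be the contiguous (column-wise) one.
arr1 = 1.0

write(*,*) arr1(1,1), arr1(256,256)

end program whole_array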


