Parallel nested loop with OpenMP

I have a relatively simple loop where I calculate the net acceleration of a particle system using the brute force method.

I have an OpenMP-parallelized loop that runs over each particle and compares it against all other particles, for O(n^2) complexity:

   !$omp parallel do private(i) shared(bodyArray, n) default(none)
   do i = 1, n
        !acc is real(real64), dimension(3)
        bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
   end do


which works great.

What I am trying to do now is to reduce the computation time by calculating each pairwise force only once, using the fact that F(a→b) = -F(b→a), which cuts the number of interactions to compute in half (n^2/2). This is what I do in this loop:

call clearAcceleration(bodyArray) !zero out acceleration

    !$omp parallel do private(i, j) shared(bodyArray, n) default(none)
       do i = 1, n
          do j = i, n

             if ( i /= j .and. j > i) then 
                bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
                bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
             end if

          end do
       end do


But I'm having a lot of difficulty parallelizing this loop; I keep getting junk results. I think it has something to do with this line:

bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc


and that the forces do not add up properly with all the different j's being written into it. I tried using the atomic construct, but that is not allowed for array variables. So then I tried critical, but that increases the run time by a factor of about 20 and still doesn't give correct results. I also tried adding an ordered clause, but then I just get NaN for all of my results. Is there an easy fix to get this loop working with OpenMP?
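The critical version was roughly this (from memory, so details may differ); the pairwise term sits in a temporary so the same value is added to one body and subtracted from the other:

   !$omp parallel do private(i, j, dacc) shared(bodyArray, n) default(none)
   do i = 1, n
      do j = i+1, n
         !dacc is real(real64), dimension(3)
         dacc = bodyArray(i)%accTo(bodyArray(j))
         !$omp critical
         bodyArray(i)%acc = bodyArray(i)%acc + dacc
         bodyArray(j)%acc = bodyArray(j)%acc - dacc
         !$omp end critical
      end do
   end do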

Working code: it has a slight speed improvement, but not the ~2x I was looking for.

!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
      do i = 1, n
         do j = 1, i-1
               forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
               forces(i, j)%vec = -forces(j, i)%vec
         end do
      end do

      !$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
      do i = 1, n
         do j = 1, n
            bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
         end do
      end do


1 answer


With your current approach and data structures you are going to struggle to get a good speedup with OpenMP. Consider the loop nest

!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
   do i = 1, n
      do j = i, n

         if ( i /= j .and. j > i) then 
            bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
            bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
         end if

      end do
   end do


[Actually, before you even consider that, rework the loop like this, dropping the redundant test:

!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
   do i = 1, n
      do j = i+1, n

            bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
            bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc

      end do
   end do


... OK, back to the problems.]

There are two problems here:

  • As you have found, you have a data race updating bodyArray(j)%acc; multiple threads will try to update the same element and there is no coordination of those updates. Hence the junk results. Using critical sections or ordered serializes the code; by the time you have it working correctly you will also have slowed it right back down to the speed it had before you started with OpenMP.

  • The pattern of access to the elements of bodyArray is not cache-friendly. I would not be surprised to find that, even if you fix the data race without serializing the computation, the cache-unfriendliness makes the code slower than the original. Modern processors are insanely fast at computation, but memory systems struggle to feed the beasts, so cache effects can be massive. Trying to run two loops simultaneously over the same rank-1 array, which is essentially what your code does, is never (?) going to move data through the cache at the fastest possible rate.

Personally, I would try the following. I won't guarantee that it will be faster, but it will be easier (I think) than fixing up your current approach, and it fits OpenMP like a glove. I have a nagging doubt that this is over-complicating matters, but I haven't had a better idea yet.



First, create a 2D array of reals, call it forces, where element forces(i,j) is the force that element i exerts on j. Then some code like this (untested, that's your responsibility if you want to follow this line):

forces = 0.0  ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none)
   do i = 1, n
      do j = 1, i-1
            forces(j,i) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand accTo correctly
            forces(i,j) = -forces(j,i)                     ! the mirrored half comes for free
      end do
   end do


and then we sum up the forces on each particle (check that I've got the following loop right, I haven't looked at it carefully):

!$omp parallel do private(i) shared(bodyArray,forces, n) default(none)
do i = 1, n
      bodyArray(i)%acc = sum(forces(:,i))
end do
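
One wrinkle: acc is real(real64), dimension(3), so forces needs room for the three components; a plain 2D array of scalars won't hold them. A rank-3 array is one possibility (my assumption, not something the text above pins down), with forces(:,i,j) holding the force that i exerts on j, so the sum runs over the second dimension; a small derived type with a %vec component, as in the questioner's working code, would do just as well:

   !assumed layout: forces(:, i, j) is the force body i exerts on body j
   real(real64), allocatable :: forces(:,:,:)
   allocate(forces(3, n, n))
   !fill forces(:, j, i) and forces(:, i, j) in the loop above, then
   bodyArray(i)%acc = sum(forces(:, :, i), dim=2)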


As I wrote above, computation is very fast, and if you have memory to spare it is often worth trading some space for time.

Now you probably have a load-balancing problem in the loop nest over forces. Most OpenMP implementations will do a static distribution of work by default (this is not required by the standard, but seems to be the most common; check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the tiny little rows at the top of the triangle you are computing. Thread 2 will get more work, thread 3 still more, and so on. You may get away with simply adding a schedule(dynamic) clause to the parallel directive, or you may have to work a bit harder to balance the load.
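
For example (just a sketch, reusing the triangular loop from above), the clause goes on the end of the directive:

!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(dynamic)
   do i = 1, n
      do j = 1, i-1
            forces(j,i) = bodyArray(i)%accTo(bodyArray(j))
            forces(i,j) = -forces(j,i)
      end do
   end do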

You might also want to review my code snippets for cache-friendliness and adjust them as necessary. And you may well find, if you do what I suggest, that you were actually better off with your original code, and that halving the number of computations doesn't save much time.

Another approach would be to pack the bottom (or top) triangle of forces into a rank-1 array and use some fancy index arithmetic to convert 2D (i,j) indices into 1D indices into that array. That would save storage space, and it might be easier to make cache-friendly.
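
For what it's worth, a sketch of that index arithmetic for the strict lower triangle (j < i), packed row by row into a 1-based array; untested, as above:

! rows 2..i-1 of the strict lower triangle occupy (i-1)*(i-2)/2 slots,
! so element (i,j) with j < i lands at this 1-based position
pure integer function packed_index(i, j) result(k)
   integer, intent(in) :: i, j
   k = (i-1)*(i-2)/2 + j
end function packed_index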
