Parallelization vs vectorization performance bottleneck: are AVX and MT competing?

Question


I tried to calculate the sum of all the elements in a large matrix. Here are the test cases:

(1) MT and AVX: 37 s
(2) MT, no AVX: 40 s
(3) AVX, no MT: 49 s
(4) Neither AVX nor MT: 105 s

In all cases, the processor clock is fixed at 3.0 GHz (as reported by cpufreq-info):

current policy: frequency should be within 1.60 GHz and 3.40 GHz.
                The governor "userspace" may decide which speed to use
                 within this range.
current CPU frequency is 3.00 GHz.

The matrix contains 25,000,000 elements of type double, each with the value 1.0, and the sum is computed 4096 times in a loop. Without AVX, the speedup from MT is 2.6. With AVX, it is only 1.3. When MT is used, the matrix is divided into 4 blocks, one per thread. If I decrease the CPU frequency, the MT improvement becomes greater for AVX, so there might be a cache miss issue, but that can't explain the difference between (4) / (2) and (3) / (1). Do AVX and MT compete with each other in some way? The chip is an i5-3570K.

+3

multithreading avx intel

user877329 18 jul. 15 at 12:28


2 answers

AVX shouldn't compete with MT; they are two different things. Although the idea of summing is simple, you can end up with very different numbers depending on your implementation. I suggest you use the STREAM benchmark to check performance, as it is a standard test. I can't see your code, but there are some possible problems:

You are initializing the matrix with 1.0 for all elements. I don't think this is a good idea. You should use random numbers, or at least initialize each element depending on its index (e.g. (i % 10) / 10.0).
How do you measure time? You have to put your timers outside the repetition loop and take the average over the repetitions. Also, are you using accurate timers?
Have you verified that your code is actually vectorized? Have you enabled the compiler flags that report this information? Are you sure you are running the AVX version of your code? Perhaps the compiler decided to use the scalar version.
You mentioned that the frequency is fixed. Are you sure that turbo mode is not enabled at any point?
What about joining the threads when measuring with MT?
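
To make the points above concrete, here is a minimal sketch of a measurement harness. It is not your code (which I can't see); the names make_matrix, threaded_sum, time_sum and Timing are made up for illustration. It fills the array with index-dependent values, keeps the timers outside the repetition loop, uses a monotonic clock, and joins the threads inside the timed region so that thread start-up and join costs are counted.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Fill the array with index-dependent values rather than a constant 1.0.
std::vector<double> make_matrix(std::size_t n) {
    std::vector<double> m(n);
    for (std::size_t i = 0; i < n; ++i) m[i] = static_cast<double>(i % 10) / 10.0;
    return m;
}

// Sum data[begin, end) and store the result in *out. This is a plain scalar
// loop; whether it gets vectorized depends on the compiler and its flags,
// so check the vectorization report or the generated assembly.
void partial_sum(const double* data, std::size_t begin, std::size_t end, double* out) {
    double s = 0.0;
    for (std::size_t i = begin; i < end; ++i) s += data[i];
    *out = s;
}

// Split the array into n_threads contiguous blocks, one per thread.
double threaded_sum(const std::vector<double>& m, unsigned n_threads) {
    if (n_threads == 0) n_threads = 1;  // avoid dividing by zero below
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t block = m.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = t * block;
        // The last thread also takes the remainder of the array.
        const std::size_t end = (t + 1 == n_threads) ? m.size() : begin + block;
        workers.emplace_back(partial_sum, m.data(), begin, end, &partial[t]);
    }
    for (auto& w : workers) w.join();  // joining happens inside the timed region
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}

struct Timing {
    double sum;              // average sum over all repetitions
    double seconds_per_rep;  // average wall-clock time of one repetition
};

// Timers sit outside the repetition loop; the result is averaged over reps.
Timing time_sum(const std::vector<double>& m, unsigned n_threads, int reps) {
    double sink = 0.0;  // accumulate results so the work cannot be optimized away
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) sink += threaded_sum(m, n_threads);
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return {sink / reps, seconds / reps};
}
```

Calling time_sum(make_matrix(25000000), 4, 4096) would mirror the setup in the question, while time_sum(m, 1, reps) gives the single-threaded baseline for comparison. Creating threads on every repetition is a deliberate simplification here; a real benchmark might keep a pool of threads alive instead.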

0

Salaheldin Saleh 04 Aug '15 at 9:14


Leeor · Accepted Answer · 2015-07-18T13:01:36+0000

It's possible that your baseline performance was limited by execution latency, and that either form of parallelization (MT or vectorization) let you break through that limit and reach the next bottleneck, which is your processor's memory bandwidth (BW).

Check the peak BW your CPU can achieve and compare it to your data. It looks like you are saturating at about 20.5 GB/s (25,000,000 elements * 4096 passes * 8 bytes per double, divided by ~40 seconds). That seems a bit low, as this link says it should be as high as 25 GB/s, but it is in the same ballpark, so the gap could be due to various inefficiencies such as the DDR type, other apps or the OS running in the background, frequency scaling performed by the processor to save power or reduce heat, etc.
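
The arithmetic behind that estimate can be written as a small helper. This is only a sketch: effective_bandwidth_gbs is a made-up name, 1 GB is taken as 10^9 bytes, and it assumes every pass actually streams the whole matrix from memory (the matrix is about 200 MB, far larger than the caches).

```cpp
#include <cstddef>

// Effective read bandwidth in GB/s (1 GB = 1e9 bytes) for summing
// `elements` doubles `passes` times in `seconds` of wall-clock time.
// Assumes each pass reads every element from memory exactly once.
double effective_bandwidth_gbs(double elements, double passes, double seconds) {
    const double bytes = elements * passes * static_cast<double>(sizeof(double));
    return bytes / seconds / 1e9;
}
```

For example, effective_bandwidth_gbs(25e6, 4096, 40.0) gives about 20.5 GB/s for the MT-only run. Plugging in the other timings from the question gives roughly 22.1 GB/s (37 s), 16.7 GB/s (49 s) and 7.8 GB/s (105 s). So both MT runs sit close to the same ceiling, which is what you would expect if memory bandwidth, not compute, is the limit.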

You can also try running some memory benchmarks (lmbench, Sandra, ...) and see whether they do any better in the same environment.
