Hyperthreading having problems with AVX?

While overclocking and running stress tests, I noticed that the AVX-optimized version of LINPACK measures lower multithreaded floating-point throughput with Hyperthreading enabled than with it disabled. This was on an Ivy Bridge i7 (3770K). I also noticed that with Hyperthreading disabled, LINPACK produced higher core temperatures, even though I was running the CPU at a lower core voltage. All of this makes me think that without Hyperthreading, pipeline utilization is actually higher.

I'm curious: is this just something inherent to the LINPACK algorithm that causes pipeline stalls or inefficient scheduling with SMT, or does Intel's SMT implementation legitimately have pipeline-scheduling problems when both threads issue wide SIMD instructions? If so, is this something Haswell fixed, or something that will be addressed in future Intel architectures? And is AVX-512 likely to have the same problem?

Finally, are there any good practices to follow when programming with AVX on Intel systems that would avoid inefficient pipeline utilization with SMT?

1 answer


Hyperthreading partitions the out-of-order execution resources between two hardware threads, instead of giving all of them to one thread. If one thread can already keep the pipeline full on its own, you'd expect no speedup at best. Either way, the execution units should be chewing through close to 4 uops per clock of instructions that all have to be executed.

If each thread works on its own block of memory, the CPU core is then trying to juggle more live data at once. The competitive sharing of the L1/L2 caches means this can end up worse than without HT.
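
As a concrete illustration of that situation (a minimal hypothetical sketch, not LINPACK code): with HT enabled, the two threads below can be scheduled onto the same physical core, and their two private buffers then compete for that core's L1/L2.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread streams over its own private buffer. With HT, both threads
// can land on one physical core, so the combined working set (two buffers)
// competes for that core's L1/L2, evicting each other's lines -- which is
// how two HT threads can end up slower than one thread per core.
static double sum(const std::vector<double>& v)
{
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

int main()
{
    const std::size_t N = 1 << 18;  // hypothetical per-thread working set
    std::vector<double> a(N, 1.0), b(N, 2.0);
    double ra = 0.0, rb = 0.0;

    std::thread t1([&] { ra = sum(a); });
    std::thread t2([&] { rb = sum(b); });
    t1.join();
    t2.join();

    return (ra + rb > 0.0) ? 0 : 1;  // use the results so they aren't optimized away
}
```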

Additionally, some workloads have parallelization overhead. Only embarrassingly parallel problems (like computing many independent results, rather than parallelizing the computation of one big result) have negligible overhead for coordinating threads.

As Agner Fog says in his optimization guides, if any of the competitively shared or statically partitioned CPU resources is the bottleneck, hyperthreading won't help, and can hurt. It works well when code spends a lot of time stalled on branch mispredictions or cache misses, since the other thread can keep the otherwise-hungry pipeline fed instead of leaving it idle while waiting.

Matrix math has access patterns predictable enough that cache misses and branch mispredictions are rare, especially in code that's carefully blocked (tiled) for cache sizes.
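
To make "carefully blocked for cache sizes" concrete, here's a minimal sketch of a cache-blocked matrix multiply. The tile size is a hypothetical tuning parameter, and real LINPACK/BLAS kernels are far more elaborate, but the idea is the same: keep the hot tiles resident in cache so loads rarely miss and a single thread can keep the pipeline full.

```cpp
#include <cstddef>

// Cache-blocked (tiled) matrix multiply: C += A * B, all n x n, row-major.
// Working on BLOCK x BLOCK tiles keeps the hot data in L1/L2, so cache
// misses are rare -- exactly the situation where the second hyperthread
// has no stalls to hide.
constexpr std::size_t BLOCK = 64;  // hypothetical tile size; tune per cache level

void matmul_blocked(const double* A, const double* B, double* C, std::size_t n)
{
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                for (std::size_t i = ii; i < ii + BLOCK && i < n; ++i)
                    for (std::size_t k = kk; k < kk + BLOCK && k < n; ++k) {
                        const double a = A[i * n + k];  // reused across the j loop
                        for (std::size_t j = jj; j < jj + BLOCK && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```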

How to avoid HT not helping: make your code slow, so a single thread can't execute it efficiently enough to keep the pipeline full. >.< Seriously though: if there's an algorithm with cache misses or branch mispredictions that performs about the same as brute force on one thread, using it might help. For example, early-out tests can be nearly a wash given the overhead of a branch mispredict on a single thread, but can come out well ahead when your code is running on two HW threads of the same core, putting the brute-force approach at a disadvantage.
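
A hypothetical illustration of that trade-off (these functions are my own sketch, not from any library): the early-out version does less arithmetic but pays for a data-dependent branch, while the brute-force version computes every lane and blends, so one thread can keep the AVX units fed by itself.

```cpp
#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Early-out: only take the expensive square root where it's needed.
// On irregular data the branch mispredicts often, stalling one thread --
// stalls that a second HW thread on the same core could fill.
void root_early_out(float* x, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (x[i] > 1.0f)                  // data-dependent, hard to predict
            x[i] = std::sqrt(x[i]);
}

// Brute force: compute the root for every lane, then blend in the result
// only where the condition holds. More raw work, but branchless, so a
// single thread can keep the pipeline full on its own.
void root_brute_force(float* x, std::size_t n)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(x + i);
        __m256 r    = _mm256_sqrt_ps(v);
        __m256 mask = _mm256_cmp_ps(v, one, _CMP_GT_OQ);
        _mm256_storeu_ps(x + i, _mm256_blendv_ps(v, r, mask));
    }
    for (; i < n; ++i)                    // scalar tail
        if (x[i] > 1.0f) x[i] = std::sqrt(x[i]);
}
```

On a single thread the branchless version often wins outright; the point above is that with two HT threads per core, the early-out version's mispredict stalls get hidden by the sibling thread, shrinking or even reversing the gap.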
