Does the hardware bundle multiple code operations into a single physical CPU operation?

I read a 2006 article about how processors operate on whole L1 cache lines even when you only need a small fraction of what the line contains (e.g. loading an entire L1 line just to write to a boolean variable is obviously wasteful). The article recommends optimizing through L1-cache-friendly memory management.

Let's say I have two int variables that just happen to be adjacent in memory, and in my code I write to both of them, one after the other.

Will the hardware combine my two write operations into a single physical operation on the same L1 line (assuming the processor has an L1 line large enough to hold both variables), or not?

Is there a way to hint this to the processor in C++ or C?

If the hardware doesn't do this kind of consolidation, do you think implementing it in code could improve performance? For example, allocating a block of memory the size of an L1 line and packing as many hot variables into it as possible?
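Roughly, this hypothetical sketch is the kind of thing I have in mind (the names and the 64-byte line size are just assumptions on my part):

```cpp
// Hypothetical sketch: pack the "hot" variables into one block that is
// assumed to be the size of an L1 cache line (64 bytes on many x86 CPUs).
struct HotData {
    int counter_a;   // written together...
    int counter_b;   // ...with this one
    int flags;
    int spare[13];   // pad out the rest of the assumed 64-byte line
};
static_assert(sizeof(HotData) == 64, "assumed 64-byte L1 line");

HotData hot;         // hope both writes below touch the same L1 line

void update() {
    hot.counter_a = 1;
    hot.counter_b = 2;
}
```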

+3




3 answers


The cache line size is primarily related to concurrency. It is the smallest block of data that can be synchronized across multiple processors.

Also, as you would imagine, the entire cache line has to be loaded even to operate on just a few bytes of it. However, if you perform multiple operations on the same processor, the line does not need to be reloaded every time; it is, in fact, cached, as the name suggests. This includes caching writes to the data. As long as only one processor accesses the data, you can usually be confident that it does so efficiently.



In cases where multiple processors access the data, it can be useful to align the data. Using the C++ alignas attribute or compiler extensions can help you get data structures aligned the way you want.
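For example, a minimal sketch; the 64-byte figure is an assumption about the cache line size (C++17's std::hardware_destructive_interference_size can be used instead where available):

```cpp
#include <atomic>

// Two counters updated by two different threads. Without alignment they
// could land in the same cache line and cause "false sharing": each write
// would force the line to bounce between the cores' caches.
struct Counters {
    alignas(64) std::atomic<long> a;  // assume 64-byte cache lines
    alignas(64) std::atomic<long> b;  // placed in its own line
};
```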

You might be interested in my article Processor Reordering - What's Really Reordered?, which gives some clues about what happens (at least logically) at a low level.

+5




This is a fairly broad question, but I'll try to cover the main points.

Yes, reading a whole cache line just to look at one bool is a bit wasteful - however, as a rule, the processor DOESN'T KNOW what you plan to do next, for example whether you need the next sequential value or not. You can rely on data in the same class or structure being adjacent to each other in memory, so storing data that you frequently use together in the same structure gives you that advantage.
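As a small illustration of that last point (the struct and field names are made up), keeping the values you touch together inside one small struct makes it likely they share a cache line:

```cpp
// Fields that are read and written together are kept adjacent, so one
// cache line fill typically brings in all of them.
struct Particle {
    float x, y, z;     // position, updated every frame together
    float vx, vy, vz;  // velocity, used in the same loop
};

void step(Particle* p, int n, float dt) {
    for (int i = 0; i < n; ++i) {   // sequential access = good locality
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}
```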

With regard to working with "more than one piece of data at the same time", most modern processors have various forms of extensions for performing the same operation on multiple items of data (SIMD - single instruction, multiple data). This started with MMX in the late 1990s and has since been extended with 3DNow!, SSE and AVX on x86. ARM has the "Neon" extension, which provides similar functionality. PowerPC also has something similar, whose name eludes me at the moment.



There is no way for a C or C++ program to directly control instruction selection or cache usage. But modern compilers, given the right options, will generate code that, for example, uses SIMD instructions to sum all the int values in a larger array, adding 4 elements at a time and then, when the whole batch is done, adding the 4 partial sums horizontally. Or, if you have a set of X, Y, Z coordinates, the compiler may use SIMD to add two such sets of data together. Making this choice is up to the compiler, but it can save a lot of time, so compiler optimizers keep being improved to find the cases where this helps and to use these kinds of instructions.
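A loop like the following is the kind an optimizer can usually vectorize on its own (a sketch; whether it actually does depends on the compiler and flags such as -O2/-O3 with GCC or Clang):

```cpp
#include <cstddef>

// Summing an int array: with optimization enabled, compilers commonly emit
// SIMD code that adds several elements per instruction and then reduces
// the partial sums horizontally at the end.
long long sum(const int* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```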

And finally, most larger modern processors (x86 since about 1995, ARM Cortex-A15, PowerPC) also do superscalar, out-of-order execution: they execute more than one instruction at a time, and the processor works out the dependencies between instructions and executes whichever ones are "ready" to run, not strictly in the order they were given to it. The compiler knows about this and tries to "help" by arranging the code so the processor has an easy job.
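One way source code can "help" in that sense (a sketch, not a guaranteed win) is to break long dependency chains so the out-of-order core has independent work to overlap, for example by using two accumulators:

```cpp
#include <cstddef>

// Two independent accumulators: the additions into acc0 and acc1 do not
// depend on each other, so a superscalar core can overlap them instead of
// waiting on one long add-after-add chain.
double sum2(const double* v, std::size_t n) {
    double acc0 = 0.0, acc1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += v[i];
        acc1 += v[i + 1];
    }
    if (i < n) acc0 += v[i];   // handle an odd trailing element
    return acc0 + acc1;
}
```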

+2




The whole point of caching is to allow many operations on localized memory to be performed quickly.

The fastest operations involve registers, of course. The only latency in using them is in fetching, decoding, and executing instructions. In some register-rich architectures (and in vector processors), they are effectively used as a dedicated cache. And all but the slowest processors have one or more levels of cache, which looks like ordinary memory to most instructions, except faster.

To simplify relative to real processors, consider a hypothetical machine that runs at 2 GHz (0.5 ns per clock), with memory that takes 5 ns to load an arbitrary 64-bit (8-byte) word, but only 1 ns to load each subsequent 64-bit word. (Assume writes behave similarly.) On such a machine, flipping a bit in memory is quite slow: 1 ns to load the instruction (only if it is not already in the pipeline, and 5 ns following a distant branch), 5 ns to load the word containing the bit, 0.5 ns to execute the instruction, and 5 ns to write the modified word back to memory. A memory copy does rather better: approximately zero to load instructions (since the pipeline presumably does the right thing with loops of instructions), 5 ns to load the first 8 bytes, 0.5 ns to execute an instruction, 5 ns to store the first 8 bytes, and then 1 + 0.5 + 1 ns for each additional 8 bytes. Locality makes life easier. But some operations can be pathological: incrementing each byte of an array does the initial 5 ns load, the 0.5 ns instruction, and the initial 5 ns store, but then 1 + 0.5 + 1 ns per byte (rather than per word) thereafter. (A memory copy that does not fall on the same word boundaries is also bad news.)

To speed this processor up, we can add a cache that reduces loads and stores to just 0.5 ns, within the instruction execution time, for data that is already in the cache. A memory copy does not improve on its reads, since it still costs 5 ns for the first 8 bytes and 1 ns per additional word, but its writes get much faster: 0.5 ns per word until the cache fills up, and at the normal 5 + 1 + 1 ns, etc. rates after it fills, in parallel with other work that uses memory less. Byte increments improve to 5 ns for the initial load, 0.5 + 0.5 ns for the instruction and write, and then 0.5 + 0.5 + 0.5 ns per additional byte, except during cache misses on reads or writes. Repeating the same few addresses more often increases the proportion of cache hits.
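To make the arithmetic above concrete, here is a toy cost model in C++ that encodes only the hypothetical numbers from this answer (5 ns for the first word, 1 ns per subsequent word, 0.5 ns per instruction, 0.5 ns per cached access) and compares a word-aligned memory copy with and without the cache; none of these figures describe real hardware:

```cpp
#include <cstdio>

// Toy model of the hypothetical 2 GHz machine described above.
// All figures are the made-up ones from the text, not real hardware data.
double copy_cost_no_cache(int words) {
    // 5 ns first load + 0.5 ns instruction + 5 ns first store,
    // then 1 + 0.5 + 1 ns for each additional 8-byte word.
    return 5.0 + 0.5 + 5.0 + (words - 1) * (1.0 + 0.5 + 1.0);
}

double copy_cost_with_cache(int words) {
    // Reads are unchanged (5 ns, then 1 ns per word), but each write now
    // costs only 0.5 ns while the cache still has room.
    return 5.0 + 0.5 + 0.5 + (words - 1) * (1.0 + 0.5 + 0.5);
}

int main() {
    const int sizes[] = {1, 8, 64};
    for (int w : sizes) {
        std::printf("%3d words: %6.1f ns uncached, %6.1f ns cached\n",
                    w, copy_cost_no_cache(w), copy_cost_with_cache(w));
    }
}
```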

What about real processors, multiple cache levels, and so on? The simple answer is that things get complicated. Writing cache-aware code involves trying to improve the locality of memory access, analysis to avoid thrashing the cache, and a lot of profiling.

+2








