Cache management

I've read about caching on multi-core systems and I'm wondering whether it is possible to control which pages are in the cache when programming in C/C++.

For example, I know we can use the built-in __builtin_prefetch function to move data into the cache to reduce cache misses and thus latency. I found it here: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Other-Builtins.html#Other-Builtins

I also found this for x86 and x86-64 Intel:

#include <xmmintrin.h>
enum _mm_hint
{
  _MM_HINT_T0 = 3,
  _MM_HINT_T1 = 2,
  _MM_HINT_T2 = 1,
  _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Here: http://lwn.net/Articles/255364/

What other functions can we use to get some "control" over the cache? For example, can we do anything about cache line replacement, or is that handled exclusively by the hardware/OS?

Thank you so much!


2 answers


Cache prefetch hints usually generate a special prefetch instruction that advises the prefetcher that this piece of memory will be needed in the near future. The prefetcher may (or may not) act on this advice. So in this sense, software prefetching is a hint, not real "cache control" or "cache management".

To the best of my knowledge, no widespread instruction set architecture provides instructions to evict a specific cache line or, say, to pin a specific portion of memory in the cache. The whole point of caches in most modern architectures is to be transparent to the programmer.

However, you can write cache-friendly programs. It is all about the spatial and temporal locality of both data and instructions:

  • Spatial locality with respect to data means that you should aim for sequential memory accesses that do not stray too far apart. This is the most natural thing to optimize for.

  • Spatial locality with respect to instructions means that jumps and branches should not travel too far through the code. Compilers and linkers should take care of this.

  • Temporal locality with respect to data means accessing the same memory locations (though perhaps not close to each other) within a single time slice. Determining how long that time slice should be can be tricky.

  • Temporal locality with respect to instructions means that even if the code jumps long distances, it jumps to the same places within the same chunk of time. This is generally very unintuitive and not very useful to optimize for.

Generally, you should optimize data locality, not so much instruction locality. In most performance-sensitive programs, the amount of data far exceeds the amount of code.



Also, as far as multiple cores are concerned, you should try to avoid false sharing and make good use of thread-local storage. Keep in mind that each CPU core has its own dedicated cache, and bouncing a cache line between core caches can have a very negative effect on performance.

To illustrate false sharing, consider the following code:

int counts[NUM_THREADS]; // global array; each thread writes only to its own slot

...

for (int i = 0; i < NUM_THREADS; ++i) {
    spawn_thread(thread_start);   // pseudocode: start NUM_THREADS workers
}

...

void thread_start(void)
{
    for (a_large_number_of_iterations) {   // pseudocode loop
        int some_condition = some_calculation();

        if (some_condition) {
            counts[THREAD_ID]++;   // frequent write to this thread's slot
        }
    }
}

Each thread updates its own element of the counts array at a high rate. The problem is that the array elements are contiguous, so large groups of them fall on the same cache line. With a typical cache line of 64 bytes and a typical int size of 4 bytes, one cache line holds 16 elements. When a core updates its 4-byte counter, it also invalidates the corresponding cache line in every other core, causing the line to bounce between cores even though the threads appear to use independent memory locations.



Two other cache operations are flushing and invalidation:

  • Flushing means writing updates pending in the cache back to RAM. A flush may or may not be followed by an invalidation.
  • Invalidation means marking cache lines as empty. An invalidation is rarely used without a preceding flush.


On x86, these operations are normally performed transparently by the caches as needed. They can also be triggered explicitly by the programmer with instructions such as CLFLUSH and WBINVD. See the relevant instruction set reference manuals.







