____cacheline_aligned_in_smp for structures in the Linux kernel

Why do many structures in the Linux kernel use the macro ____cacheline_aligned_in_smp? Does this help improve performance when accessing the structure? If so, how?

+5




3 answers


Each cache line in any cache (dcache or icache) is 64 bytes (on x86). Cache-line alignment is needed to avoid false sharing of cache lines, which happens when a line is shared between global variables (common in the kernel). If one of those globals is modified by one processor in its cache, that cache line is marked dirty; in the other CPUs' caches the line becomes a stale entry that has to be invalidated and re-fetched from memory. This leads to cache-line misses, which cost extra CPU cycles and degrade system performance. Remember this applies to globals; many kernel data structures use this macro to avoid cache-line misses.
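A minimal sketch of the difference (the variable names are invented for illustration; this is not kernel code):

/* Hypothetical illustration of false sharing; not actual kernel code. */
#include <linux/cache.h>   /* ____cacheline_aligned_in_smp */

/* Bad: both counters are likely to land on the same 64-byte line, so a write
 * by CPU 0 to counter_a also invalidates CPU 1's cached copy of counter_b. */
static unsigned long counter_a;
static unsigned long counter_b;

/* Better: each counter starts on its own cache line (when CONFIG_SMP is set),
 * so updates from different CPUs do not invalidate each other's lines. */
static unsigned long aligned_a ____cacheline_aligned_in_smp;
static unsigned long aligned_b ____cacheline_aligned_in_smp;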



+2




____cacheline_aligned instructs the compiler to place a structure or variable at an address corresponding to the start of an L1 cache line for the particular architecture, i.e. to make it L1-cache-line aligned. ____cacheline_aligned_in_smp is similar, but the object is only cache-line aligned when the kernel is compiled in an SMP configuration (i.e. with the CONFIG_SMP option). Both are defined in include/linux/cache.h.
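For reference, the definitions look roughly like this (simplified; the exact guards and per-architecture values vary between kernel versions):

/* Simplified, in the spirit of include/linux/cache.h. */
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif

#ifndef ____cacheline_aligned_in_smp
#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif /* CONFIG_SMP */
#endif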

These definitions are useful for variables (and data structures) that are not allocated dynamically through some allocator, but are global variables laid out by the compiler (a similar effect can be achieved with dynamic memory allocators that can allocate memory at a specific alignment).
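For dynamically allocated objects, a sketch of that allocator-based equivalent might use a slab cache created with L1 cache-line alignment (the structure and cache name below are made up for illustration):

#include <linux/cache.h>    /* L1_CACHE_BYTES */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>     /* kmem_cache_create() */

struct my_obj {             /* hypothetical structure */
        long a;
        int b;
};

static struct kmem_cache *my_obj_cache;

static int __init my_obj_cache_init(void)
{
        /* Objects handed out by this cache start on L1 cache-line
         * boundaries, mirroring what ____cacheline_aligned_in_smp does
         * for compiler-allocated globals. */
        my_obj_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
                                         L1_CACHE_BYTES, 0, NULL);
        return my_obj_cache ? 0 : -ENOMEM;
}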

The reason for cache-line aligning such variables is to control how they are transferred between caches by the hardware cache-coherency mechanisms in SMP systems, so that they are not moved around implicitly whenever other variables are moved. This matters in performance-critical code where you expect contention on variables accessed by multiple processors (cores). The usual problem being avoided here is false sharing.

Having a variable's memory start at a cache-line boundary is only half the job; you also need to "pack together with it" only those variables that should move with it. An example is an array of per-CPU data, where each element of the array should be accessed by only one processor (core):

struct my_data {
   long int a;
   int b;
} ____cacheline_aligned_in_smp cpu_data[NR_CPUS];



Such a definition requires (in the kernel's SMP configuration) that each CPU's structure start at a cache-line boundary. The compiler implicitly adds padding after each structure so that the next one also starts at a cache-line boundary.
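One way to convince yourself of that padding (assuming CONFIG_SMP, since otherwise the macro expands to nothing) is a build-time check using the kernel's BUILD_BUG_ON idiom:

#include <linux/bug.h>      /* BUILD_BUG_ON() */
#include <linux/cache.h>    /* SMP_CACHE_BYTES */

static inline void check_my_data_layout(void)
{
        /* The aligned attribute on the type rounds sizeof(struct my_data)
         * up to a multiple of SMP_CACHE_BYTES, so consecutive cpu_data[]
         * elements never share a cache line (SMP configuration only). */
        BUILD_BUG_ON(sizeof(struct my_data) % SMP_CACHE_BYTES != 0);
}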

An alternative is to pad the data structure manually with a cache line's worth of dummy, unused bytes:

struct my_data {
   long int a;
   int b;
   char dummy[L1_CACHE_BYTES];
} cpu_data[NR_CPUS];

In this case, only the dummy, unused data will move around inadvertently; the data actually accessed by each processor will only move between cache and memory due to cache capacity misses.

+5




Linux manages the CPU cache in a very similar way to the TLB. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit locality of reference. To avoid having to fetch data from main memory for each reference, the CPU caches very small amounts of data in the CPU cache. There are often two levels, called the Level 1 and Level 2 caches. The Level 2 cache is larger but slower than the L1 cache, but Linux is mainly concerned with the Level 1, or L1, cache.

The CPU cache is organised into lines. Each line is typically quite small, commonly 32 bytes, and each line is aligned to its boundary; in other words, a 32-byte cache line will be aligned on a 32-byte address. In Linux, the line size, L1_CACHE_BYTES, is defined by each architecture.
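If you want to see the line size on a particular machine, a small userspace sketch can query it through glibc (sysconf may report 0 on systems where the value is not exposed):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* L1 data-cache line size as reported by glibc/the kernel. */
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 dcache line size: %ld bytes\n", line);
        return 0;
}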

How addresses are mapped to cache lines varies between architectures, but the mappings come under three headings: direct mapping, associative mapping, and set associative mapping. Direct mapping is the simplest approach, where each block of memory maps to only one possible cache line. With associative mapping, any block of memory may be mapped to any cache line. Set associative mapping is a hybrid approach where any block of memory may map to any line, but only within a subset of the available lines.
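As a rough sketch of how the indexing works (the sizes below are illustrative, not tied to any real CPU):

#define LINE_SIZE 32u       /* bytes per cache line (example value) */
#define NUM_LINES 1024u     /* total lines in the cache */
#define ASSOC     4u        /* ways per set, for the set associative case */

/* Direct mapped: each memory block has exactly one possible line. */
static unsigned int direct_mapped_line(unsigned long addr)
{
        return (addr / LINE_SIZE) % NUM_LINES;
}

/* Set associative: a block may go into any of the ASSOC ways of one set. */
static unsigned int set_index(unsigned long addr)
{
        return (addr / LINE_SIZE) % (NUM_LINES / ASSOC);
}

/* Fully associative: any block may occupy any line, so there is no index;
 * the whole address above the offset bits serves as the tag. */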

Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs some simple tricks to try to maximise cache usage:

  • Frequently accessed structure fields are placed at the start of the structure to increase the chance that only one line is needed to address the common fields;
  • Unrelated items in a structure should be at least cache-size bytes apart to avoid false sharing between CPUs (a sketch follows this list);
  • Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
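A sketch of what those guidelines look like in practice (the structure and field names are invented, not taken from the kernel):

#include <linux/cache.h>    /* ____cacheline_aligned_in_smp */

struct my_stats {
        /* Hot, frequently accessed fields first, so a single cache line is
         * likely to cover them. */
        unsigned long lookups;
        unsigned long hits;

        /* An unrelated field that other CPUs write often is pushed onto the
         * start of its own cache line so those writes do not invalidate the
         * line holding the hot fields above. */
        unsigned long remote_updates ____cacheline_aligned_in_smp;
};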

If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high, since a cache reference can usually be completed in under 10 ns, whereas a main-memory reference will typically cost between 100 ns and 200 ns. The main objective is to have as many cache hits and as few cache misses as possible.
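For a rough sense of scale (illustrative numbers, not measurements): with a 10 ns hit, a 150 ns miss and a 5% miss rate, the average reference costs about 0.95 × 10 + 0.05 × 150 = 17 ns, close to double the all-hit cost, which is why keeping the miss rate down matters so much.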

Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. Hooks are placed where the virtual-to-physical mapping changes, such as during a page-table update. The CPU cache flush must always take place first, as some processors require that the virtual-to-physical mapping exist when a virtual address is flushed from the cache.
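The hooks in question are architecture-provided functions; a few of the cache-flushing ones described in the kernel's cachetlb documentation look like this (prototypes only, for orientation; architectures with coherent caches define them as no-ops):

void flush_cache_mm(struct mm_struct *mm);
void flush_cache_range(struct vm_area_struct *vma,
                       unsigned long start, unsigned long end);
void flush_cache_page(struct vm_area_struct *vma,
                      unsigned long addr, unsigned long pfn);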

More information here

+1








