Linux perf reports cache misses for an unexpected instruction
I am trying to do some performance tuning on an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in a (naive and unoptimized) program, I use perf to record the number of cache misses. The snippet of code that matters is the following; it finds the unvisited node with the smallest distance:
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}
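For reference, the record/report steps described here look roughly like this (./dijkstra is a placeholder for the program's binary name, not from the original post):

perf record -e LLC-load-misses ./dijkstra
perf report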
The following assembly annotation is shown by perf report for the LLC-load-misses metric:
       │     for (int i = 0; i < count; i++) {
  1.19 │ ff:   add    $0x1,%eax
  0.03 │102:   cmp    0x20(%rsp),%eax
       │     ↓ jge    135
       │     if (!visited[i]) {
  0.07 │       movslq %eax,%rdx
       │       mov    0x18(%rsp),%rdi
  0.70 │       cmpb   $0x0,(%rdi,%rdx,1)
  0.53 │     ↑ jne    ff
       │     if (tmp == -1 || dist[i] < dist[tmp]) {
  0.07 │       cmp    $0xffffffff,%r13d
       │     ↑ je     fc
  0.96 │       mov    0x40(%rsp),%rcx
  0.08 │       movslq %r13d,%rsi
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
       │     tmp = i;
       │       mov    %eax,%r13d
       │     ↑ jmp    ff
       │     }
       │     }
       │     }
Now my question is: why is the jbe instruction generating so many cache misses? This instruction shouldn't fetch anything from memory at all, if I'm not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses with L1-dcache-load-misses shows many cache misses attributed to this instruction.
This puzzles me a bit. Can anyone explain this (in my opinion) strange result? Thanks in advance.
About your example:
There are several instructions around the hottest counter:
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
movsd loads a value from (%rcx,%rsi,8) (some array access) into xmm0, and ucomis loads another value from (%rcx,%rdx,8) and compares it with the value just loaded into xmm0. jbe is a conditional jump that depends on the result of that comparison.
Many modern Intel processors (and probably AMD ones too) can fuse certain combinations of operations into a single uop (realworldtech.com/nehalem/5); the cmp + conditional branch pair (CMP+JCC) is a very common instruction combination that gets macro-fused (you can test this with the Intel IACA simulation tool; use version 2.1 for your CPU). A fused pair may be reported incorrectly in perf/PMU/PEBS results, skewing most events toward one of the two instructions.
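For reference, IACA is driven by marker macros compiled into the code under test; below is a minimal sketch of the usual pattern (the header and macro names come from Intel's IACA distribution, and the function merely stands in for the loop from the question):

#include "iacaMarks.h"   /* ships with Intel IACA; defines IACA_START / IACA_END */

int find_min(const char *visited, const double *dist, int count)
{
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        IACA_START;      /* mark the top of the analyzed loop body */
        if (!visited[i] && (tmp == -1 || dist[i] < dist[tmp]))
            tmp = i;
    }
    IACA_END;            /* mark the end of the analyzed region, after the back edge */
    return tmp;
}

Compile this to an object file and run the iaca tool on it; the report shows how the marked region decodes into uops, including which cmp/jcc pairs get fused.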
This probably means that the expression dist[i] < dist[tmp] generates two memory accesses, and both values are used in the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp], or both, generate a lot of misses. Any such miss blocks ucomis from producing its result and blocks jbe from issuing the next instruction to execute (or from retiring the predicted instructions). So jbe may get all the glory of high counters instead of the real memory-access instructions (and for a "far" event like a cache miss, there is some skew toward the last blocked instruction).
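A side note of mine, not from the original answer: when an event supports PEBS, asking perf for precise sampling can reduce this kind of skid; whether the offcore-based LLC-load-misses supports it on Nehalem is uncertain, so perf may simply reject the modifier:

perf record -e LLC-load-misses:pp ./dijkstra   # :pp requests precise (PEBS) sampling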
You can try to merge the visited[N] and dist[N] arrays into one array[N] of struct { int visited; float dist; } to force prefetching of array[i].dist when accessing array[i].visited; or you can try to reorder the vertex accesses, renumber the graph vertices, or software-prefetch the next one or several elements.
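A minimal sketch of that layout change (struct and function names are illustrative; the answer suggests int/float fields, while the asm in the question hints at char visited[] and double dist[], so adjust the types to the real program):

struct node {
    int   visited;
    float dist;      /* now lives on the same cache line as visited */
};

/* Reading nodes[i].visited pulls nodes[i].dist into cache as a side effect,
   so the comparison no longer has to wait for a second miss. */
static int find_min_unvisited(const struct node *nodes, int count)
{
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        if (!nodes[i].visited) {
            if (tmp == -1 || nodes[i].dist < nodes[tmp].dist) {
                tmp = i;
            }
        }
    }
    return tmp;
}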
About generic perf events, their names, and possible problems with them:
perf (perf_events) on Linux uses a predefined set of events, listed by perf list, and some of the listed hardware events may not be implemented; others are mapped onto the current CPU's capabilities (and some mappings are not entirely correct). Some basic information about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).
For your Nehalem (Intel Core i5 750 with 8MB of L3 cache and no multi-CPU/multi-socket/NUMA support), perf maps the standard ("Generic cache events") LLC-load-misses event to OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS, as written in the best (and only) documentation on generic event mappings, the kernel source:
http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103
u64 nehalem_hw_cache_event_ids ...
 [ C(LL  ) ] = {
  [ C(OP_READ) ] = {
   /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
   [ C(RESULT_ACCESS) ] = 0x01b7,
   /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
   [ C(RESULT_MISS)   ] = 0x01b7,
...
/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD (1 << 0)
#define NHM_DMND_READ    (NHM_DMND_DATA_RD)
#define NHM_L3_MISS      (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
u64 nehalem_hw_cache_extra_regs ...
 [ C(LL  ) ] = {
  [ C(OP_READ) ] = {
   [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
   [ C(RESULT_MISS)   ] = NHM_DMND_READ|NHM_L3_MISS,
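As an aside (my assumption, not part of the kernel excerpt): on kernels that expose the offcore_rsp format attribute under /sys/bus/event_source/devices/cpu/format/, the same 0x01b7 event can also be requested by hand, with the response mask assembled from NHM_* bits like those above; 0xNNNN is a placeholder, not a real mask value:

perf stat -e cpu/event=0xb7,umask=0x01,offcore_rsp=0xNNNN/ ./dijkstra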
I think this event is not precise: the CPU pipeline will send the load request (non-blocking) to the cache hierarchy and go on executing other instructions. After some time (around 10 cycles to reach L2 and get an answer, around 40 cycles to reach L3), a response with the miss flag comes back to the corresponding (offcore?) PMU, which increments the counter. On counter overflow, a profiling interrupt is generated from this PMU. It takes several CPU cycles for the interrupt to reach the pipeline and stop it; the perf_events subsystem handler will handle this by registering the current (interrupted) EIP/RIP instruction pointer and resetting the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100,000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set the exact count, or let perf autotune the limit to get some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may not even be the IP of the instruction that wants to use the loaded data.
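For reference, the two sampling modes mentioned look like this (the binary name is a placeholder):

perf record -e LLC-load-misses -c 100000 ./dijkstra   # fixed period: one sample per 100000 misses
perf record -e LLC-load-misses -F 1000 ./dijkstra     # perf auto-adjusts the period for ~1000 samples/s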
But if your CPU is the only socket in the system and you access normal memory (not some PCI-Express mapped space), an L3 miss will in fact be served as a local memory access, and there are some counters for that... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").
There are several PMU event lists for your processor:
- Intel's official SDM (Intel® 64 and IA-32 Architectures Software Developer's Manual): https://software.intel.com/en-us/articles/intel-sdm Volume 3, Appendix A
- Vol. 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "18.8 PROCESSOR PERFORMANCE MONITORING BASED ON INTEL® MICROARCHITECTURE CODE NAME NEHALEM", from page 213 ("Vol. 3B 18-35")
- Vol. 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "19.8 - Processors Based on Intel® Microarchitecture Code Name Nehalem", from page 365 ("Vol. 3B 19-61")
- Some other volume for the offcore response encoding? Vol. 3A 18-26?
- from oprofile: http://oprofile.sourceforge.net/docs/intel-corei7-events.php
- from libpfm4's showevtinfo: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note that this page lists Sandy Bridge events; run libpfm4's showevtinfo on your PC to get your own list). There is also a check_events tool in libpfm4 to help encode an event as a raw event for perf.
- from the VTune documentation: http://www.hpc.ut.ee/dokumendid/ips_xe_2015/vtune_amplifier_xe/documentation/en/help/reference/pmw_sp/events/offcore_response.html
- from the Nehalem PMU programming guide: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf
- ocperf, a tool by Intel perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; the package downloads event descriptions, and any supported event name gets converted into the correct raw event encoding for perf (a usage sketch follows this list).
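A hedged ocperf usage sketch (the event name below is my assumption of the Nehalem spelling; check ocperf.py list for the exact name on your machine, and the binary name is a placeholder):

ocperf.py list | grep -i offcore
ocperf.py record -e offcore_response.any_data.any_llc_miss -c 100000 ./dijkstra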
There should be some information on the implementation of the ANY_LLC_MISS offcore PMU event and the PEBS event list for Nehalem, but I can't find it right now.
I can recommend that you use ocperf from https://github.com/andikleen/pmu-tools with any PMU event of your CPU without having to encode it manually. There are also several PEBS events in your CPU, and there is Load Latency / perf mem profiling for some kinds of memory-access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg26-30, still not integrated in 2015 - sowa pg19, ls /sys/devices/cpu/events). For newer processors there are newer tools, like ucevent.
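For reference, perf mem is invoked roughly like this (a sketch; the binary name is a placeholder, and the feature depends on PEBS load-latency support in the CPU and kernel):

perf mem record ./dijkstra    # sample memory accesses with address and latency info
perf mem report               # attribute the samples to individual loads/stores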
I can also advise you to try the cachegrind cache-model profiler, part of valgrind, together with the kcachegrind GUI for viewing profiles. Valgrind-based profilers can help you get a basic idea of how the code works: they collect exact instruction counts for every instruction, and cachegrind also simulates some abstract multi-level cache. But a real CPU executes several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU cycle gives some error, and the cachegrind cache model does not have the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which slow your program down 20-30 times compared to a native run.
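A typical cachegrind session looks like this (the binary name is a placeholder; valgrind writes its output to cachegrind.out.<pid>):

valgrind --tool=cachegrind ./dijkstra
kcachegrind cachegrind.out.<pid>     # or cg_annotate cachegrind.out.<pid> for a text report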
When you read a memory location, the processor will try to prefetch adjacent memory locations and cache them.
This works well if you are reading an array of objects that are all allocated in memory in contiguous blocks.
However, if, for example, you have an array of pointers to objects that live on the heap, it is less likely that you will be iterating over contiguous chunks of memory, unless you use some kind of custom allocator specifically designed for that.
Because of this, each dereference should be considered to carry some cost. An array of structures can be more efficient than an array of pointers to structures.
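A minimal sketch of the two layouts (hypothetical names, only to illustrate the point):

#include <stddef.h>

struct point { double x, y; };

/* Contiguous layout: iteration walks consecutive cache lines,
   so the hardware prefetcher can stay ahead of the loop. */
double sum_array(const struct point *pts, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i].x;
    return s;
}

/* Pointer layout: each element may live anywhere on the heap,
   so every iteration risks a cache miss on the dereference. */
double sum_pointers(struct point *const *pts, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i]->x;
    return s;
}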
Herb Sutter (a member of the C++ committee) talks about this in this talk: https://youtu.be/TJHgp1ugKGM?t=21m31s