Linux perf reports cache misses for an unexpected instruction
I am trying to do some performance tuning on an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in a (naive and unoptimized) program, I use perf to record the number of cache misses. The snippet of code that matters is the following; it finds the unvisited node with the smallest distance:
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}
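For reference, the record/report steps described here look roughly like this (./dijkstra is a placeholder for the program's binary name, not from the original post):

perf record -e LLC-load-misses ./dijkstra
perf report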
The following assembly annotation is shown by perf report for the LLC-load-misses metric:
       │     for (int i = 0; i < count; i++) {
  1.19 │ ff:   add    $0x1,%eax
  0.03 │102:   cmp    0x20(%rsp),%eax
       │     ↓ jge    135
       │     if (!visited[i]) {
  0.07 │       movslq %eax,%rdx
       │       mov    0x18(%rsp),%rdi
  0.70 │       cmpb   $0x0,(%rdi,%rdx,1)
  0.53 │     ↑ jne    ff
       │     if (tmp == -1 || dist[i] < dist[tmp]) {
  0.07 │       cmp    $0xffffffff,%r13d
       │     ↑ je     fc
  0.96 │       mov    0x40(%rsp),%rcx
  0.08 │       movslq %r13d,%rsi
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
       │     tmp = i;
       │       mov    %eax,%r13d
       │     ↑ jmp    ff
       │     }
       │     }
       │     }
Now my question is: why is the jbe instruction generating so many cache misses? This instruction shouldn't fetch anything from memory at all, if I'm not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses with L1-dcache-load-misses shows many cache misses attributed to this instruction.
This puzzles me a bit. Can anyone explain this (in my opinion) strange result? Thanks in advance.
About your example:
There are several instructions around the hottest counter:
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
movsd loads a value from (%rcx,%rsi,8) (some array access) into xmm0, and ucomis loads another value from (%rcx,%rdx,8) and compares it with the value just loaded into xmm0. jbe is a conditional jump that depends on the result of that comparison.
Many modern Intel processors (and probably AMD ones too) can fuse certain combinations of operations into a single uop (realworldtech.com/nehalem/5); the cmp + conditional branch pair (CMP+JCC) is a very common instruction combination that gets macro-fused (you can test this with the Intel IACA simulation tool; use version 2.1 for your CPU). A fused pair may be reported incorrectly in perf/PMU/PEBS results, skewing most events toward one of the two instructions.
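For reference, IACA is driven by marker macros compiled into the code under test; below is a minimal sketch of the usual pattern (the header and macro names come from Intel's IACA distribution, and the function merely stands in for the loop from the question):

#include "iacaMarks.h"   /* ships with Intel IACA; defines IACA_START / IACA_END */

int find_min(const char *visited, const double *dist, int count)
{
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        IACA_START;      /* mark the top of the analyzed loop body */
        if (!visited[i] && (tmp == -1 || dist[i] < dist[tmp]))
            tmp = i;
    }
    IACA_END;            /* mark the end of the analyzed region, after the back edge */
    return tmp;
}

Compile this to an object file and run the iaca tool on it; the report shows how the marked region decodes into uops, including which cmp/jcc pairs get fused.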
This probably means that the expression dist[i] < dist[tmp] generates two memory accesses, and both values are used in the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp], or both, generate a lot of misses. Any such miss blocks ucomis from producing its result and blocks jbe from issuing the next instruction to execute (or from retiring the predicted instructions). So jbe may get all the glory of high counters instead of the real memory-access instructions (and for a "far" event like a cache miss, there is some skew toward the last blocked instruction).
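A side note of mine, not from the original answer: when an event supports PEBS, asking perf for precise sampling can reduce this kind of skid; whether the offcore-based LLC-load-misses supports it on Nehalem is uncertain, so perf may simply reject the modifier:

perf record -e LLC-load-misses:pp ./dijkstra   # :pp requests precise (PEBS) sampling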
You can try to merge the visited[N] and dist[N] arrays into one array[N] of struct { int visited; float dist; } to force prefetching of array[i].dist when accessing array[i].visited; or you can try to reorder the vertex accesses, renumber the graph vertices, or software-prefetch the next one or several elements.
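A minimal sketch of that layout change (struct and function names are illustrative; the answer suggests int/float fields, while the asm in the question hints at char visited[] and double dist[], so adjust the types to the real program):

struct node {
    int   visited;
    float dist;      /* now lives on the same cache line as visited */
};

/* Reading nodes[i].visited pulls nodes[i].dist into cache as a side effect,
   so the comparison no longer has to wait for a second miss. */
static int find_min_unvisited(const struct node *nodes, int count)
{
    int tmp = -1;
    for (int i = 0; i < count; i++) {
        if (!nodes[i].visited) {
            if (tmp == -1 || nodes[i].dist < nodes[tmp].dist) {
                tmp = i;
            }
        }
    }
    return tmp;
}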
About generic perf events, their names, and possible problems with them:
perf (perf_events) on Linux uses a predefined set of events, listed by perf list, and some of the listed hardware events may not be implemented; others are mapped onto the current CPU's capabilities (and some mappings are not entirely correct). Some basic information about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).
For your Nehalem (Intel Core i5 750 with 8MB of L3 cache and no multi-CPU/multi-socket/NUMA support), perf maps the standard ("Generic cache events") LLC-load-misses event to OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS, as written in the best (and only) documentation on generic event mappings, the kernel source:
http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103
u64 nehalem_hw_cache_event_ids ...
 [ C(LL  ) ] = {
  [ C(OP_READ) ] = {
   /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
   [ C(RESULT_ACCESS) ] = 0x01b7,
   /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
   [ C(RESULT_MISS)   ] = 0x01b7,
...
/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD (1 << 0)
#define NHM_DMND_READ    (NHM_DMND_DATA_RD)
#define NHM_L3_MISS      (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
u64 nehalem_hw_cache_extra_regs ...
 [ C(LL  ) ] = {
  [ C(OP_READ) ] = {
   [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
   [ C(RESULT_MISS)   ] = NHM_DMND_READ|NHM_L3_MISS,
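As an aside (my assumption, not part of the kernel excerpt): on kernels that expose the offcore_rsp format attribute under /sys/bus/event_source/devices/cpu/format/, the same 0x01b7 event can also be requested by hand, with the response mask assembled from NHM_* bits like those above; 0xNNNN is a placeholder, not a real mask value:

perf stat -e cpu/event=0xb7,umask=0x01,offcore_rsp=0xNNNN/ ./dijkstra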
I think this event is not precise: the CPU pipeline will send the load request (non-blocking) to the cache hierarchy and go on executing other instructions. After some time (around 10 cycles to reach L2 and get an answer, around 40 cycles to reach L3), a response with the miss flag comes back to the corresponding (offcore?) PMU, which increments the counter. On counter overflow, a profiling interrupt is generated from this PMU. It takes several CPU cycles for the interrupt to reach the pipeline and stop it; the perf_events subsystem handler will handle this by registering the current (interrupted) EIP/RIP instruction pointer and resetting the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100,000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set the exact count, or let perf autotune the limit to get some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may not even be the IP of the instruction that wants to use the loaded data.
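For reference, the two sampling modes mentioned look like this (the binary name is a placeholder):

perf record -e LLC-load-misses -c 100000 ./dijkstra   # fixed period: one sample per 100000 misses
perf record -e LLC-load-misses -F 1000 ./dijkstra     # perf auto-adjusts the period for ~1000 samples/s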
But if your CPU is the only socket in the system and you access normal memory (not some PCI-Express mapped space), an L3 miss will in fact be served as a local memory access, and there are some counters for that... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").
There are several PMU event lists for your processor:
- Intel's official SDM (Intel® 64 and IA-32 Architectures Software Developer's Manual): https://software.intel.com/en-us/articles/intel-sdm Volume 3, Appendix A
- Vol. 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "18.8 PROCESSOR PERFORMANCE MONITORING BASED ON INTEL® MICROARCHITECTURE CODE NAME NEHALEM", from page 213 ("Vol. 3B 18-35")
- Vol. 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "19.8 - Processors Based on Intel® Microarchitecture Code Name Nehalem", from page 365 ("Vol. 3B 19-61")
- Some other volume for the offcore response encoding? Vol. 3A 18-26?
- from oprofile: http://oprofile.sourceforge.net/docs/intel-corei7-events.php
- from libpfm4's showevtinfo: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note that this page lists Sandy Bridge events; run libpfm4's showevtinfo on your PC to get your own list). There is also a check_events tool in libpfm4 to help encode an event as a raw event for perf.
- from the VTune documentation: http://www.hpc.ut.ee/dokumendid/ips_xe_2015/vtune_amplifier_xe/documentation/en/help/reference/pmw_sp/events/offcore_response.html
- from the Nehalem PMU programming guide: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf
- ocperf, a tool by Intel perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; the package downloads event descriptions, and any supported event name gets converted into the correct raw event encoding for perf (a usage sketch follows this list).
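A hedged ocperf usage sketch (the event name below is my assumption of the Nehalem spelling; check ocperf.py list for the exact name on your machine, and the binary name is a placeholder):

ocperf.py list | grep -i offcore
ocperf.py record -e offcore_response.any_data.any_llc_miss -c 100000 ./dijkstra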
There should be some information on the implementation of the ANY_LLC_MISS offcore PMU event and the PEBS event list for Nehalem, but I can't find it right now.
I can recommend that you use ocperf from https://github.com/andikleen/pmu-tools with any PMU event of your CPU without having to encode it manually. There are also several PEBS events in your CPU, and there is Load Latency / perf mem profiling for some kinds of memory-access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg26-30, still not integrated in 2015 - sowa pg19, ls /sys/devices/cpu/events). For newer processors there are newer tools, like ucevent.
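For reference, perf mem is invoked roughly like this (a sketch; the binary name is a placeholder, and the feature depends on PEBS load-latency support in the CPU and kernel):

perf mem record ./dijkstra    # sample memory accesses with address and latency info
perf mem report               # attribute the samples to individual loads/stores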
I can also advise you to try the cachegrind cache-model profiler, part of valgrind, together with the kcachegrind GUI for viewing profiles. Valgrind-based profilers can help you get a basic idea of how the code works: they collect exact instruction counts for every instruction, and cachegrind also simulates some abstract multi-level cache. But a real CPU executes several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU cycle gives some error, and the cachegrind cache model does not have the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which slow your program down 20-30 times compared to a native run.
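A typical cachegrind session looks like this (the binary name is a placeholder; valgrind writes its output to cachegrind.out.<pid>):

valgrind --tool=cachegrind ./dijkstra
kcachegrind cachegrind.out.<pid>     # or cg_annotate cachegrind.out.<pid> for a text report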
When you read a memory location, the processor will try to prefetch adjacent memory locations and cache them.
This works well if you are reading an array of objects that are all allocated in memory in contiguous blocks.
However, if, for example, you have an array of pointers to objects that live on the heap, it is less likely that you will be iterating over contiguous chunks of memory, unless you use some kind of custom allocator specifically designed for that.
Because of this, each dereference should be considered to carry some cost. An array of structures can be more efficient than an array of pointers to structures.
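A minimal sketch of the two layouts (hypothetical names, only to illustrate the point):

#include <stddef.h>

struct point { double x, y; };

/* Contiguous layout: iteration walks consecutive cache lines,
   so the hardware prefetcher can stay ahead of the loop. */
double sum_array(const struct point *pts, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i].x;
    return s;
}

/* Pointer layout: each element may live anywhere on the heap,
   so every iteration risks a cache miss on the dereference. */
double sum_pointers(struct point *const *pts, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pts[i]->x;
    return s;
}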
Herb Sutter (a member of the C++ committee) talks about this in this talk: https://youtu.be/TJHgp1ugKGM?t=21m31s