Measuring L1 data cache misses with perf and PAPI
What is the difference between PAPI_L1_LDM in PAPI and L1-dcache-load-misses in perf?
I used the same settings as in this post.
What I end up with for PAPI is:
PAPI_L1_DCM: 515 <- L1 data cache miss (probably L1D_READ_MISSES_ALL + L1D_READ_MISSES_RETRIED?)
PAPI_L1_ICM: 300 <- L1 Instruction cache miss
PAPI_L1_LDM: 441 <- L1 Load data miss
PAPI_L1_TCM: 815 <- L1 Total cache miss
Sorry, PAPI_L1_DCA not supported on this computer.
And for perf (user space only, since PAPI measures only user space, not kernel space), the call is:
perf stat -B -e L1-dcache-load-misses:u,cache-misses:u ./perf
16,539 L1-dcache-load-misses
128 cache-misses:u
16,539 seems more reasonable for N=1000000. What is the difference between load misses (PAPI_L1_LDM in PAPI) and data cache misses (PAPI_L1_DCM in PAPI), and why are these numbers different in PAPI and perf? Is cache-misses:u in perf related to L2 cache misses?
Edit: Hardware: Xeon E5-2600 v3 family, Haswell-EP, 12 cores.
Some Explanation:
From the PAPI man page you can see that PAPI_L1_LDM = "Level 1 load misses". In other words, PAPI_L1_LDM counts only the misses that originate from loads (and sometimes from prefetches).
A load is when your program executes a load instruction to read from memory.
A prefetch is when the processor guesses that you are going to load some memory in the near future and fetches it ahead of time.
In L1-dcache-load-misses:
- L1 is the level 1 cache, the smallest and fastest. LLC, on the other hand, stands for the last-level cache of the hierarchy, that is, the largest but slowest cache.
- i vs. d distinguishes the instruction cache from the data cache. Only L1 is split in this way; the other caches are shared between data and instructions.
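The naming scheme described above can be made concrete with a small sketch. The regular expression and the `describe_event` helper below are hypothetical and written only for illustration; they cover the `L1-[id]cache-*` and `LLC-*` names and nothing else that `perf list` may print.

```python
import re

# Illustrative pattern for perf's generic cache event names, e.g.
# "L1-dcache-load-misses" or "LLC-loads". Other names shown by perf list
# (TLB or branch events, for instance) are deliberately not handled.
EVENT_RE = re.compile(
    r"^(?P<level>L1|LLC)-(?:(?P<kind>[id])cache-)?"
    r"(?P<op>load|store|prefetch)(?P<result>e?s|-misses)$"
)

def describe_event(name):
    """Split a perf cache event name into level, cache kind, operation, result."""
    m = EVENT_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised event name: {name}")
    # No i/d marker means the cache is shared between instructions and data.
    kind = {"i": "instruction", "d": "data", None: "unified"}[m.group("kind")]
    result = "miss" if m.group("result") == "-misses" else "access"
    return {
        "level": m.group("level"),
        "cache": kind,
        "op": m.group("op"),
        "result": result,
    }

print(describe_event("L1-dcache-load-misses"))
print(describe_event("LLC-loads"))
```

Read this way, L1-dcache-load-misses counts loads that missed the level 1 data cache, while an LLC event carries no i/d marker because that cache is unified.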
You seem to think that cache-misses:u in perf is related to L2 cache misses. In fact, this is not the case.
The cache-misses event is the number of memory accesses that could not be served by any cache.
I admit that perf's documentation is not the best.
However, you can learn a lot about it by reading the documentation of the perf_event_open() function, provided you already have a good knowledge of how a processor and its performance monitoring unit work (it is clearly not a computer architecture course):
- For example, after reading it, you will see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
- Next, you may find that L1-dcache-load-misses is a Hardware Cache Event and cache-misses a Hardware Event (which is a superset of the Hardware Cache Event).
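To illustrate the two event classes, here is a minimal sketch of how each one would be described to perf_event_open() as a (type, config) pair. The numeric constants mirror `<linux/perf_event.h>` and the packing rule comes from the perf_event_open man page; the helper `hw_cache_config` is a hypothetical name used only here, and nothing in this sketch actually opens a counter.

```python
# Values copied from <linux/perf_event.h>; see perf_event_open(2).
PERF_TYPE_HARDWARE = 0           # generic "Hardware Event"
PERF_TYPE_HW_CACHE = 3           # "Hardware Cache Event"

PERF_COUNT_HW_CACHE_MISSES = 3   # what perf list calls cache-misses

PERF_COUNT_HW_CACHE_L1D = 0          # level 1 data cache
PERF_COUNT_HW_CACHE_OP_READ = 0      # loads
PERF_COUNT_HW_CACHE_RESULT_MISS = 1  # count misses rather than accesses

def hw_cache_config(cache_id, op_id, result_id):
    """Pack a hardware cache event into the perf_event_attr.config field."""
    return cache_id | (op_id << 8) | (result_id << 16)

# cache-misses: a plain hardware event, config is just the event id.
cache_misses = (PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES)

# L1-dcache-load-misses: a hardware cache event, config encodes
# which cache, which operation and which result to count.
l1d_load_misses = (
    PERF_TYPE_HW_CACHE,
    hw_cache_config(
        PERF_COUNT_HW_CACHE_L1D,
        PERF_COUNT_HW_CACHE_OP_READ,
        PERF_COUNT_HW_CACHE_RESULT_MISS,
    ),
)

print("cache-misses           type=%d config=%#x" % cache_misses)
print("L1-dcache-load-misses  type=%d config=%#x" % l1d_load_misses)
```

The two events differ in their type field as well as their config value, which is why their counts are not expected to match.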
As for the difference between your numbers, you can consult this answer, which suggests increasing the size of your array by a factor of 100 or even 10,000, because, as it says: "I noticed big fluctuations in timing results otherwise, and with a length of 1,000,000 the array almost fits into your L3 cache."
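A rough back-of-the-envelope check of that advice might look like the sketch below. The element size (8 bytes, as for a C double) and the 30 MiB L3 size are assumptions chosen for illustration and are not taken from the question; only the 64-byte cache line is a fixed property of this processor family.

```python
def array_footprint(n_elements, element_bytes, line_bytes=64):
    """Return (bytes, cache lines) touched by one sequential pass over an array."""
    total = n_elements * element_bytes
    lines = -(-total // line_bytes)  # ceiling division
    return total, lines

# Assumed values, not given in the question:
ELEMENT_BYTES = 8              # e.g. a C double
L3_BYTES = 30 * 1024 * 1024    # hypothetical shared L3 size

for n in (1_000_000, 100_000_000):
    total, lines = array_footprint(n, ELEMENT_BYTES)
    print(f"N={n}: {total / 2**20:.1f} MiB, {lines} cache lines, "
          f"fits in assumed L3: {total <= L3_BYTES}")
```

Under these assumptions the N=1000000 array stays within the last-level cache, whereas an array 100 times larger cannot, so misses caused by the array itself would dominate the startup noise.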