Measure L1 data cache misses with perf and PAPI

What is the difference between PAPI_L1_LDM in PAPI and L1-dcache-load-misses in perf?

I used the same settings as in this post.

So what I end up with for papi is:

PAPI_L1_DCM: 515 <- L1 data cache miss (probably L1D_READ_MISSES_ALL + L1D_READ_MISSES_RETRIED?)
PAPI_L1_ICM: 300 <- L1 Instruction cache miss
PAPI_L1_LDM: 441 <- L1 Load data miss
PAPI_L1_TCM: 815 <- L1 Total cache miss

      

Sorry, PAPI_L1_DCA is not supported on this computer.

And for perf (user space only, since PAPI measures only user space and not kernel space), called as:

    perf stat -B -e L1-dcache-load-misses:u,cache-misses:u ./perf

    16,539      L1-dcache-load-misses
       128      cache-misses:u  

      

16,539 seems more reasonable for N=1000000. What is the difference between load misses (PAPI_L1_LDM in PAPI) and data cache misses (PAPI_L1_DCM in PAPI), and why are these numbers different in PAPI and perf? Is cache-misses:u in perf related to L2 cache misses?

Edit: Hardware is a Xeon E5-2600 v3 family CPU (Haswell-EP, 12 cores).


1 answer


Some Explanation:

From the PAPI man page you can see that PAPI_L1_LDM = "number of load misses". In other words, PAPI_L1_LDM counts misses originating only from loads (and sometimes prefetches).

A load is when your program executes a load instruction to read from memory.

A prefetch is when the processor guesses that you are going to load memory in the near future and fetches it ahead of time.




In L1-dcache-load-misses:

  • L1 is the level 1 cache, the smallest and fastest. LLC, on the other hand, refers to the last level of the hierarchy, and thus denotes the largest but slowest cache.
  • i vs. d distinguishes the instruction cache from the data cache. Only L1 is split this way; the other caches are shared between data and instructions.



You seem to think that cache-misses:u in perf is related to L2 cache misses. In fact, this is not the case.

The event cache-misses counts memory accesses that could not be served by any cache (in practice this usually means last-level cache misses).

I admit that the perf documentation is not the best.

However, you can learn a lot about this by reading the perf_event_open() man page (provided you already have a good knowledge of how the processor and the performance monitoring unit work; it is clearly not a computer architecture course):

  • For example, after reading it, you will see that the event cache-misses shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
  • Next, you may find that L1-dcache-load-misses is a Hardware Cache Event, while cache-misses is a Hardware Event (which is a superset of the Hardware Cache events).
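To make the distinction between the two event classes concrete, here is a minimal sketch of how the two perf event names map onto the type and config fields described in the perf_event_open() man page. The numeric constants are copied from <linux/perf_event.h> purely for illustration, and describe() is a hypothetical helper, not part of perf or of any library.

```python
# Values mirror the enums in <linux/perf_event.h>; they are duplicated here
# only so the sketch is self-contained.
PERF_TYPE_HARDWARE = 0               # generic hardware events
PERF_TYPE_HW_CACHE = 3               # hardware cache events

PERF_COUNT_HW_CACHE_MISSES = 3       # the generic "cache-misses" event

PERF_COUNT_HW_CACHE_L1D = 0          # cache id: L1 data cache
PERF_COUNT_HW_CACHE_OP_READ = 0      # op id: loads
PERF_COUNT_HW_CACHE_RESULT_MISS = 1  # result id: miss


def hw_cache_config(cache_id, op_id, result_id):
    """Pack a hardware cache event as cache_id | (op_id << 8) | (result_id << 16)."""
    return cache_id | (op_id << 8) | (result_id << 16)


def describe(event_name):
    """Hypothetical helper: return (type, config) for the two events discussed."""
    events = {
        "cache-misses": (PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES),
        "L1-dcache-load-misses": (
            PERF_TYPE_HW_CACHE,
            hw_cache_config(
                PERF_COUNT_HW_CACHE_L1D,
                PERF_COUNT_HW_CACHE_OP_READ,
                PERF_COUNT_HW_CACHE_RESULT_MISS,
            ),
        ),
    }
    return events[event_name]


if __name__ == "__main__":
    for name in ("cache-misses", "L1-dcache-load-misses"):
        event_type, config = describe(name)
        print(f"{name}: type={event_type} config={config:#x}")
```

The two names therefore select different event types, which is one reason their counts need not be comparable; how each one is mapped to an actual hardware counter depends on the CPU and the kernel.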



As for the difference you observe, you can consult this post, which suggests increasing the size of your array by a factor of 100 or even 10000. The reason it gives: large fluctuations in the timing results were observed otherwise, and with a length of 1,000,000 the array almost fits into your L3 cache.
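To get a feel for the array sizes involved, the following rough sketch computes the memory footprint and the number of cache lines touched by one sequential pass over the array. The element size is not stated in the question, so both 4-byte and 8-byte elements are shown as assumptions, and the 64-byte line size is likewise an assumption (typical for x86). The footprint() helper is hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def footprint(n_elements, elem_bytes):
    """Return (total bytes, cache lines touched) for one sequential pass."""
    total = n_elements * elem_bytes
    lines = (total + CACHE_LINE_BYTES - 1) // CACHE_LINE_BYTES  # ceiling division
    return total, lines


if __name__ == "__main__":
    n = 1_000_000
    # Assumed element sizes and the scale factors mentioned in the answer.
    for elem_bytes in (4, 8):
        for scale in (1, 100, 10_000):
            total, lines = footprint(n * scale, elem_bytes)
            print(
                f"{elem_bytes}-byte elements, N x {scale}: "
                f"{total / 2**20:.1f} MiB, {lines} cache lines"
            )
```

Comparing the printed sizes with the L3 size of your own machine (for example as reported by lscpu) shows whether the array still fits in the last-level cache. The line count is only an upper bound on compulsory load misses for a single pass; hardware prefetching can make the measured number much lower.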
