perf stat does not count memory loads, but does count memory stores

Linux kernel: 4.10.0-20-generic (also tried this on 4.11.3)

Ubuntu: 17.04

I am trying to collect memory access statistics using perf stat. I can collect statistics for memory stores, but counting memory loads returns a value of 0.

Here are the details for memory stores:

perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

       158,115,510      cpu/mem-stores/u                                            

       0.559922797 seconds time elapsed

      

For memory loads, the counter reads 0, as shown below:

perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

                 0      cpu/mem-loads/u                                             

       0.563806170 seconds time elapsed

      

I don't understand why the loads are counted incorrectly. Do I need to use a different event to get the correct data?



2 answers


I used a Broadwell server (CPU E5-2620) to collect all of the events listed below.

To collect memory load events, I had to use the numeric (raw) event value. I ran the following command:

./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20

      

Here r81d0 is the raw event for counting "memory loads among all retired instructions". The "u" modifier restricts counting to user space.



The next command, on the other hand,

./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20

      

uses "r82d0:u", the raw event representing "memory stores among all retired instructions" in user space.
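As a rough illustration of how these raw codes are put together: on x86, perf reads rUUEE as a unit mask byte (UU) followed by an event-select byte (EE). The helper below, decode_raw_event, is a hypothetical name introduced only for this sketch. It splits a raw code and prints the equivalent named-field form that perf also accepts.

```shell
#!/bin/sh
# Split a perf raw event code of the form rUUEE into its unit mask (UU)
# and event-select (EE) bytes, then print the named-field equivalent.
# decode_raw_event is a hypothetical helper, not part of perf.
decode_raw_event() {
    raw=${1#r}                          # drop the leading "r"
    value=$((0x$raw))                   # parse the remaining hex digits
    umask=$(( (value >> 8) & 0xff ))    # high byte: unit mask
    event=$(( value & 0xff ))           # low byte: event select
    printf 'cpu/event=0x%02x,umask=0x%02x/\n' "$event" "$umask"
}

decode_raw_event r81d0   # the load event used above
decode_raw_event r82d0   # the store event used above
```

The printed form can be given to -e in place of the raw code, with the u modifier appended after the closing slash to stay in user space.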



The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring event on Intel processors. The MEM_TRANS_RETIRED.LOAD_LATENCY_* events are special and can only be counted using the p modifier. That is, you must specify mem-loads:p to perf to use the event correctly.

MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event, and it only makes sense to count it at the precise level. According to this Intel article (emphasis mine):

When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency is determined for each, and the correct event is incremented (latency >4, >8, >16, etc.). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any time.

As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not intended for that purpose at all.



If you want to determine which instructions in your code issue load requests that take more than a certain number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the correct performance event to use. In fact, this is precisely the purpose of perf-mem, and it achieves that purpose by using this event.
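For reference, a minimal sketch of a perf-mem session on the benchmark from the question. The threshold of 30 cycles is an arbitrary value chosen for illustration, and the guard simply skips the run where perf or the binary is not present. Load latency sampling generally needs suitable hardware support and permissions, so treat this as an outline rather than a recipe.

```shell
#!/bin/sh
# Sample loads slower than a latency threshold with perf mem.
LDLAT=30                          # assumed threshold in cycles, illustration only
PROG="./libquantum_base.arnab"    # benchmark binary from the question

if command -v perf >/dev/null 2>&1 && [ -x "$PROG" ]; then
    # Record memory access samples; --ldlat sets the load latency threshold.
    perf mem record --ldlat "$LDLAT" "$PROG" 100
    # Summarize the recorded samples from perf.data.
    perf mem report
else
    echo "perf or $PROG not available; skipping"
fi
```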

If you want to count the total number of loads, you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.

On the other hand, mem-stores and L1-dcache-stores are mapped to the same performance event on all current Intel processors, namely MEM_UOPS_RETIRED.ALL_STORES, which counts all stores.

So in summary, if you use perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count loads and stores, respectively. They map to the raw events you used in the answer you posted, but are more portable because they also work on AMD processors.
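Applied to the program from the question, that could look like the sketch below. The :u modifier keeps counting in user space, as in the original commands, and the guard only prints the command when perf or the binary is missing.

```shell
#!/bin/sh
# Count loads and stores in user space with the generic cache events.
EVENTS="L1-dcache-loads:u,L1-dcache-stores:u"
PROG="./libquantum_base.arnab"    # benchmark binary from the question

if command -v perf >/dev/null 2>&1 && [ -x "$PROG" ]; then
    perf stat -e "$EVENTS" "$PROG" 100
else
    echo "would run: perf stat -e $EVENTS $PROG 100"
fi
```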
