How do I use hardware performance counters in aarch64 assembly?

I am trying to run some code built for ARM architectures. In this particular case, the target is aarch64-unknown-linux-gnu. I would like to count cycles for individual sections of code, doing multiple runs to take the minimum time and reduce variance.

I don't have direct access to ARM hardware, so I am trying to run my code under QEMU.

For x86 / x86_64, I use the rdtsc and rdtscp instructions to read the cycle count.

For aarch64, I thought I could use

let clocks: u64;
asm!("mrs $0, pmccntr_el0" : "=r" (clocks) ::: "volatile");

      

But when I run

qemu-aarch64 -L /usr/aarch64-linux-gnu myprogram

      

I get

qemu: uncaught target signal 4 (Illegal instruction) - core dumped

      

I thought that maybe setting some bits in the pmcr_el0 register was required, but even reading it with

let pmcr: u32;
asm!("mrs $0, pmcr_el0" : "=r" (pmcr) ::: "volatile");

      

gives the same Illegal instruction error.

It seems to me that these are privileged instructions that have to be enabled for me, but I couldn't find any documentation on how to do this with QEMU.

So, is there a way to access the hardware performance counters in QEMU? Is there some other way to count cycles? I would like it to match the x86 code as closely as possible.

1 answer


It looks like you have not set the enable bits in the PMUSERENR_EL0 register, which controls user-mode (EL0) access to the performance monitor registers.

For details on using the Performance Monitors Extension, see chapter D6 of the ARMv8 Architecture Reference Manual.

Please note that QEMU is not the right tool for code profiling and optimization.

QEMU's primary goal is emulation speed (> 40 MIPS), and it provides a robust, functionally correct model of the architecture for OS development. It therefore has no need to model the ARMv8 performance monitors accurately, and the current implementation is quite abstract and minimal: there is nothing but an inaccurate model of the PMCCNTR cycle counter, and there is no performance monitoring event infrastructure at all.

It is better to use the regular physical counter to measure time intervals:

mrs x0, cntpct_el0
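
As a rough sketch, this is one way a generic timer counter could be read from Rust and used for the minimum-of-several-runs measurement described in the question. Several things here are assumptions of the sketch rather than part of the answer: it reads CNTVCT_EL0 (the virtual counter, which Linux normally leaves readable from user space) instead of CNTPCT_EL0, it uses the current `core::arch::asm!` syntax, it falls back to `std::time::Instant` on non-aarch64 hosts, and the helper names `read_ticks` and `min_ticks` are hypothetical.

```rust
/// Reads a monotonically increasing tick count (hypothetical helper).
/// On aarch64 this reads the generic timer's virtual counter, CNTVCT_EL0.
#[cfg(target_arch = "aarch64")]
fn read_ticks() -> u64 {
    let ticks: u64;
    // SAFETY: `mrs` only copies a system register into a general register.
    unsafe {
        core::arch::asm!("mrs {t}, cntvct_el0", t = out(reg) ticks, options(nostack));
    }
    ticks
}

/// Fallback for other hosts (an assumption of this sketch):
/// nanoseconds elapsed since the first call.
#[cfg(not(target_arch = "aarch64"))]
fn read_ticks() -> u64 {
    use std::sync::OnceLock;
    use std::time::Instant;
    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

/// Runs `work` `runs` times and returns the smallest elapsed tick count.
fn min_ticks<F: FnMut()>(runs: u32, mut work: F) -> u64 {
    let mut best = u64::MAX;
    for _ in 0..runs {
        let start = read_ticks();
        work();
        let end = read_ticks();
        best = best.min(end.wrapping_sub(start));
    }
    best
}

fn main() {
    let best = min_ticks(10, || {
        let sum: u64 = (0..10_000u64).sum();
        // Keep the compiler from optimizing the work away.
        std::hint::black_box(sum);
    });
    println!("minimum elapsed ticks over 10 runs: {best}");
}
```

The ticks are in units of the counter frequency, not CPU cycles, so they are comparable between runs on the same machine but not directly to rdtsc values.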

      

To understand why cycle counting on QEMU is meaningless, consider that QEMU is a functional model and is based on some assumptions:

1) All instructions are executed sequentially, one after another, and each one takes the same amount of time:

 1 guest instruction counter tick = 1 emulated nanosecond << icount_time_shift

      

icount_time_shift is set by the "-icount" command-line option and defaults to 3, so one emulated guest instruction takes 8 emulated nanoseconds.
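
The conversion above is a plain left shift. A minimal sketch of the arithmetic, where the function name `insns_to_ns` is made up for illustration and is not a QEMU API:

```rust
/// Emulated nanoseconds consumed by `insns` guest instructions under the
/// fixed-cost model: every instruction costs (1 << icount_time_shift) ns.
fn insns_to_ns(insns: u64, icount_time_shift: u32) -> u64 {
    insns << icount_time_shift
}

fn main() {
    let shift = 3; // the default mentioned above
    println!("{}", insns_to_ns(1, shift)); // prints 8
    println!("{}", insns_to_ns(10, shift)); // prints 80
    println!("{}", insns_to_ns(100, shift)); // prints 800
}
```

Note that the cost depends only on the number of instructions, not on what they do.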



This strict conversion between the instruction counter and nanoseconds is a key concept of QEMU's dynamic guest code translation mechanism. It allows translation blocks (TBs) to be generated deterministically: the peripheral models, which are driven by nanosecond time, are tied to TB execution, which is driven by the instruction counter ...

For example, you execute 10 guest instructions as one TB and then advance the peripheral clock by 80 ns. Conversely, the peripherals can tell the TB execution loop that no event is expected for the next 800 ns, so the next 100 instructions can be executed as one TB.

2) The emulated nanosecond is the base unit of time and provides the time quantum in QEMU; all other guest counters are derived from it by some integer scale factor:

For example, the current QEMU implementation hardcodes the frequency of the ARM physical system counter (CNTPCT) to 62 MHz. Then

scale_factor = 10^9 / (62 *10^6) = 16, (division is integer)

      

i.e. QEMU increments CNTPCT by one for every 16 emulated nanoseconds. QEMU's ARMv8 Generic Timer model is based on this scale.
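
The scaling can be written out as follows. This is only a sketch of the arithmetic using the 62 MHz figure quoted above; the names `scale_factor` and `ns_to_ticks` are hypothetical and do not come from the QEMU source:

```rust
const NANOS_PER_SEC: u64 = 1_000_000_000;

/// Emulated nanoseconds per counter tick (integer division, as above).
fn scale_factor(counter_hz: u64) -> u64 {
    NANOS_PER_SEC / counter_hz
}

/// Counter value after `ns` emulated nanoseconds.
fn ns_to_ticks(ns: u64, counter_hz: u64) -> u64 {
    ns / scale_factor(counter_hz)
}

fn main() {
    let hz = 62_000_000; // frequency quoted in the answer
    println!("{}", scale_factor(hz)); // prints 16
    println!("{}", ns_to_ticks(80, hz)); // prints 5
    println!("{}", ns_to_ticks(800, hz)); // prints 50
}
```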

QEMU likewise implements PMCCNTR as a counter with some integer scale.

In QEMU, you can count the instructions in your guest program by hand, multiply that count by some constant, and I claim the result will equal the value your guest code measures at run time on QEMU.
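
To make that claim concrete, here is a small sketch of the prediction under the fixed-cost model described above. The struct `FixedCostModel`, its fields, and the example numbers (shift of 3, 16 ns per tick, one million instructions) are illustrative assumptions, not QEMU internals:

```rust
/// Hypothetical model: every guest instruction costs the same emulated time,
/// and the counter advances once per `ns_per_tick` emulated nanoseconds.
struct FixedCostModel {
    icount_time_shift: u32,
    ns_per_tick: u64,
}

impl FixedCostModel {
    /// Counter delta predicted from nothing but the instruction count.
    fn predicted_ticks(&self, guest_insns: u64) -> u64 {
        (guest_insns << self.icount_time_shift) / self.ns_per_tick
    }
}

fn main() {
    let model = FixedCostModel { icount_time_shift: 3, ns_per_tick: 16 };
    // Two hypothetical loops of one million instructions each: one hits the
    // cache, one misses constantly. The model cannot tell them apart.
    let cache_friendly = model.predicted_ticks(1_000_000);
    let cache_hostile = model.predicted_ticks(1_000_000);
    println!("{cache_friendly} {cache_hostile}"); // prints 500000 500000
}
```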

These results will have nothing to do with real code running on hardware: you need to use one of the proprietary performance simulators for the target microarchitecture, with pipeline and cache models, or test directly on hardware.
