What happens to instructions with delayed results on a branch in ARM assembly?

I am optimizing an algorithm in ARM assembly and have to figure out in what order to place instructions to minimize pipeline stalls. The cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us is very helpful for this, but it is unaware of what happens when functions or branches are called. What I want to do is basically this (just an example):

mul       r4, r0, r1
mov       r0, #0
mov       r1, #12
mov       r4, r4, ASR #14
str       r4, [r5]
bl        foo

      

The pipeline stall between the mul and mov instructions is pretty terrible, and there is nothing stopping me from making the function call between them. But what exactly happens to the pipeline when I do a branch? I know that foo does push {r4-r12, lr} as its first instruction. I see two possible outcomes:

  • The branch instruction takes several cycles, which allows the mul instruction to deliver its result before push executes, thereby reducing pipeline stalls.
  • The pipeline stall gets longer because push needs r4 several cycles before it executes (this was the case before ARMv7, IIRC; the cycle counter in the link doesn't seem to think it is necessary).

In short:
What happens to instructions with delayed results (mul is the main example) when you make a function call (which is expected to push registers onto the stack) or even a normal branch?



2 answers


If I understand correctly, you don't need to do

mov       r4, r4, ASR #14
str       r4, [r5]

before the call. Placing the call before the mov:

bl        foo
mov       r4, r4, ASR #14
str       r4, [r5]

is a good idea.



The mul will have more time to finish during the call. The STM (the push in foo) will be the problem, as you noticed. You could of course push r4 before computing it.

If foo is an asm function, you can save r4 later in foo (or you can try not to use r4 at all and then not save it).

If foo is a C function (or if you can't change the push instruction), use r12 instead of r4 as the destination register of the mul.

r12 is needed later by the STM instruction, so it is possible that the mul has enough time to complete before the STM needs the destination register (r12).
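
As a sketch of the r12 suggestion, the question's example could be rearranged as follows. This assumes foo restores r12 before returning (for example with a matching pop {r4-r12, pc}), because the ARM procedure call standard otherwise treats r12 as a scratch register that a callee may clobber. Whether the stall actually disappears depends on the core and has not been measured here.

```asm
mul       r12, r0, r1          @ result goes to r12 instead of r4
mov       r0, #0
mov       r1, #12
bl        foo                  @ foo starts with push {r4-r12, lr}
mov       r12, r12, ASR #14    @ assumption: foo handed r12 back unchanged
str       r12, [r5]
```

If foo cannot be trusted to preserve r12, the shift and store would have to stay before the call, as in the original sequence.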



I'm not sure what the answer is, but if it is public, I'm sure it will be in the Cortex-A8 Technical Reference Manual.


