ARM Neon Assembler - weird pipeline issue

I am trying to improve the performance of a piece of code written in ARM Assembler using Neon instructions.

For testing and calculation, I use this calculator:

I noticed that on the line "n.34-0 1c n0" the NEON unit suddenly seems to stall (?) for 10 cycles. What could be causing this, or is it just a bug in the calculator?

I would also appreciate general advice on how to improve performance in ARM / NEON assembler.

The target is an ARM Cortex-A9. I am compiling with the latest Android NDK using inline assembler. Thanks.



3 answers

It's actually a little more complicated. BitBank is right that NEON has to wait for D4.

But the wait is 10 cycles because NEON has a load/store queue, and that queue already holds other instructions ahead of

vld1.64 d4, [r7, :64]


So when you need D4 you have to wait for this instruction, and before it can execute, all of the earlier load/store instructions already in the NEON load/store queue have to execute first.



The NEON unit has to wait at this instruction because you are referencing a register (D4) that was loaded by the previous NEON instruction (n.33-0 1c n0). Loads are not instantaneous, and because of pipelining there is a delay before the data is available, even if it comes from the cache. You need to reorder your ARM and NEON instructions so that you don't use registers right after they are loaded; otherwise you end up with wasted cycles (pipeline stalls).
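The reordering idea can be sketched in plain C as a software-pipelined loop: the load for the next element is issued before the work on the current one, so each loaded value gets a full iteration to arrive. This is only an illustration under assumptions; the function name `scaled_sum_pipelined` and the multiply-accumulate kernel are hypothetical and not taken from the code in the question. A compiler is free to schedule this differently, but in hand-written assembler the same shape applies to a `vld1` and the instruction that consumes its result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: returns the sum of src[i] * k.
 * The name and the operation are illustrative only. */
uint32_t scaled_sum_pipelined(const uint16_t *src, size_t n, uint16_t k)
{
    assert(src != NULL || n == 0);
    if (n == 0)
        return 0;

    uint32_t acc = 0;
    uint32_t cur = src[0];          /* prologue: first load is issued early */

    for (size_t i = 1; i < n; i++) {
        uint32_t next = src[i];     /* issue the load for the NEXT element */
        acc += cur * k;             /* use the value loaded one iteration ago */
        cur = next;
    }

    acc += cur * k;                 /* epilogue: consume the last loaded value */
    return acc;
}
```

The cost of this shape is one extra live register per value in flight, which is usually a good trade when the alternative is a stall on every load.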



You shouldn't access memory from the ARM side while NEON is doing its job. This causes a full stall on NEON.

You are apparently trying to do some kind of parallel processing, which is counterproductive for the reason above.

Also, there are too many ldrb instructions. Byte accesses on ARM are almost a sin as well.

I suggest that you completely rewrite your code in C first, using only 32-bit memory accesses, and then evaluate whether it is suited to NEON at all.
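As a rough sketch of what "only 32-bit memory accesses" could look like in C, the routine below sums a byte buffer by reading one 32-bit word at a time and splitting it in registers, instead of issuing one byte load per element. The function name `sum_bytes_wordwise` and the summing task are assumptions made for illustration; the actual code from the question is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum n bytes, reading the buffer one 32-bit word
 * at a time rather than byte by byte. */
uint32_t sum_bytes_wordwise(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);

    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        /* memcpy expresses a 32-bit load without alignment or aliasing
         * problems; compilers typically turn it into a single load. */
        memcpy(&w, src + i, sizeof w);
        /* Split the word into its four bytes in registers. The sum does
         * not depend on byte order. */
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu)
             + ((w >> 16) & 0xFFu) + (w >> 24);
    }

    for (; i < n; i++)              /* tail: fewer than 4 bytes remain */
        sum += src[i];

    return sum;
}
```

Whether this beats a NEON version depends on the actual workload, so it is worth measuring both before committing to either.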


