Real-Time C Programming Performance Dilemma

I am working on an embedded architecture where assembly is dominant. I would like to refactor most of our legacy assembly code into C to improve readability and modularity.

Still, I keep running into little details that threaten to dash my hopes. The real problem is much more complicated than the following example, but I would like to share it as an entry point for discussion.

My goal is to find the optimal workaround.

Here's the original example (don't worry about what the code does; I wrote it randomly just to illustrate the problem I'd like to discuss):

int foo;
int bar;
int tmp;
int sum;

void do_something() {
    tmp = bar;
    bar = foo + bar;
    foo = foo + tmp;
}

void compute_sum() {
    for(tmp = 1; tmp < 3; tmp++)
        sum += foo * sum + bar * sum;
}

void a_function() {
    compute_sum();
    do_something();
}

With this dummy code, everyone will immediately remove all global variables and replace them with local ones:

void do_something(int *a, int *b) {
    int tmp = *b;
    *b = *a + *b;
    *a = *a + tmp;
}

void compute_sum(int *sum, int *foo, int *bar) {
    int tmp;
    for(tmp = 1; tmp < 3; tmp++)
        *sum += *foo * *sum + *bar * *sum;
}

void a_function(int *sum, int *foo, int *bar) {
    compute_sum(sum, foo, bar);
    do_something(foo, bar);
}


Unfortunately, this revision is worse than the original code, because all parameters are pushed onto the stack, which leads to delays and a larger code size.
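
One middle ground, not from the original post but common in embedded C, is to pass a single pointer to a context struct, so only one argument is pushed no matter how many variables the functions share (names here are illustrative):

```c
#include <assert.h>

/* Hypothetical context struct: one pointer is passed instead of three. */
typedef struct {
    int foo;
    int bar;
    int sum;
} calc_ctx;

static void do_something(calc_ctx *c) {
    int tmp = c->bar;
    c->bar = c->foo + c->bar;
    c->foo = c->foo + tmp;
}

static void compute_sum(calc_ctx *c) {
    int i;
    for (i = 1; i < 3; i++)
        c->sum += c->foo * c->sum + c->bar * c->sum;
}

static void a_function(calc_ctx *c) {
    compute_sum(c);
    do_something(c);
}
```

Whether this beats globals still depends on the register allocator, but it at least caps the per-call stack traffic at one pointer.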

Making everything global is the fastest solution but also the ugliest, especially when the source code is about 300K lines long with almost 3000 global variables.

We are not facing a compiler problem here, but a structural one. Beautiful, portable, readable, modular, reliable code will never win the ultimate benchmark, because compilers are not that smart, even in 2015.

An alternative is to prefer inline functions. Unfortunately, these functions have to be located in a header file, which is ugly as well.
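
For reference, the usual header-based pattern is a "static inline" function; this is a sketch with invented names, not code from the original post:

```c
/* ops.h -- hypothetical header.  The definition must be visible at
 * every call site for the compiler to be able to inline it, which is
 * why it lives in the header rather than a .c file. */
#ifndef OPS_H
#define OPS_H

static inline int half_plus(int foo, int bar) {
    /* Shift instead of divide; well-defined for non-negative foo. */
    return bar + (foo >> 1);
}

#endif /* OPS_H */
```

The "static" keeps each translation unit's copy private, avoiding duplicate-symbol issues at link time.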

The compiler can only see the file it is working on. When a function is marked extern, it will irrevocably lead to performance problems, because the compiler cannot make any assumptions about external declarations.

Alternatively, the linker could do the job and ask the compiler to rebuild the object files using the extra information gathered at link time. Unfortunately, not many compilers offer such a feature, and those that do slow down the build process significantly.

Eventually I ran into this dilemma:

  • Keep your code ugly to keep performance

    • Everything is global
    • Functions without parameters (like procedures)
    • Keep everything in one file
  • Follow standards and write clean code

    • Think about modules
    • Write small but numerous functions with well-defined parameters
    • Write small but numerous source files

What should I do when the target architecture has limited resources? Going back to assembly is my last option.

Additional Information

I am working on the SHARC architecture, which is a pretty powerful Harvard CISC architecture. Unfortunately, one instruction takes 48 bits while a long takes only 32 bits. Given this, it is better to keep a precomputed version of a variable rather than evaluating the second value on the fly:

Optimized example:

int foo;
int bar;
int half_foo;

void example_a() {
   write(half_foo + bar);
}

Standard example:

void example_a(int foo, int bar) {
   write(bar + (foo >> 1));
}



3 answers

I'm used to working in performance-critical kernel areas with very tight requirements, often taking the optimizer and the performance of the standard library with a grain of salt (e.g., not worrying too much about the speed of malloc or auto-generated vectorization).

However, I've never had requirements so tight that the instruction count or the cost of pushing a few more arguments onto the stack mattered. If this really is a major problem for the target system, then it should be noted that benchmarks modeled at micro-granularity often foster an obsession with the smallest micro-level performance details.

Micro-efficiency benchmarks

At a previous workplace we made the mistake of writing all sorts of shallow micro-level benchmarks, timing things as simple as reading a single 32-bit float from a file. Meanwhile, we made an optimization that significantly sped up large-scale, real-world test cases that read and parse the contents of entire files, while some of those uber-micro tests actually got slower for no apparent reason (they weren't even directly modified, but changes to surrounding code can have indirect effects through dynamic factors such as caches and paging, or simply through how the optimizer treats the code).

So the micro-level world can get a little chaotic when you're working in a higher-level language than assembly. The performance of tiny things can shift under your feet, but you have to ask yourself which matters more: a slight slowdown in reading a single 32-bit float from a file, or making the real-world operations that read entire files significantly faster. Modeling your benchmarks and profiling at a higher level gives you the ability to selectively and productively optimize the parts that really matter. There is more than one way to skin a cat.

Run a profiler over a tiny operation invoked a million times in a loop, and you have already backed yourself into the assembly-style micro-corner just by how you profile the code. So you really want to zoom out a bit and benchmark at a coarser level, so you can act like a disciplined sniper: hone the micro-efficiency of a few carefully chosen units by hunting down the ringleaders of inefficiency, rather than trying to be the hero who takes out every minor infantryman that might hinder execution.

Linker optimization

One of your misconceptions is that only the compiler can act as an optimizer. Linkers can perform various optimizations when linking object files together, including inlining code across objects. So it is rarely, if ever, necessary to hammer everything into one object file as an optimization. Take a closer look at your linker settings if you haven't already.

Interface design

Here, the key to a maintainable, large-scale codebase lies more in the interfaces (i.e., the header files) than in the implementations (the source files). If you have a car with an engine that goes a thousand miles an hour, you might look under the hood and find fire-breathing demons dancing around to make it happen. Perhaps a pact with those demons was involved to gain that speed. But you don't need to disclose that fact to the people driving the car. You can still give them a nice set of intuitive, safe controls to handle the beast.

So you might have a system where non-inlined function calls are "expensive", but expensive relative to what? If you call a function that sorts a million elements, the relative cost of pushing a few small arguments like pointers and integers onto the stack should be utterly trivial, no matter what hardware you're on. Inside the function, you can do all sorts of profiler-guided things to improve performance: macros to force-inline code, perhaps even some inline assembly. The key is to keep that complexity from cascading across your entire system: keep all the demon code hidden from the people using your sort function, and make sure it is well tested so they don't have to constantly pop the hood trying to figure out the source of a failure.
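
As a sketch of that idea (all names invented; a plain insertion sort stands in for the optimized internals), the public surface can stay clean while the implementation keeps its tricks private:

```c
#include <assert.h>
#include <stddef.h>

/* Public interface (would live in sort.h): small, safe, intuitive. */
void fast_sort(int *data, size_t n);

/* Implementation (would live in sort.c): free to use macros, forced
 * inlining, even inline assembly, without leaking any of it to
 * callers.  Insertion sort stands in for the "demon" code here. */
#define SWAP(a, b) do { int t_ = (a); (a) = (b); (b) = t_; } while (0)

void fast_sort(int *data, size_t n) {
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && data[j - 1] > data[j]; j--)
            SWAP(data[j - 1], data[j]);
}
```

Callers only ever see the one-line prototype; the macro and any future hand-tuning stay behind the curtain.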

Ignoring the "relative to what?" question and focusing on absolutes likewise leads to micro-profiling, which can be more counterproductive than beneficial.

So I would suggest looking at this more from the level of public interface design, because behind an interface, if you peek behind the curtain / under the hood, you may find all sorts of evil deeds going on in the hot spots your profiler reveals. But you shouldn't need to pop the hood very often if your interfaces are well designed and well tested.

Globals become more of a problem the wider their scope. If you have globals defined static, with internal linkage inside a source file that nobody else can access, those are really rather "local" globals. If thread safety is not a concern (if it is, then you should avoid mutable globals as much as possible), then you might have a number of performance-critical areas in your codebase where looking under the hood reveals file-static variables used to reduce the overhead of function calls. That is still much easier to maintain than assembly, especially when the visibility of such globals shrinks as source files get smaller and take on clearer and clearer responsibilities.
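
A minimal sketch of such "local" globals, assuming a single-threaded context (names are made up, reusing the question's dummy computation):

```c
#include <assert.h>

/* "Local" globals: internal linkage (static) keeps them visible only
 * inside this source file, so the fast parameterless helpers stay
 * hidden behind one small public entry point. */
static int foo, bar, sum;

static void do_something(void) {   /* no arguments pushed */
    int tmp = bar;
    bar = foo + bar;
    foo = foo + tmp;
}

static void compute_sum(void) {
    int i;
    for (i = 1; i < 3; i++)
        sum += foo * sum + bar * sum;
}

/* The only function other files would see (via a header). */
int a_function(int f, int b) {
    foo = f;
    bar = b;
    sum = 1;
    compute_sum();
    do_something();
    return sum;
}
```

From the caller's side this looks like an ordinary function with parameters; the globals never escape the file.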



Ugly C code is still much more readable than assembler. Plus, you'll probably get some unexpected free optimizations.

The compiler can only see the file it is working on. When a function is marked extern, it will irrevocably lead to performance problems, because the compiler cannot make any assumptions about external declarations.

Wrong, and wrong. Have you tried whole-program optimization? It gives you the benefits of inline functions without having to organize them into headers. And putting things in headers isn't necessarily ugly anyway, if you keep the headers organized.

In your VisualDSP++ compiler, this is activated with the -ipa switch:


The ccts compiler has a capability called inter-procedural analysis (IPA), a mechanism that allows the compiler to optimize across translation units instead of within a single translation unit. This effectively lets the compiler see all of the source files used in the final link at compile time and use that information during optimization.

All -ipa optimizations are invoked after the initial link, at which point a special program called the prelinker reruns the compiler to perform the new optimizations.



I have developed, written, tested, and documented many real-time embedded systems.

Both "soft" real-time and "hard" real-time.

I can tell you with confidence that the algorithm used to implement the application is where the greatest gains in speed are found.

Little stuff like a function call versus inlined code is trivial unless it is executed thousands (or even hundreds of thousands) of times.


