IIR filter optimization

Quick question about IIR filter coefficients. Here is a fairly typical implementation of a direct form II biquad IIR filter that I found on the internet.

// b0, b1, b2, a1, a2 are filter coefficients
// m1, m2 are the memory locations
// dn is the de-denormal coeff (=1.0e-20f) 

void processBiquad(const float* in, float* out, unsigned length)
{
    for(unsigned i = 0; i < length; ++i)
    {
        register float w = in[i] - a1*m1 - a2*m2 + dn;
        out[i] = b1*m1 + b2*m2 + b0*w;
        m2 = m1; m1 = w;
    }
    dn = -dn;
}


I understand that the "register" is a bit unnecessary, given how smart modern compilers are these days. My question is: are there potential performance benefits to storing the filter coefficients in separate variables rather than in arrays and dereferencing the values? And does the answer depend on the target platform?

i.e.

out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;


against

out[i] = b1*m1 + b2*m2 + b0*w;




3 answers


It really depends on your compiler and optimization options. Here's my take:

  • Any modern compiler will simply ignore `register`. It is only a hint to the compiler, and modern compilers just don't use it.
  • Constant-index array accesses inside a loop are usually optimized away in an optimized build. In that sense, using separate variables or an array, as you have shown, makes no difference.
  • Always, always benchmark and inspect the generated code for performance-critical sections.

EDIT: OK, just out of curiosity, I wrote a little test program and ended up with identical generated code using full optimization in VS2010. Here is what I get inside the loop for the expression in question (exactly the same for both cases):



0128138D  fmul        dword ptr [eax+0Ch]  
01281390  faddp       st(1),st  
01281392  fld         dword ptr [eax+10h]  
01281395  fld         dword ptr [w]  
01281398  fld         st(0)  
0128139A  fmulp       st(2),st  
0128139C  fxch        st(2)  
0128139E  faddp       st(1),st  
012813A0  fstp        dword ptr [ecx+8]  


Note that I've added a few lines to output the results to make sure the compiler doesn't just optimize everything away. Here is the code:

#include <iostream>
#include <iterator>
#include <algorithm>
#include <tchar.h>   // for _tmain/_TCHAR (Windows/MSVC)

class test1 
{
    float a1, a2, b0, b1, b2;
    float dn;
    float m1, m2;

public:
    void processBiquad(const float* in, float* out, unsigned length)
    {
        for(unsigned i = 0; i < length; ++i)
        {
            float w = in[i] - a1*m1 - a2*m2 + dn;
            out[i] = b1*m1 + b2*m2 + b0*w;
            m2 = m1; m1 = w;
        }
        dn = -dn;
    }
};

class test2 
{
    float a[2], b[3];
    float dn;
    float m1, m2;

public:
    void processBiquad(const float* in, float* out, unsigned length)
    {
        for(unsigned i = 0; i < length; ++i)
        {
            float w = in[i] - a[0]*m1 - a[1]*m2 + dn;
            out[i] = b[0]*m1 + b[1]*m2 + b[2]*w;
            m2 = m1; m1 = w;
        }
        dn = -dn;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    test1 t1;
    test2 t2;

    float a[1000];   // contents left uninitialized; only the generated code matters here
    float b[1000];

    t1.processBiquad(a, b, 1000);
    t2.processBiquad(a, b, 1000);

    std::copy(b, b+1000, std::ostream_iterator<float>(std::cout, " "));

    return 0;
}




I'm not sure, but this:

out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;




may be worse, because it compiles to indirect accesses, and indirect access generally performs worse than direct access.

The only way to know for sure is to inspect the generated assembly and profile the code.



You will most likely get an advantage if you can declare the coefficients b0, b1, b2 as const. Code tends to be more efficient when operands are known and fixed at compile time.
