How to speed up the below code to compute LBP on CPU significantly?

Question

The code below is called heavily in an object detection program and accounts for about 80% of the execution time. Is there a way to speed it up significantly?

#define CALC_SUM_(p0, p1, p2, p3, offset) ((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])
inline int calc_lbp2(float *p[], int offset)
{
    int cval = CALC_SUM_( p[5], p[6], p[9], p[10], offset );

    return (CALC_SUM_( p[0],  p[1],  p[4],  p[5],  offset ) >= cval ? 128 : 0) |  // 0
           (CALC_SUM_( p[1],  p[2],  p[5],  p[6],  offset ) >= cval ? 64  : 0) |  // 1
           (CALC_SUM_( p[2],  p[3],  p[6],  p[7],  offset ) >= cval ? 32  : 0) |  // 2
           (CALC_SUM_( p[6],  p[7],  p[10], p[11], offset ) >= cval ? 16  : 0) |  // 5
           (CALC_SUM_( p[10], p[11], p[14], p[15], offset ) >= cval ? 8   : 0) |  // 8
           (CALC_SUM_( p[9],  p[10], p[13], p[14], offset ) >= cval ? 4   : 0) |  // 7
           (CALC_SUM_( p[8],  p[9],  p[12], p[13], offset ) >= cval ? 2   : 0) |  // 6
           (CALC_SUM_( p[4],  p[5],  p[8],  p[9],  offset ) >= cval ? 1   : 0);   // 3
}

I tried SSE, but that made the program slower by more than 50 ms (the original execution time was about 170 ms):

inline int calc_lbp_sse(float *p[], int offset)
{
    static unsigned short bits[] = {0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001};
    short c = CALC_SUM_( p[5], p[6], p[9], p[10], offset );
    __m128i a = _mm_setr_epi16
                (
                    CALC_SUM_( p[0],  p[1],  p[4],  p[5],  offset ),
                    CALC_SUM_( p[1],  p[2],  p[5],  p[6],  offset ),
                    CALC_SUM_( p[2],  p[3],  p[6],  p[7],  offset ),
                    CALC_SUM_( p[6],  p[7],  p[10], p[11], offset ),
                    CALC_SUM_( p[10], p[11], p[14], p[15], offset ),
                    CALC_SUM_( p[9],  p[10], p[13], p[14], offset ),
                    CALC_SUM_( p[8],  p[9],  p[12], p[13], offset ),
                    CALC_SUM_( p[4],  p[5],  p[8],  p[9],  offset )
                );
    __m128i b = _mm_setr_epi16(c, c, c, c, c, c, c, c);

    __m128i res = _mm_cmplt_epi16(b,a);
    unsigned short* vals = (unsigned short*)&res;

    return ((vals[0]&bits[0]) | (vals[1]&bits[1]) | (vals[2]&bits[2]) | (vals[3]&bits[3]) |
            (vals[4]&bits[4]) | (vals[5]&bits[5]) |(vals[6]&bits[6]) |(vals[7]&bits[7]));
}

optimization c x86 sse simd

Terry wu 09 dec. '14 at 2:30

1 answer

JS1 · Answer 1 · 2014-12-09T08:53:22+0000

I ran your function 200,000,000 times on my desktop computer and it took 5.3 seconds. Then I changed this line:

int cval = CALC_SUM_( p[5], p[6], p[9], p[10], offset );

to this:

float cval = CALC_SUM_( p[5], p[6], p[9], p[10], offset );

I repeated the same test and it now took 3.0 seconds. I'm not familiar with LBP, but it looks like the conversion of your center value to int was not intentional. From what I have read about LBP, you are simply comparing the adjacent values to the center value. If truncating to int really is important, just ignore this answer.
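To make the change concrete, here is a minimal sketch of both variants side by side. The names calc_lbp2_trunc and calc_lbp2_float are made up for this illustration (the question only has calc_lbp2). With int cval, the center sum is converted from float to int and then back to float for the comparisons, which is the extra work the float version avoids. The two are not guaranteed to return the same code when the sums are not whole numbers, because the conversion to int truncates toward zero.

```c
#define CALC_SUM_(p0, p1, p2, p3, offset) \
    ((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])

/* Original behaviour: the center sum is truncated to int before comparing. */
static inline int calc_lbp2_trunc(float *p[], int offset)
{
    int cval = CALC_SUM_(p[5], p[6], p[9], p[10], offset);

    return (CALC_SUM_(p[0],  p[1],  p[4],  p[5],  offset) >= cval ? 128 : 0) |
           (CALC_SUM_(p[1],  p[2],  p[5],  p[6],  offset) >= cval ? 64  : 0) |
           (CALC_SUM_(p[2],  p[3],  p[6],  p[7],  offset) >= cval ? 32  : 0) |
           (CALC_SUM_(p[6],  p[7],  p[10], p[11], offset) >= cval ? 16  : 0) |
           (CALC_SUM_(p[10], p[11], p[14], p[15], offset) >= cval ? 8   : 0) |
           (CALC_SUM_(p[9],  p[10], p[13], p[14], offset) >= cval ? 4   : 0) |
           (CALC_SUM_(p[8],  p[9],  p[12], p[13], offset) >= cval ? 2   : 0) |
           (CALC_SUM_(p[4],  p[5],  p[8],  p[9],  offset) >= cval ? 1   : 0);
}

/* Suggested change: keep the center sum as float, so no conversion is needed. */
static inline int calc_lbp2_float(float *p[], int offset)
{
    float cval = CALC_SUM_(p[5], p[6], p[9], p[10], offset);

    return (CALC_SUM_(p[0],  p[1],  p[4],  p[5],  offset) >= cval ? 128 : 0) |
           (CALC_SUM_(p[1],  p[2],  p[5],  p[6],  offset) >= cval ? 64  : 0) |
           (CALC_SUM_(p[2],  p[3],  p[6],  p[7],  offset) >= cval ? 32  : 0) |
           (CALC_SUM_(p[6],  p[7],  p[10], p[11], offset) >= cval ? 16  : 0) |
           (CALC_SUM_(p[10], p[11], p[14], p[15], offset) >= cval ? 8   : 0) |
           (CALC_SUM_(p[9],  p[10], p[13], p[14], offset) >= cval ? 4   : 0) |
           (CALC_SUM_(p[8],  p[9],  p[12], p[13], offset) >= cval ? 2   : 0) |
           (CALC_SUM_(p[4],  p[5],  p[8],  p[9],  offset) >= cval ? 1   : 0);
}
```

For example, if p[5][offset] is 0.5 and every other entry is 0, the float version returns 128 while the truncating version returns 190, so it is worth checking which behavior the rest of the program expects before switching.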

As an aside, I tried what japreiss suggested, replacing ? : with << 6, but I got exactly the same speeds anyway. Apparently the compiler has already optimized this (I am using gcc -O3).
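For reference, the shift variant could look roughly like this. It is only a sketch: calc_lbp2_shift is a hypothetical name, and it keeps cval as a float as in the change suggested earlier. In C a comparison evaluates to 0 or 1, so shifting the result left by the bit position gives the same value as the ? : form.

```c
#define CALC_SUM_(p0, p1, p2, p3, offset) \
    ((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])

/* Each comparison yields 0 or 1; the shift moves it to its bit position. */
static inline int calc_lbp2_shift(float *p[], int offset)
{
    float cval = CALC_SUM_(p[5], p[6], p[9], p[10], offset);

    return ((CALC_SUM_(p[0],  p[1],  p[4],  p[5],  offset) >= cval) << 7) |
           ((CALC_SUM_(p[1],  p[2],  p[5],  p[6],  offset) >= cval) << 6) |
           ((CALC_SUM_(p[2],  p[3],  p[6],  p[7],  offset) >= cval) << 5) |
           ((CALC_SUM_(p[6],  p[7],  p[10], p[11], offset) >= cval) << 4) |
           ((CALC_SUM_(p[10], p[11], p[14], p[15], offset) >= cval) << 3) |
           ((CALC_SUM_(p[9],  p[10], p[13], p[14], offset) >= cval) << 2) |
           ((CALC_SUM_(p[8],  p[9],  p[12], p[13], offset) >= cval) << 1) |
            (CALC_SUM_(p[4],  p[5],  p[8],  p[9],  offset) >= cval);
}
```

Since an optimizing compiler can typically turn the ? : form into the same branch-free code, this rewrite is mostly a matter of taste rather than a speedup.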
