SSE2 optimization for RGB565 to RGB888 conversion (no alpha channel)

I am trying to convert a pixel buffer from 16 bits per pixel:

RGB565: rrrrrggggggbbbbb|rrrrr...

to 24 bits per pixel:

RGB888: rrrrrrrrggggggggbbbbbbbb|rrrrrrrr...

I have a fairly optimized scalar algorithm, but I'm very curious how this could be done with SSE; it seems like a good candidate. Suppose the input is 16bpp data in aligned memory, 64x64 pixels in size, so everything divides evenly: the source buffer is 64 * 64 * 16 bits and it gets converted into a 64 * 64 * 24 bit buffer.

Loading the source color buffer (16bpp) into an __m128i register (and iterating over the buffer), I can process 8 pixels at a time. Using masks and shifts, I can extract each component into a different register (pseudocode):

e.g. for r_c:

Input buffer: c565
Output buffer: c888

__m128i* ptr = (__m128i*)c565;                     // Original byte buffer, RGB565
__m128i r_mask_16 = _mm_set1_epi16((short)0xF800); // == _mm_set_epi8(0xF8, 0, 0xF8, 0, ...)
__m128i r_c = _mm_and_si128(*ptr, r_mask_16);

result:

__m128i r_c = [r0|0|r1|0|....r7|0]
__m128i g_c = [g0|0|g1|0|....g7|0]
__m128i b_c = [b0|0|b1|0|....b7|0]
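
For reference, that extraction step as a compilable sketch (the function name and the 16-byte-alignment assumption are mine):

#include <emmintrin.h>
#include <stdint.h>

// Extract the three channels of 8 RGB565 pixels into separate vectors.
// Each 16-bit lane of *r_c/*g_c/*b_c holds that channel's bits, still in
// their original positions within the pixel.
void extract_channels(const uint16_t *c565, __m128i *r_c, __m128i *g_c, __m128i *b_c) {
    __m128i px = _mm_load_si128((const __m128i *)c565); // 8 pixels, 16-byte aligned
    *r_c = _mm_and_si128(px, _mm_set1_epi16((short)0xF800)); // bits 11-15
    *g_c = _mm_and_si128(px, _mm_set1_epi16(0x07E0));        // bits 5-10
    *b_c = _mm_and_si128(px, _mm_set1_epi16(0x001F));        // bits 0-4
}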

But if I extract the components manually, all the performance is lost:

c888[0] = r_c[0];
c888[1] = g_c[0];
c888[2] = b_c[0];
c888[3] = r_c[1];
...


I guess the right approach is to merge them into another register and store the result directly to c888, rather than writing each component separately. But I'm not sure how to do this efficiently. Any thoughts?

Note: this question is not a duplicate of "Optimizing RGB565 to RGB888 conversions with SSE2". Converting RGB565 to ARGB8888 is not the same as converting RGB565 to RGB888. That question uses the instructions

punpcklbw
punpckhbw


and these instructions work well when you have an xmm(RB) / xmm(GA) pair producing xmm(RGBA) x2: they take the RB bytes from one xmm register and the GA bytes from the other and interleave them into two xmms. But in my case there is no alpha component to pack.
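
For reference, that interleaving as a self-contained sketch (my reconstruction, not the linked answer's exact code):

#include <emmintrin.h>

// punpcklbw/punpckhbw interleave the bytes of two registers, so with
// rb = [R0,B0,R1,B1,...] and ga = [G0,A0,G1,A1,...] an RB vector and a
// GA vector combine into two RGBA vectors.
static inline void interleave_rgba(__m128i rb, __m128i ga, __m128i *lo, __m128i *hi) {
    *lo = _mm_unpacklo_epi8(rb, ga); // [R0,G0,B0,A0, ..., R3,G3,B3,A3]
    *hi = _mm_unpackhi_epi8(rb, ga); // [R4,G4,B4,A4, ..., R7,G7,B7,A7]
}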

1 answer


Unfortunately SSE doesn't have a good way to write out packed 24-bit integers, so we need to collect the pixel data ourselves.

24bpp pixels take 3 bytes each, but an XMM register is 16 bytes, so we need to handle 16 pixels = 48 bytes = three XMM registers at a time; that way we never have to store just part of an XMM register.
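
Concretely, the three 16-byte stores cover the 16 input pixels like this (one entry per byte; B,G,R memory order as produced by the code below):

out[0..15]:  B0 G0 R0 B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5
out[16..31]: G5 R5 B6 G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10
out[32..47]: R10 B11 G11 R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15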

First, we need to load a vector of 16bpp data and then convert it to a pair of vectors of 32bpp data. I did this by unpacking the data into vectors of uint32 and then shifting and masking those vectors to extract the red, green and blue channels. OR'ing the channels together is the final step to 32bpp. This part could be replaced with the code from the linked question if that turns out faster; I haven't benchmarked my solution.
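
For checking the results, here is a plain scalar version of the same conversion (a sketch; like the SSE code below it only shifts the bits into place, without replicating the high bits into the low bits):

#include <stddef.h>
#include <stdint.h>

// Scalar reference: expand each RGB565 pixel to 3 bytes, stored B,G,R,
// matching the byte order of the SSE2 code below.
void rgb565_to_rgb888_scalar(const uint16_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint16_t px = src[i];
        dst[3*i]     = (uint8_t)((px << 3) & 0xF8); // blue:  bits 0-4   -> bits 3-7
        dst[3*i + 1] = (uint8_t)((px >> 3) & 0xFC); // green: bits 5-10  -> bits 2-7
        dst[3*i + 2] = (uint8_t)((px >> 8) & 0xF8); // red:   bits 11-15 -> bits 3-7
    }
}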

Once we've converted the 16 pixels into vectors of 32bpp pixels, those vectors need to be packed together and written to the result array. I decided to mask off each pixel separately and use _mm_bsrli_si128 and _mm_bslli_si128 to move it to its final position in one of the three result vectors. OR'ing those pixels back together yields the packed data, which is written to the result array.

I've tested that this code works, but I haven't taken any performance measurements, and I wouldn't be surprised if there are faster ways to do this, especially if you allow yourself to use something other than SSE2.

It writes 24bpp data with the red channel in the most significant byte of each pixel, i.e. B,G,R byte order in memory.

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define SSE_ALIGN 16

int main(int argc, char *argv[]) {
    // Create a small test buffer
    // We process 16 pixels at a time, so size must be a multiple of 16
    size_t buf_size = 64;
    uint16_t *rgb565buf = aligned_alloc(SSE_ALIGN, buf_size * sizeof(uint16_t));

    // Fill it with recognizable data
    for (size_t i = 0; i < buf_size; i++) {
        uint8_t r = 0x1F & (i + 10);
        uint8_t g = 0x3F & i;
        uint8_t b = 0x1F & (i + 20);
        rgb565buf[i] = (r << 11) | (g << 5) | b;
    }

    // Create a buffer to hold the data after translation to 24bpp
    uint8_t *rgb888buf = aligned_alloc(SSE_ALIGN, buf_size * 3*sizeof(uint8_t));

    // Masks for extracting RGB channels
    const __m128i mask_r = _mm_set1_epi32(0x00F80000);
    const __m128i mask_g = _mm_set1_epi32(0x0000FC00);
    const __m128i mask_b = _mm_set1_epi32(0x000000F8);

    // Masks for extracting 24bpp pixels for the first 128b write
    const __m128i mask_0_1st  = _mm_set_epi32(0,          0,          0,          0x00FFFFFF);
    const __m128i mask_0_2nd  = _mm_set_epi32(0,          0,          0x0000FFFF, 0xFF000000);
    const __m128i mask_0_3rd  = _mm_set_epi32(0,          0x000000FF, 0xFFFF0000, 0         );
    const __m128i mask_0_4th  = _mm_set_epi32(0,          0xFFFFFF00, 0,          0         );
    const __m128i mask_0_5th  = _mm_set_epi32(0x00FFFFFF, 0,          0,          0         );
    const __m128i mask_0_6th  = _mm_set_epi32(0xFF000000, 0,          0,          0         ); 
    // Masks for the second write
    const __m128i mask_1_6th  = _mm_set_epi32(0,          0,          0,          0x0000FFFF);
    const __m128i mask_1_7th  = _mm_set_epi32(0,          0,          0x000000FF, 0xFFFF0000);
    const __m128i mask_1_8th  = _mm_set_epi32(0,          0,          0xFFFFFF00, 0         );
    const __m128i mask_1_9th  = _mm_set_epi32(0,          0x00FFFFFF, 0,          0         );
    const __m128i mask_1_10th = _mm_set_epi32(0x0000FFFF, 0xFF000000, 0,          0         );
    const __m128i mask_1_11th = _mm_set_epi32(0xFFFF0000, 0,          0,          0         );
    // Masks for the third write
    const __m128i mask_2_11th = _mm_set_epi32(0,          0,          0,          0x000000FF);
    const __m128i mask_2_12th = _mm_set_epi32(0,          0,          0,          0xFFFFFF00);
    const __m128i mask_2_13th = _mm_set_epi32(0,          0,          0x00FFFFFF, 0         );
    const __m128i mask_2_14th = _mm_set_epi32(0,          0x0000FFFF, 0xFF000000, 0         );
    const __m128i mask_2_15th = _mm_set_epi32(0x000000FF, 0xFFFF0000, 0,          0         );
    const __m128i mask_2_16th = _mm_set_epi32(0xFFFFFF00, 0,          0,          0         );

    // Convert the RGB565 data into RGB888 data
    __m128i *packed_rgb888_buf = (__m128i*)rgb888buf;
    for (size_t i = 0; i < buf_size; i += 16) {
        // Need to do 16 pixels at a time -> least number of 24bpp pixels that fit evenly in XMM register
        __m128i rgb565pix0_raw = _mm_load_si128((__m128i *)(&rgb565buf[i]));
        __m128i rgb565pix1_raw = _mm_load_si128((__m128i *)(&rgb565buf[i+8]));

        // Extend the 16b ints to 32b ints
        __m128i rgb565pix0lo_32b = _mm_unpacklo_epi16(rgb565pix0_raw, _mm_setzero_si128());
        __m128i rgb565pix0hi_32b = _mm_unpackhi_epi16(rgb565pix0_raw, _mm_setzero_si128());
        // Shift each color channel into the correct position and mask off the other bits
        __m128i rgb888pix0lo_r = _mm_and_si128(mask_r, _mm_slli_epi32(rgb565pix0lo_32b, 8)); // Block 0 low pixels
        __m128i rgb888pix0lo_g = _mm_and_si128(mask_g, _mm_slli_epi32(rgb565pix0lo_32b, 5));
        __m128i rgb888pix0lo_b = _mm_and_si128(mask_b, _mm_slli_epi32(rgb565pix0lo_32b, 3));
        __m128i rgb888pix0hi_r = _mm_and_si128(mask_r, _mm_slli_epi32(rgb565pix0hi_32b, 8)); // Block 0 high pixels
        __m128i rgb888pix0hi_g = _mm_and_si128(mask_g, _mm_slli_epi32(rgb565pix0hi_32b, 5));
        __m128i rgb888pix0hi_b = _mm_and_si128(mask_b, _mm_slli_epi32(rgb565pix0hi_32b, 3));
        // Combine each color channel into a single vector of four 32bpp pixels
        __m128i rgb888pix0lo_32b = _mm_or_si128(rgb888pix0lo_r, _mm_or_si128(rgb888pix0lo_g, rgb888pix0lo_b));
        __m128i rgb888pix0hi_32b = _mm_or_si128(rgb888pix0hi_r, _mm_or_si128(rgb888pix0hi_g, rgb888pix0hi_b));

        // Same thing as above for the next block of pixels
        __m128i rgb565pix1lo_32b = _mm_unpacklo_epi16(rgb565pix1_raw, _mm_setzero_si128());
        __m128i rgb565pix1hi_32b = _mm_unpackhi_epi16(rgb565pix1_raw, _mm_setzero_si128());
        __m128i rgb888pix1lo_r = _mm_and_si128(mask_r, _mm_slli_epi32(rgb565pix1lo_32b, 8)); // Block 1 low pixels
        __m128i rgb888pix1lo_g = _mm_and_si128(mask_g, _mm_slli_epi32(rgb565pix1lo_32b, 5));
        __m128i rgb888pix1lo_b = _mm_and_si128(mask_b, _mm_slli_epi32(rgb565pix1lo_32b, 3));
        __m128i rgb888pix1hi_r = _mm_and_si128(mask_r, _mm_slli_epi32(rgb565pix1hi_32b, 8)); // Block 1 high pixels
        __m128i rgb888pix1hi_g = _mm_and_si128(mask_g, _mm_slli_epi32(rgb565pix1hi_32b, 5));
        __m128i rgb888pix1hi_b = _mm_and_si128(mask_b, _mm_slli_epi32(rgb565pix1hi_32b, 3));
        __m128i rgb888pix1lo_32b = _mm_or_si128(rgb888pix1lo_r, _mm_or_si128(rgb888pix1lo_g, rgb888pix1lo_b));
        __m128i rgb888pix1hi_32b = _mm_or_si128(rgb888pix1hi_r, _mm_or_si128(rgb888pix1hi_g, rgb888pix1hi_b));

        // At this point, rgb888pix_32b contains the pixel data in 32bpp format, need to compress it to 24bpp
        // Use the _mm_bs*li_si128(__m128i, int) intrinsics to shift each 24bpp pixel into its final position
        // ...then mask off the other pixels and combine the result together with or
        __m128i pix_0_1st = _mm_and_si128(mask_0_1st,                 rgb888pix0lo_32b     ); // First 4 pixels
        __m128i pix_0_2nd = _mm_and_si128(mask_0_2nd, _mm_bsrli_si128(rgb888pix0lo_32b, 1 ));
        __m128i pix_0_3rd = _mm_and_si128(mask_0_3rd, _mm_bsrli_si128(rgb888pix0lo_32b, 2 ));
        __m128i pix_0_4th = _mm_and_si128(mask_0_4th, _mm_bsrli_si128(rgb888pix0lo_32b, 3 ));
        __m128i pix_0_5th = _mm_and_si128(mask_0_5th, _mm_bslli_si128(rgb888pix0hi_32b, 12)); // Second 4 pixels
        __m128i pix_0_6th = _mm_and_si128(mask_0_6th, _mm_bslli_si128(rgb888pix0hi_32b, 11));
        // Combine each piece of 24bpp pixel data into a single 128b variable
        __m128i pix128_0 = _mm_or_si128(_mm_or_si128(_mm_or_si128(pix_0_1st, pix_0_2nd), pix_0_3rd), 
                                        _mm_or_si128(_mm_or_si128(pix_0_4th, pix_0_5th), pix_0_6th));
        _mm_store_si128(packed_rgb888_buf, pix128_0);

        // Repeat the same for the second 128b write
        __m128i pix_1_6th  = _mm_and_si128(mask_1_6th,  _mm_bsrli_si128(rgb888pix0hi_32b, 5 ));
        __m128i pix_1_7th  = _mm_and_si128(mask_1_7th,  _mm_bsrli_si128(rgb888pix0hi_32b, 6 ));
        __m128i pix_1_8th  = _mm_and_si128(mask_1_8th,  _mm_bsrli_si128(rgb888pix0hi_32b, 7 ));
        __m128i pix_1_9th  = _mm_and_si128(mask_1_9th,  _mm_bslli_si128(rgb888pix1lo_32b, 8 )); // Third 4 pixels
        __m128i pix_1_10th = _mm_and_si128(mask_1_10th, _mm_bslli_si128(rgb888pix1lo_32b, 7 ));
        __m128i pix_1_11th = _mm_and_si128(mask_1_11th, _mm_bslli_si128(rgb888pix1lo_32b, 6 ));
        __m128i pix128_1 = _mm_or_si128(_mm_or_si128(_mm_or_si128(pix_1_6th, pix_1_7th),  pix_1_8th ), 
                                        _mm_or_si128(_mm_or_si128(pix_1_9th, pix_1_10th), pix_1_11th));
        _mm_store_si128(packed_rgb888_buf+1, pix128_1);

        // And again for the third 128b write
        __m128i pix_2_11th = _mm_and_si128(mask_2_11th, _mm_bsrli_si128(rgb888pix1lo_32b, 10));
        __m128i pix_2_12th = _mm_and_si128(mask_2_12th, _mm_bsrli_si128(rgb888pix1lo_32b, 11));
        __m128i pix_2_13th = _mm_and_si128(mask_2_13th, _mm_bslli_si128(rgb888pix1hi_32b,  4)); // Fourth 4 pixels
        __m128i pix_2_14th = _mm_and_si128(mask_2_14th, _mm_bslli_si128(rgb888pix1hi_32b,  3));
        __m128i pix_2_15th = _mm_and_si128(mask_2_15th, _mm_bslli_si128(rgb888pix1hi_32b,  2));
        __m128i pix_2_16th = _mm_and_si128(mask_2_16th, _mm_bslli_si128(rgb888pix1hi_32b,  1));
        __m128i pix128_2 = _mm_or_si128(_mm_or_si128(_mm_or_si128(pix_2_11th, pix_2_12th), pix_2_13th), 
                                        _mm_or_si128(_mm_or_si128(pix_2_14th, pix_2_15th), pix_2_16th));
        _mm_store_si128(packed_rgb888_buf+2, pix128_2);

        // Update pointer for next iteration
        packed_rgb888_buf += 3;
    }

    for (size_t i = 0; i < buf_size; i++) {
        uint8_t r565 = (i + 10) & 0x1F;
        uint8_t g565 = i & 0x3F;
        uint8_t b565 = (i + 20) & 0x1F;
        printf("%2d] RGB = (%02x,%02x,%02x), should be (%02x,%02x,%02x)\n", i, rgb888buf[3*i+2], 
                rgb888buf[3*i+1], rgb888buf[3*i], r565 << 3, g565 << 2, b565 << 3);
    }

    free(rgb565buf);
    free(rgb888buf);

    return EXIT_SUCCESS;
}
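
A note on building: aligned_alloc is C11, so with GCC or Clang on x86-64 (where SSE2 is baseline) something like the following should work; exact flags may vary:

gcc -std=c11 -O2 -o rgb565_to_rgb888 rgb565_to_rgb888.c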




EDIT: here is a second way to compress the 32bpp pixel data to 24bpp. I haven't tested whether it's faster, though I'd guess it is, since it takes fewer instructions and doesn't need the big OR tree at the end. It is less obvious at a glance how it works, though.

This version uses a combination of shifts and shuffles to move whole blocks of pixels into place, instead of masking and moving each pixel separately. The 16bpp-to-32bpp conversion is unchanged.

First, I define a helper function that shifts the low uint32 of each 64-bit half of an __m128i left by one byte.

__m128i bslli_low_dword_once(__m128i x) {
    // _mm_mul_epu32 multiplies the low dword of each qword; multiplying by
    // 256 shifts it left 8 bits (one byte) within that qword
    const __m128i shift_multiplier = _mm_set1_epi32(1<<8);
    // Keep only the high dword of each qword from the original value
    const __m128i mask = _mm_set_epi32(0xFFFFFFFF, 0, 0xFFFFFFFF, 0);

    return _mm_or_si128(_mm_and_si128(x, mask), _mm_mul_epu32(x, shift_multiplier));
}
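
To make the helper's effect concrete, here is the byte-level result on one 64-bit half, assuming each dword holds a 0x00RRGGBB pixel (my own illustration):

// Input qword bytes (little-endian):  B0 G0 R0 00 | B1 G1 R1 00
// The multiply shifts the low dword left one byte within the qword, and
// the mask keeps the high dword, so OR'ing the two gives:
// Output qword bytes:                 00 B0 G0 R0 | B1 G1 R1 00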


Then the only other change is the code that packs the 32bpp data into 24bpp.

// At this point, rgb888pix_32b contains the pixel data in 32bpp format, need to compress it to 24bpp
__m128i pix_0_block0lo = bslli_low_dword_once(rgb888pix0lo_32b);
        pix_0_block0lo = _mm_srli_epi64(pix_0_block0lo, 8);
        pix_0_block0lo = _mm_shufflelo_epi16(pix_0_block0lo, _MM_SHUFFLE(2, 1, 0, 3));
        pix_0_block0lo = _mm_bsrli_si128(pix_0_block0lo, 2);

__m128i pix_0_block0hi = _mm_unpacklo_epi64(_mm_setzero_si128(), rgb888pix0hi_32b);
        pix_0_block0hi = bslli_low_dword_once(pix_0_block0hi);
        pix_0_block0hi = _mm_bslli_si128(pix_0_block0hi, 3);

__m128i pix128_0 = _mm_or_si128(pix_0_block0lo, pix_0_block0hi);
_mm_store_si128(packed_rgb888_buf, pix128_0);

// Do the same basic thing for the next 128b chunk of pixel data
__m128i pix_1_block0hi = bslli_low_dword_once(rgb888pix0hi_32b);
        pix_1_block0hi = _mm_srli_epi64(pix_1_block0hi, 8);
        pix_1_block0hi = _mm_shufflelo_epi16(pix_1_block0hi, _MM_SHUFFLE(2, 1, 0, 3));
        pix_1_block0hi = _mm_bsrli_si128(pix_1_block0hi, 6);

__m128i pix_1_block1lo = bslli_low_dword_once(rgb888pix1lo_32b);
        pix_1_block1lo = _mm_srli_epi64(pix_1_block1lo, 8);
        pix_1_block1lo = _mm_shufflelo_epi16(pix_1_block1lo, _MM_SHUFFLE(2, 1, 0, 3));
        pix_1_block1lo = _mm_bslli_si128(pix_1_block1lo, 6);

__m128i pix128_1 = _mm_or_si128(pix_1_block0hi, pix_1_block1lo);
_mm_store_si128(packed_rgb888_buf+1, pix128_1);

// And again for the final chunk
__m128i pix_2_block1lo = bslli_low_dword_once(rgb888pix1lo_32b);
        pix_2_block1lo = _mm_bsrli_si128(pix_2_block1lo, 11);

__m128i pix_2_block1hi = bslli_low_dword_once(rgb888pix1hi_32b);
        pix_2_block1hi = _mm_srli_epi64(pix_2_block1hi, 8);
        pix_2_block1hi = _mm_shufflelo_epi16(pix_2_block1hi, _MM_SHUFFLE(2, 1, 0, 3));
        pix_2_block1hi = _mm_bslli_si128(pix_2_block1hi, 2);

__m128i pix128_2 = _mm_or_si128(pix_2_block1lo, pix_2_block1hi);
_mm_store_si128(packed_rgb888_buf+2, pix128_2);
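
As a final aside, since approaches beyond SSE2 were explicitly allowed: with SSSE3, _mm_shuffle_epi8 can do most of this repositioning in one instruction per input register. A sketch for the first output register only (same 0x00RRGGBB pixel layout as above; a control byte of -1 has its high bit set, which zeroes that output byte):

#include <tmmintrin.h> // SSSE3

// Assemble output bytes 0-15: pixels 0-3 come from pix0lo, pixel 4 and
// the first byte of pixel 5 come from pix0hi.
__m128i pack24_chunk0_ssse3(__m128i pix0lo, __m128i pix0hi) {
    __m128i lo = _mm_shuffle_epi8(pix0lo, _mm_set_epi8(
        -1, -1, -1, -1, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0));
    __m128i hi = _mm_shuffle_epi8(pix0hi, _mm_set_epi8(
         4,  2,  1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
    return _mm_or_si128(lo, hi);
}

The other two output registers follow the same pattern with different control masks.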
