Fast 32-bit array → 24-bit array conversion in SSE3? (RGB32 → RGB24)

This question is related to the previously answered question: Fast 24 bit array -> 32 bit array conversion? In one of the answers, interjay kindly posted SSE3 code to convert RGB24 -> RGB32, but I also need the inverse conversion (RGB32 -> RGB24). I gave it a shot (see below) and my code definitely works, but it's more complex than interjay's and noticeably slower. I couldn't figure out how to reverse the instructions exactly: _mm_alignr_epi8 doesn't seem useful here, but I'm not as familiar with SSE3 as I should be. Is the asymmetry inevitable, or is there a faster replacement for the shifts and ORing?

RGB32 → RGB24:

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2, 4,5,6, 8,9,10, 12,13,14, -1,-1,-1,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_shuffle_epi8(_mm_load_si128(src), mask);
    __m128i sb = _mm_shuffle_epi8(_mm_load_si128(src + 1), mask);
    __m128i sc = _mm_shuffle_epi8(_mm_load_si128(src + 2), mask);
    __m128i sd = _mm_shuffle_epi8(_mm_load_si128(src + 3), mask);
    _mm_store_si128(dst, _mm_or_si128(sa, _mm_slli_si128(sb, 12)));
    _mm_store_si128(dst + 1, _mm_or_si128(_mm_srli_si128(sb, 4), _mm_slli_si128(sc, 8)));
    _mm_store_si128(dst + 2, _mm_or_si128(_mm_srli_si128(sc, 8), _mm_slli_si128(sd, 4)));
    src += 4;
    dst += 3;
}
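For reference, a scalar version of the same conversion (a minimal sketch; like the shuffle mask above, it assumes the discarded byte is the fourth byte of each 4-byte pixel):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar RGB32 -> RGB24: copy 3 bytes per pixel, skip the 4th.
   The shuffle mask above drops byte 3 of every 4-byte group;
   this loop does the same, one pixel at a time. */
static void rgb32_to_rgb24_scalar(const uint8_t *src, uint8_t *dst,
                                  size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        src += 4;
        dst += 3;
    }
}
```

Each iteration of the SSE loop above does the work of 16 iterations of this loop (64 source bytes in, 48 destination bytes out).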


RGB24 -> RGB32 (courtesy interjay):

__m128i *src = ...
__m128i *dst = ...
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (UINT i = 0; i < Pixels; i += 16) {
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src + 1);
    __m128i sc = _mm_load_si128(src + 2);
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst + 1, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst + 2, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst + 3, val);
    src += 3;
    dst += 4;
}



2 answers


You can take this answer and change the shuffle masks so it converts RGB32 to RGB24 instead.

The key differences are computing the shuffles directly rather than shifting and ORing, and using aligned streaming (non-temporal) stores instead of regular aligned stores so the output doesn't pollute the cache.
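The streaming-store part of that suggestion can be sketched in isolation (a minimal SSE2 sketch; `copy_stream` is a hypothetical name, and both pointers must be 16-byte aligned). In the conversion loop it amounts to replacing `_mm_store_si128` with `_mm_stream_si128` and issuing `_mm_sfence` after the loop:

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */

/* Copy 'bytes' (a multiple of 16) from 16-byte-aligned src to
   16-byte-aligned dst with non-temporal stores, so the destination
   lines bypass the cache instead of evicting useful data. */
static void copy_stream(const void *src, void *dst, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128(d++, _mm_load_si128(s++));
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

Non-temporal stores only pay off when the destination is large or won't be read again soon; for output that is consumed immediately, regular stores are usually faster.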


Old question, but I was trying to solve the same problem, so...

You can still use palignr if you shuffle first so that the bytes palignr pulls from each operand land on the zeroed positions. You need the left-aligned shuffle for the second, third, and fourth vectors, and the right-aligned shuffle for the first, second, and third.

For the second and third vectors, GCC is a little happier if I use shifts to derive the right-aligned version from the left-aligned one; when I used two different pshufb masks, it generated three unnecessary register moves.

Here is the code. It uses exactly 8 registers; on 64-bit you can try unrolling it by two.



    __m128i mask_right = _mm_set_epi8(14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0, 0x80, 0x80, 0x80, 0x80);
    __m128i mask = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);

    /* s: source RGB32 bytes (16-byte aligned), d: destination RGB24 bytes,
       n: pixel count, a multiple of 16 */
    for (; n; n -= 16, d += 48, s += 64) {
            __m128i v0 = _mm_load_si128((__m128i *) &s[0]);
            __m128i v1 = _mm_load_si128((__m128i *) &s[16]);
            __m128i v2 = _mm_load_si128((__m128i *) &s[32]);
            __m128i v3 = _mm_load_si128((__m128i *) &s[48]);

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask);
            v3 = _mm_shuffle_epi8(v3, mask);

            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);       // mask -> mask_right
            v1 = _mm_alignr_epi8(v2, v1, 8);
            v2 = _mm_slli_si128(v2, 4);       // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);

            _mm_store_si128((__m128i *) &d[0], v0);
            _mm_store_si128((__m128i *) &d[16], v1);
            _mm_store_si128((__m128i *) &d[32], v2);
    }
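To see why the two alignments line up, it helps to model palignr at the byte level (a hypothetical helper, not the intrinsic itself): `palignr(hi, lo, n)` concatenates `hi:lo` into a 32-byte value and extracts the 16 bytes starting at byte offset `n`. So `_mm_alignr_epi8(v1, v0, 4)` above takes the 12 packed RGB bytes that `mask_right` placed in the top of `v0` and appends the first 4 packed bytes of `v1`:

```c
#include <string.h>
#include <stdint.h>

/* Byte-level model of palignr: concatenate hi:lo as a 32-byte value
   (lo in bytes 0..15, hi in bytes 16..31, matching little-endian
   register order) and return the 16 bytes starting at offset n. */
static void palignr_model(uint8_t out[16], const uint8_t hi[16],
                          const uint8_t lo[16], int n)
{
    uint8_t cat[32];
    memcpy(cat, lo, 16);       /* lo occupies bytes 0..15  */
    memcpy(cat + 16, hi, 16);  /* hi occupies bytes 16..31 */
    memcpy(out, cat + n, 16);
}
```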


The middle section can also be written as follows. The compiler saves one more instruction and there appears to be a little more instruction-level parallelism, but only benchmarking can tell which version is actually faster:

            v0 = _mm_shuffle_epi8(v0, mask_right);
            v1 = _mm_shuffle_epi8(v1, mask);
            v2 = _mm_shuffle_epi8(v2, mask_right);
            v3 = _mm_shuffle_epi8(v3, mask);

            __m128i v2l = v2;
            v0 = _mm_alignr_epi8(v1, v0, 4);
            v1 = _mm_slli_si128(v1, 4);             // mask -> mask_right
            v2 = _mm_alignr_epi8(v3, v2, 12);
            v2l = _mm_srli_si128(v2l, 4);           // mask_right -> mask
            v1 = _mm_alignr_epi8(v2l, v1, 8);

