Load a small vector from a large vector using a mask (SIMD)

Hope someone can help here.

I have a large vector of bytes from which I create a small byte vector (based on a mask), which I then process with SIMD.

The mask is currently a baseOffset plus a sub-pattern bitmask (256 bits), optimized for storage, as the source is large (> 10^8). I create a maximum-size sub-vector, then loop through the mask array: I multiply baseOffset by 256 and, for each set bit in the mask, load the byte at that offset from the large vector and place the values sequentially into the smaller vector. The smaller vector is then processed by multiple VPMADDUBSWs and accumulated. I can change this structure; for example, I could walk the bits once to fill an 8KB array buffer and then create the small vector from that.
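
A minimal sketch of the index mapping this describes (the helper name source_index is mine, for illustration only; it matches the struct layout in the code below):

    // bit k of mask byte j within block blockOffset selects this source byte
    int source_index(int blockOffset, int j, int k)
    {
        return (blockOffset << 8) + (j << 3) + k;   // blockOffset*256 + j*8 + k
    }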

Is there a faster way to create a subarray?

I pulled the code out of the application into the test program below, but the original is in a state of flux (moving to AVX2, and more of it being ported from C#).

#include "stdafx.h"
#include <stdio.h>
#include <mmintrin.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <smmintrin.h>
#include <immintrin.h>


// source values (N) and weights (W)
char N[4096] = { 9, 5, 5, 5, 9, 5, 5, 5, 5, 5 };
char W[4096] = { 1, 2, -3, 5, 5, 5, 5, 5, 5, 5 };

// destination for the compacted sub-vector
char buffer[4096];





__declspec(align(2))
struct packed_destination{
    char blockOffset;    // block index; multiplied by 256 to get the base byte offset
    __int8 bitMask[32];  // 256-bit mask: bit k of byte j selects source byte j*8 + k
};

__m128i sum = _mm_setzero_si128();
packed_destination packed_destinations[10];



void process128(__m128i u, __m128i s)
{
    __m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw: multiply u8*s8 pairs and add adjacent -> 8 x i16
    __m128i loints = _mm_cvtepi16_epi32(calc); // sign-extend the low four i16 to i32
    __m128i hiints = _mm_cvtepi16_epi32(_mm_shuffle_epi32(calc, 0x4e)); // and the high four
    sum = _mm_add_epi32(_mm_add_epi32(loints, hiints), sum);
}

void process_array(char n[], char w[], int length)
{
    sum = _mm_setzero_si128();
    int length128 = length >> 4; // number of 128-bit (16-byte) chunks
    for (int i = 0; i < length128; i++)
    {
        __m128i u = _mm_load_si128((__m128i*)&n[i * 16]);
        __m128i s = _mm_load_si128((__m128i*)&w[i * 16]);
        process128(u, s);
    }
}


void populate_buffer_from_vector(packed_destination packed_destinations[], char n[], int dest_length)
{
    int buffer_dest_index = 0;
    for (int i = 0; i < dest_length; i++)
    {
        int blockOffset = packed_destinations[i].blockOffset << 8;
        // walk the mask and copy the selected source bytes to the buffer
        for (int j = 0; j < 32; j++)
        {
            int joffset = blockOffset + (j << 3); // base source index for this mask byte
            int mask = packed_destinations[i].bitMask[j];
            if (mask & 1 << 0)
                buffer[buffer_dest_index++] = n[joffset + 0];
            if (mask & 1 << 1)
                buffer[buffer_dest_index++] = n[joffset + 1];
            if (mask & 1 << 2)
                buffer[buffer_dest_index++] = n[joffset + 2];
            if (mask & 1 << 3)
                buffer[buffer_dest_index++] = n[joffset + 3];
            if (mask & 1 << 4)
                buffer[buffer_dest_index++] = n[joffset + 4];
            if (mask & 1 << 5)
                buffer[buffer_dest_index++] = n[joffset + 5];
            if (mask & 1 << 6)
                buffer[buffer_dest_index++] = n[joffset + 6];
            if (mask & 1 << 7)
                buffer[buffer_dest_index++] = n[joffset + 7];
        }
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    for (int i = 0; i < 32; ++i)
    {
        packed_destinations[0].bitMask[i] = 0x0f;
        packed_destinations[1].bitMask[i] = 0x04;
    }
    packed_destinations[1].blockOffset = 1;

    populate_buffer_from_vector(packed_destinations, N, 1);
    process_array(buffer, W, 256);

    int val = sum.m128i_i32[0] +
        sum.m128i_i32[1] +
        sum.m128i_i32[2] +
        sum.m128i_i32[3];
    printf("sum is %d"  , val);
    printf("Press Any Key to Continue\n");
    getchar();
    return 0;
}


Typically the mask usage (density of set bits) will be 5-15%; for some workloads it will be 25-100%.

MASKMOVDQU comes close, but we would then need to repack/shuffle according to the mask before storing.


1 answer


Several optimizations for your existing code:

If your data is sparse then it would probably be a good idea to add a test for each 8-bit mask value before testing the individual bits, i.e.

        int mask = packed_destinations[i].bitMask[j];
        if (mask != 0)   // skip empty mask bytes entirely
        {
            if (mask & 1 << 0)
                buffer[buffer_dest_index++] = n[joffset + 0];
            if (mask & 1 << 1)
                buffer[buffer_dest_index++] = n[joffset + 1];
            ...
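
For completeness, here is a sketch of the full inner loop with that early-out applied; the elided cases are filled in from the question's own code above:

    for (int j = 0; j < 32; j++)
    {
        int joffset = blockOffset + (j << 3);
        int mask = packed_destinations[i].bitMask[j];
        if (mask != 0)   // most sparse iterations now skip all eight bit tests
        {
            if (mask & 1 << 0)
                buffer[buffer_dest_index++] = n[joffset + 0];
            if (mask & 1 << 1)
                buffer[buffer_dest_index++] = n[joffset + 1];
            if (mask & 1 << 2)
                buffer[buffer_dest_index++] = n[joffset + 2];
            if (mask & 1 << 3)
                buffer[buffer_dest_index++] = n[joffset + 3];
            if (mask & 1 << 4)
                buffer[buffer_dest_index++] = n[joffset + 4];
            if (mask & 1 << 5)
                buffer[buffer_dest_index++] = n[joffset + 5];
            if (mask & 1 << 6)
                buffer[buffer_dest_index++] = n[joffset + 6];
            if (mask & 1 << 7)
                buffer[buffer_dest_index++] = n[joffset + 7];
        }
    }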

      

Second, your process128 function can be greatly optimized:



inline __m128i process128(const __m128i u, const __m128i s, const __m128i sum)
{
    const __m128i vk1 = _mm_set1_epi16(1);
    __m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw: u8*s8 pairs -> 8 x i16
    calc = _mm_madd_epi16(calc, vk1);       // pmaddwd with 1s: widen and pairwise-add to 4 x i32
    return _mm_add_epi32(sum, calc);
}
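
For reference, a sketch (my adaptation, not from the original answer) of how the calling loop could use the new signature, assuming length is a byte count processed in 16-byte chunks:

    __m128i process_array(const char n[], const char w[], int length)
    {
        __m128i sum = _mm_setzero_si128();
        for (int i = 0; i + 16 <= length; i += 16)
        {
            __m128i u = _mm_load_si128((const __m128i*)&n[i]);
            __m128i s = _mm_load_si128((const __m128i*)&w[i]);
            sum = process128(u, s, sum);   // accumulate via the return value; no global state
        }
        return sum;
    }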


Note that as well as reducing the number of SSE instructions from 6 to 3, I have also made sum a parameter, to get away from any dependency on globals (it's always good to avoid globals, not only for good software engineering, but also because they can inhibit some compiler optimizations).

It would be interesting to see a profile of your code (using a decent sampling profiler rather than instrumentation), as this would help to prioritize further optimization efforts.
