Effects of a 2-bit bit-field array on cache performance and efficiency?

I need a 2-bit array. I don't care about memory consumption at all, but I'm interested in minimizing cache misses and maximizing cache efficiency. Using an array of bools would use 4x the memory, which means that for every used chunk of data in the cache, there would be three that go unused. So in principle I can get much better cache density if I use bit fields.

The plan is to implement it as an array of bytes, each divided into four 2-bit fields, and use the div function to get the quotient and remainder (possibly in one instruction) and use them to select the right byte index and the right field within the byte.
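A minimal sketch of what that accessor might look like, assuming std::div for the quotient/remainder split as described above (the class name and layout are my own illustration, not a fixed design):

```cpp
#include <cstdlib>   // std::div
#include <vector>

// Packed 2-bit array: four 2-bit fields per byte.
// std::div returns quotient (byte index) and remainder (field index)
// in a single call, as the question suggests.
class TwoBitArray {
public:
    explicit TwoBitArray(std::size_t n) : bytes_((n + 3) / 4, 0) {}

    unsigned get(int i) const {
        std::div_t d = std::div(i, 4);   // d.quot = byte, d.rem = field
        return (bytes_[d.quot] >> (2 * d.rem)) & 0x03u;
    }

    void set(int i, unsigned v) {
        std::div_t d = std::div(i, 4);
        unsigned char mask = static_cast<unsigned char>(0x03u << (2 * d.rem));
        bytes_[d.quot] = static_cast<unsigned char>(
            (bytes_[d.quot] & ~mask) | ((v & 0x03u) << (2 * d.rem)));
    }

private:
    std::vector<unsigned char> bytes_;
};
```

Note that since the divisor is a constant power of two, a compiler will typically lower the div/mod to a shift and a mask anyway, so an explicit `i >> 2` / `i & 3` pair compiles to the same thing.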

The array I need is about 10,000 elements, so the significantly denser packing with 2 actual bits per element would let the entire array fit in the L1 cache, which would not be possible with a byte array.

So my question is: can anyone tell me whether this is a good idea for a performance-oriented task, so I know whether it's worth going ahead and implementing a 2-bit array? Of course the best way to find out is profiling, but any information in advance would be helpful and appreciated.



2 answers


With 10,000 elements on a modern CPU, it should fit into cache comfortably even as bytes (10KB), so I wouldn't worry too much about that unless you want this to run on a very small microprocessor with a cache that's much smaller than the typical 16-32KB L1 caches found in modern processors.



Of course, you may want to test performance with different solutions if you think this is a performance-critical part of your code [as measured by the profiling you already did before you started optimizing, right?].



It is not clear to me that this will lead to a performance gain. Each field will require multiple instructions to access ( (data[i / 4] >> 2 * (i % 4)) & 0x03 ), and many modern processors have a cache that will hold the entire array even at one byte per entry. Whether the extra access cost will be more or less than the difference in caching is difficult to say; you will need to profile to know for sure.

If you can arrange the algorithms to work a byte (or even a word) at a time, the cost of access can be much less. When iterating over the whole array, for example, the gain from:



for ( int i = 0; i < 10000; i += 4 ) {
    unsigned char w1 = data[ i / 4 ];   //  one load serves four entries
    for ( int j = 0; j < 4; ++ j ) {
        unsigned char w2 = w1 & 0x03;
        //  w2 is entry i + j...
        w1 >>= 2;
    }
}

can be significant. Most compilers will be able to keep w1 and w2 in registers, which means you'll make only a quarter of the memory accesses. Packing into an unsigned int will probably be even faster.
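A sketch of that word-at-a-time variant, assuming 32-bit words (this is my own illustration of the suggestion, not the answerer's code): one unsigned 32-bit load serves sixteen 2-bit entries.

```cpp
#include <cstdint>
#include <vector>

// Unpack n_entries 2-bit values from an array of 32-bit words.
// Each word holds sixteen entries, least-significant bits first,
// so one memory access covers sixteen elements.
std::vector<unsigned> unpack_all(const std::vector<std::uint32_t>& words,
                                 std::size_t n_entries) {
    std::vector<unsigned> out;
    out.reserve(n_entries);
    for (std::size_t i = 0; i < n_entries; i += 16) {
        std::uint32_t w = words[i / 16];          // one load per 16 entries
        for (std::size_t j = 0; j < 16 && i + j < n_entries; ++j) {
            out.push_back(w & 0x03u);             // entry i + j
            w >>= 2;
        }
    }
    return out;
}
```

As with the byte version, the word stays in a register for the whole inner loop, so the per-element cost is a shift and a mask rather than a memory access.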







