Page alignment does not affect performance...?

I am developing an application for which performance is a fundamental concern. In particular, I wanted to organize a tree structure that has to be traversed very quickly into blocks whose size equals my memory page size, in order to reduce the number of cache misses needed to reach a leaf.

I am new to the field of memory optimization. As I understand it, the process of accessing main memory goes more or less like this:

  • CPUs have multiple levels of caches of increasing size and decreasing speed.
  • Whenever some data I need is already in the cache, it is fetched from the cache (a cache hit).
  • If it is not in the cache, it will be fetched from main memory.
  • Any time something is loaded from main memory, the entire page (or pages) containing that data is loaded and stored in the cache. That way, if I try to access memory locations close to ones I have already fetched from main memory, they will already be in my processor cache. (See the sketch right after this list for how these sizes can be queried at runtime.)
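
A minimal sketch, for reference, of how the page size and cache geometry can be queried at runtime on Linux (the _SC_LEVEL*_CACHE* names are glibc extensions and may be reported as 0 on some systems):

#include <iostream>
#include <unistd.h>

using namespace std;

int main()
{
    // Size of one virtual memory page (typically 4096 bytes).
    cout<<"page size:      "<<sysconf(_SC_PAGE_SIZE)<<endl;

    // Cache geometry; these names are glibc extensions and may return 0.
    cout<<"L1d cache size: "<<sysconf(_SC_LEVEL1_DCACHE_SIZE)<<endl;
    cout<<"L1d line size:  "<<sysconf(_SC_LEVEL1_DCACHE_LINESIZE)<<endl;
    cout<<"L2 cache size:  "<<sysconf(_SC_LEVEL2_CACHE_SIZE)<<endl;
    cout<<"L3 cache size:  "<<sysconf(_SC_LEVEL3_CACHE_SIZE)<<endl;
}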

However, if I organize my data in blocks of the same size as my memory page, I thought it would also be necessary to align that data correctly, so that whenever a new block of my data has to be loaded, only one memory page needs to be retrieved from main memory (rather than two pages containing the first half and the second half of my data block). Basically, shouldn't a properly aligned block of data mean only one memory access instead of two? Shouldn't that more or less double memory performance?
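
If alignment did turn out to matter, page-aligned storage could be obtained with posix_memalign. The following is only a minimal sketch of that idea; the benchmark below deliberately does the opposite and tries every possible offset:

#include <stdlib.h>
#include <unistd.h>

int main()
{
    unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);

    // Request storage aligned to a page boundary, so that each
    // page-sized block occupies exactly one memory page.
    void * raw = NULL;
    if(posix_memalign(&raw, pagesize, 1024 * pagesize) != 0)
        return 1;

    unsigned int * blocks = (unsigned int *) raw;
    blocks[0] = 0;   // ... fill and traverse the page-aligned blocks here ...

    free(raw);
}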

I tried the following:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>

using namespace std;

#define BLOCKS 262144
#define TESTS 131072

// Wall-clock time in microseconds.
unsigned long int utime()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return tp.tv_sec * 1000000 + tp.tv_usec;
}

unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);              // bytes per memory page
unsigned long int block_slots = pagesize / sizeof(unsigned int);  // unsigned ints per page
unsigned int t = 0;                                               // running sum of visited values
unsigned int p = 0;                                               // index of the next block to visit

// Sums TESTS pseudo-randomly chosen blocks of block_slots unsigned ints
// starting at 'data' and returns the elapsed time in microseconds.
unsigned long int test(unsigned int * data)
{
    unsigned long int start = utime();

    for(unsigned int n=0; n<TESTS; n++)
    {
        // Sum one whole block (one page worth of unsigned ints).
        for(unsigned int i=0; i<block_slots; i++)
            t += data[p * block_slots + i];

        // The next block index depends on the running sum, so the
        // access order is effectively unpredictable.
        p = t % BLOCKS;
    }

    unsigned long int end = utime();

    return end - start;
}

int main()
{
    srand((unsigned int) time(NULL));

    // Roughly 1 GiB of storage, plus a few spare pages so that shifted
    // accesses stay within bounds.
    char * buffer = new char[(BLOCKS + 3) * pagesize];

    for(unsigned int i=0; i<(BLOCKS + 3) * pagesize; i++)
        buffer[i] = rand();

    // Run the test at every possible byte offset within a memory page.
    for(unsigned int i=0; i<pagesize; i++)
        cout<<test((unsigned int *) (buffer + i))<<endl;

    cout<<"("<<t<<")"<<endl;
    delete [] buffer;
}

      

This code allocates roughly 1 GB of memory and fills it with random numbers. The test function is then called with every possible offset within a memory page (offset 0 to offset pagesize - 1). The test function interprets the pointer it is given as a group of data blocks and performs a simple operation (a sum) on those blocks. The order in which blocks are accessed is more or less random (it is determined by the partial sums), so every time a new block is accessed it will almost certainly not already be in the cache.
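
To make the "one page versus two pages" intuition concrete, here is a small helper (illustrative only, not part of the benchmark above) that counts how many distinct pages a page-sized block touches when it starts at a given byte offset:

#include <iostream>
#include <unistd.h>

using namespace std;

// Number of distinct memory pages covered by a block of pagesize bytes
// that starts 'offset' bytes past a page boundary.
unsigned long int pages_touched(unsigned long int offset, unsigned long int pagesize)
{
    unsigned long int first_page = offset / pagesize;
    unsigned long int last_page = (offset + pagesize - 1) / pagesize;
    return last_page - first_page + 1;
}

int main()
{
    unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);

    cout<<pages_touched(0, pagesize)<<endl;    // 1: an aligned block fits in one page
    cout<<pages_touched(64, pagesize)<<endl;   // 2: a shifted block straddles a boundary
}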

The time taken by the test function is then printed. I expected that for all but one shift configuration I would observe roughly the same timing, while for one particular shift configuration (zero shift, maybe?) I would observe a significant improvement in efficiency. This, however, does not happen at all: the timings for all shifts are perfectly compatible with one another.

Why is this happening, and what does it mean? Can I simply forget about memory alignment? Can I also forget about making my data blocks exactly as large as a memory page? (I was planning to use some padding if they turned out smaller.) Or is there something about how the cache is managed that I am not understanding?
