CPU values (cache misses / hits) that don't make sense

Question

I am using Intel PCM for fine grained CPU measurements. In my code, I am trying to measure the efficiency of the cache.

Basically, I first put a small array in the L1 cache (iterating over it many times), then I start the timer, iterate over the array again (which hopefully uses the cache), and then disable the timer.

PCM shows a fairly high L2 and L3 miss ratio. I also checked with rdtscp, and the loop takes about 15 cycles per array operation (well above the 4-5 cycles expected for an L1 cache access).

What I would expect is that the array sits entirely in the L1 cache, so I should not see high L1, L2 and L3 miss ratios.

My system has 32 KB, 256 KB and 25 MB of L1, L2 and L3 cache respectively. Here's my code:

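#include <cstdlib>        // exit
#include <iostream>       // std::cout, std::cerr
#include "cpucounters.h"  // Intel PCM

using namespace std;

static const int ARRAY_SIZE = 16;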

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;

    SystemCounterState before = getSystemCounterState();

    while ((current = current->next) != NULL) {
        sum += current->pad;
    }

    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses:     " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits:       " << getL2CacheHits(before, after) << endl; 
    cout << "L2 hit ratio:  " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses:     " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits:       " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio:  " << getL3CacheHitRatio(before, after) << endl;

    cout << "Sum:   " << sum << endl;
    m->cleanup();
    return 0;
}

This is the output:

Instructions per clock: 0.408456
Cycles per op:        553074
L2 Cache Misses:      58775
L2 Cache Hits:        11371
L2 cache hit ratio:   0.162105
L3 Cache Misses:      24164
L3 Cache Hits:        34611
L3 cache hit ratio:   0.588873
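
As a sanity check on the numbers above, here is a minimal sketch that recomputes the two ratios. The formula hits / (hits + misses) and the helper names hitRatio and checkReportedRatios are assumptions made for illustration, not taken from the PCM source; the formula is used only because it reproduces the printed values. It also shows that the L3 hits and L3 misses add up exactly to the L2 misses.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hit ratio as hits / (hits + misses). This definition is an assumption;
// it is used here because it matches the ratios printed in the output.
double hitRatio(std::uint64_t hits, std::uint64_t misses) {
    const std::uint64_t total = hits + misses;
    return total == 0 ? 0.0
                      : static_cast<double>(hits) / static_cast<double>(total);
}

// Recomputes the reported ratios from the reported raw counts.
void checkReportedRatios() {
    const std::uint64_t l2Hits = 11371, l2Misses = 58775;
    const std::uint64_t l3Hits = 34611, l3Misses = 24164;

    // Both ratios agree with the printed values to six decimal places.
    assert(std::fabs(hitRatio(l2Hits, l2Misses) - 0.162105) < 1e-6);
    assert(std::fabs(hitRatio(l3Hits, l3Misses) - 0.588873) < 1e-6);

    // Every L2 miss shows up as either an L3 hit or an L3 miss.
    assert(l3Hits + l3Misses == l2Misses);
}
```

So the counters are internally consistent; the question is what they are counting, not whether the arithmetic is off.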

EDIT: I also measured the following code and still got the same miss rates, although I expected them to be almost zero:

SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();

EDIT 2: As pointed out in one of the comments, these results may be due to the overhead of the profiler itself. So instead of traversing the array once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler overhead. I am still getting very low L2 and L3 cache hit ratios (about 15%).
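
The amortization idea in EDIT 2 can be sketched without PCM. The following is only an illustration under stated assumptions: it uses std::chrono::steady_clock instead of hardware counters, so it shows the structure of the measurement (one timed region around many traversals) rather than cache-miss counts. The names Node, ChaseResult and timeChase are hypothetical, and unlike the loop in the question this version also visits the first node.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Node {
    Node* next;
    long pad;
};

struct ChaseResult {
    long sum;         // checksum of all visited pad values
    double nsPerHop;  // average wall-clock nanoseconds per visited node
};

// Builds a sequential linked list of `size` nodes and walks it `repeats`
// times inside one timed region, so any fixed start/stop cost is spread
// over size * repeats node visits.
ChaseResult timeChase(std::size_t size, std::size_t repeats) {
    assert(size > 0 && repeats > 0);

    std::vector<Node> nodes(size);
    for (std::size_t i = 0; i < size; ++i) {
        nodes[i].next = (i + 1 < size) ? &nodes[i + 1] : nullptr;
        nodes[i].pad = static_cast<long>(i);
    }

    long sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < repeats; ++r) {
        for (const Node* cur = nodes.data(); cur != nullptr; cur = cur->next) {
            sum += cur->pad;
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {sum, ns / static_cast<double>(size * repeats)};
}
```

Calling timeChase(16, 200000000) would mirror the repetition count from EDIT 2. Note that an optimizing compiler may simplify a read-only loop like this one, so the returned checksum should at least be printed or otherwise used.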

c ++ caching cpu cpu-cache performancecounter

narengi May 16 '15 at 1:00

1 answer

effenok · Answer 1 · 2015-07-25T17:59:47+0000

You seem to be getting the L2 and L3 misses from all cores on your system.

I'm looking at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp

[1] In the implementation of PCM::program() on line 1407, I don't see any code that restricts the events to a specific process.

[2] In the implementation of PCM::getSystemCounterState() on line 2809, you can see that the events are collected from all cores on your system. So I would try setting the affinity of the process to one core and then reading events only from that core, using this function:

CoreCounterState getCoreCounterState(uint32 core)
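
The pinning half of that suggestion could look roughly like the sketch below. It assumes Linux with glibc (sched_setaffinity and sched_getcpu are not portable), and pinToCore is a hypothetical helper name. The PCM calls appear only in a comment because they need the PCM library; applying getL2CacheHitRatio to CoreCounterState is an assumption based on the getters being templated on the counter state type.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for CPU_SET and sched_getcpu on glibc
#endif
#include <sched.h>

// Restricts the calling thread to a single core.
// Returns true if the kernel accepted the new affinity mask.
bool pinToCore(int core) {
    if (core < 0 || core >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Intended use with PCM (not compiled here, shown as an assumption):
//
//   const int core = sched_getcpu();
//   if (pinToCore(core)) {
//       CoreCounterState before = getCoreCounterState(core);
//       /* ... measured loop ... */
//       CoreCounterState after = getCoreCounterState(core);
//       std::cout << getL2CacheHitRatio(before, after) << std::endl;
//   }
```

Even with pinning, per point [1] the counters are not restricted to one process, so any other work scheduled on that core during the measured region would presumably still be included.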
