Binary tree cache locality

If I have a tree like the following

typedef struct tree_t {
    // data
    struct tree_t *left;
    struct tree_t *right;
} tree_t;


and I want to start allocating memory for leaves, is there a way to ensure that when traversing the tree, the leaves are cached? If I were using malloc, I think the leaves would be scattered across the heap, and every time I try to access one I'll get a cache miss.

+3




5 answers


Others have given the correct answer: a fixed-size pool ( https://en.wikipedia.org/wiki/Memory_pool ), but there are some additional caveats that deserve a more thorough description. It is still possible, even likely, that leaves allocated from a memory pool will have a low cache hit ratio. Blocks of 8*n tree_t objects aligned on 64-byte boundaries are ideal, although there is no advantage above n == 1024.

As a side note, consider Judy arrays, a cache-optimized tree data structure ( https://en.wikipedia.org/wiki/Judy_array ).

It is helpful to consider (briefly) how caches work. A cache is divided into fixed-size lines, grouped into sets. Typically the L1 line size is 64 bytes; Intel and AMD have used 64-byte L1 cache lines for 15 years, and modern ARM processors such as the A15 also use 64-byte lines. Associativity determines how many lines belong to a set. When data is brought into the cache, some function hashes the address to a set, and the data can be stored in any of the lines of that set. For example, in a two-way set-associative cache, any given address can be stored in one of two possible cache lines.
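To make the set mapping concrete, here is a small sketch assuming a typical L1 (32 KiB, 8-way set-associative, 64-byte lines, giving 64 sets); the real parameters vary by processor:

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64                         /* assumed bytes per cache line */
#define NUM_WAYS  8                          /* assumed lines per set */
#define NUM_SETS  (32 * 1024 / (LINE_SIZE * NUM_WAYS))

int main(void) {
    int x;                                   /* any object will do */
    uintptr_t addr = (uintptr_t)&x;
    uintptr_t line = addr / LINE_SIZE;       /* which cache line holds this byte */
    uintptr_t set  = line % NUM_SETS;        /* which set that line lands in */
    /* Any two addresses whose line numbers differ by a multiple of
       NUM_SETS compete for the same NUM_WAYS lines in that set. */
    printf("addr %p -> set %u of %u\n",
           (void *)&x, (unsigned)set, (unsigned)NUM_SETS);
    return 0;
}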

Maximizing cache utilization comes down to reducing the number of cache fetches:
1. Organize data into cache-line-sized chunks.
2. Align chunks on cache-line boundaries.
3. Allocate chunks at addresses that map to different cache sets.
4. Keep data with temporal locality (i.e., data accessed at about the same time) in the same cache line.
5. Reduce the size of the data, where possible, to increase density.



If you don't do (1), fetches will drag in nearby, probably unused data, reducing the space available for the data you are interested in. If you don't do (2), your objects will likely straddle cache lines, requiring twice as many fetches. If you don't do (3), some cache sets are likely to be underutilized, with an effect similar to (1). If you don't do (4), then even if you maximize cache usage, most of the fetched data is not useful when it is fetched, and the line is likely to be evicted before the data becomes useful. (5) increases the number of objects that fit in the cache by packing them into less space. For example, if you can guarantee that you have fewer than 2^32 leaves, you can store uint32_t indices into a tree_t[] array rather than pointers, a 100% improvement on 64-bit platforms.
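As an illustration of point (5), a hypothetical index-based layout might look like the sketch below (the names here are mine, not part of the answer): nodes live in one array, and children are identified by 32-bit indices rather than pointers, halving the per-child overhead on 64-bit platforms.

#include <stdint.h>

#define NIL UINT32_MAX            /* sentinel index meaning "no child" */

typedef struct {
    /* data */
    uint32_t left;                /* index into nodes[], or NIL */
    uint32_t right;
} itree_t;                        /* 8 bytes, vs. 16+ with 64-bit pointers */

static itree_t nodes[1u << 20];   /* fewer than 2^32 leaves: indices suffice */

static itree_t *child(itree_t *n, int go_left) {
    uint32_t i = go_left ? n->left : n->right;
    return i == NIL ? NULL : &nodes[i];
}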

Note: malloc() usually returns blocks aligned on 8-byte or 16-byte boundaries, which is not enough here; use posix_memalign() with GCC or _aligned_malloc() with MSVC.
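For instance, a minimal sketch of grabbing an aligned block for a pool might look like this (pool_alloc and LINE are my names, and error handling is kept to the bare minimum):

#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#include <stdlib.h>

typedef struct tree_t {
    /* data */
    struct tree_t *left;
    struct tree_t *right;
} tree_t;

#define LINE 64                   /* assumed cache-line size */

/* Carve out room for 8*n tree_t objects on a cache-line boundary.
 * With MSVC, _aligned_malloc(8 * n * sizeof(tree_t), LINE) plays
 * the same role. */
tree_t *pool_alloc(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, LINE, 8 * n * sizeof(tree_t)) != 0)
        return NULL;              /* allocation failed */
    return p;
}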

In your case, you are traversing a tree, presumably in order. If your dataset doesn't fit in the cache, the leaves are likely to be distributed evenly throughout it, and therefore unlikely to have temporal locality with nodes in the same cache line. In that case, the best you can do is make sure your objects do not cross cache lines, by allocating cache-line-sized, cache-line-aligned chunks.

I chose blocks of 8*n tree_t based on the conservative assumption of a 64-byte cache line and 4-byte pointers, which gives an 8-byte tree_t and 64/8 = 8 tree_t per line. The upper bound of n == 1024 comes from a specific x86 processor (which is eluding me at the moment) that ignores bit 18 of the address for set-selection purposes.

+4




Caching behavior can be improved on a platform of your choice, but of course there is little to guarantee consistent success from run to run.

But let's try a few ideas:



  • Create tree_alloc() and tree_free() that allocate and manage several struct tree_t in a group, say 256 on the first call, and then dole them out for the next 255 allocations. Random allocate/free patterns make this harder, but if the tree is large and grows/shrinks uniformly, it may be worth the effort.

  • Make tree_t small. Make the data a pointer.

    struct tree_t {
        data_T *data;
        struct tree_t *left;
        struct tree_t *right;
    };
    

Oops! GTG - will make this wiki

+1




malloc makes no guarantees about where memory will be allocated. If you want data laid out to take advantage of cache locality, a simple alternative is to allocate an array of the structure and then allocate from that array, i.e., an object pool. It is basically like writing your own memory allocator, except greatly simplified, since every item is the same size. It is also much easier if you know the maximum number of items you will need, so you don't have to add the ability to grow your "memory pool". You will also need to consider thread safety if your allocation/access functions will be called from multiple threads. There are many other things to consider; these are just a few.
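A minimal sketch of such a pool, assuming a known compile-time maximum, a single thread, and the tree_t from the question (MAX_NODES and the function names are illustrative):

#include <stddef.h>

typedef struct tree_t { struct tree_t *left, *right; } tree_t;  /* data omitted */

#define MAX_NODES 1024            /* known maximum, chosen for illustration */

static tree_t pool[MAX_NODES];    /* one contiguous array of nodes */
static tree_t *free_list = NULL;  /* chain of nodes returned by pool_put() */
static size_t next_unused = 0;    /* high-water mark into pool[] */

tree_t *pool_get(void) {
    if (free_list != NULL) {      /* reuse a freed node first */
        tree_t *n = free_list;
        free_list = n->left;
        n->left = n->right = NULL;
        return n;
    }
    if (next_unused < MAX_NODES)  /* otherwise hand out the next fresh slot */
        return &pool[next_unused++];
    return NULL;                  /* pool exhausted */
}

void pool_put(tree_t *n) {
    n->left = free_list;          /* thread the node onto the free list */
    n->right = NULL;
    free_list = n;
}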

Note: as others have said in the comments, premature optimization is usually not worth it, or worse, counterproductive, but if you want to give it a try, this is the way to go.

You can find many helpful references on object pools.

0




Keep your data structure logic separate from your memory allocation logic, and you shouldn't have any problems optimizing your code.

For example, your add function should not contain malloc or free. Think about how sprintf works: the first argument is a pointer to where the string will be written. So make your add function look like this:

int add(struct tree *destination, struct tree *source) {
    // XXX: Add source into destination
    return 0;
}


... and allocate source before calling add. That way you can choose automatic storage duration (e.g. struct tree foo[128]; one such array should be nice and cache-friendly) if that suits you, or you can use malloc to allocate one node at a time (not cache-friendly), or you can use malloc to allocate large groups of nodes (should be cache-friendly, again)... This gives you room to optimize, yeah?
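For instance, a hypothetical fragment using automatic storage, assuming the add() above just links source into destination:

void example(void) {
    struct tree nodes[128];              /* one compact, cache-friendly block */
    int used = 0;

    struct tree *root = &nodes[used++];  /* the caller decides where nodes live */
    struct tree *leaf = &nodes[used++];
    add(root, leaf);                     /* add() only links; it never allocates */
}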

You should only optimize after you have decided your code is too slow, and your profiler has told you why.

0




This code is not an example of the world's best software development practices, but it does what you are looking for:

#include <stdlib.h>

tree_t *garbage = NULL;

tree_t *alloc_tree_t(void) {
    if (garbage == NULL) {
        /* Carve out 1024 nodes at once and chain them into a free list
           through their left pointers. */
        tree_t *nodearr = malloc(sizeof(tree_t) * 1024);
        if (nodearr == NULL)
            return NULL;
        for (int i = 0; i < 1023; i++) {
            nodearr[i].left = &nodearr[i + 1];
        }
        nodearr[1023].left = NULL;   /* terminate the free list */
        garbage = &nodearr[0];
    }
    tree_t *tmp = garbage;
    garbage = tmp->left;
    tmp->left = tmp->right = NULL;
    return tmp;
}

void free_tree_t(tree_t *p) {
    p->left = garbage;   /* push the node back onto the free list */
    p->right = NULL;
    garbage = p;
}


0

