Stack vs. heap performance with cache-friendly code

A few days ago, I started tinkering with cache-friendly code, playing with a few different constructs to see how performance changes when I allocate variables on the stack versus the heap, and how different memory layouts scale with linear tasks like iteration and search.

I'm not measuring allocation time, just processing performance.

The benchmarks are not precise, but they should at least give a rough idea of the relative numbers, as performance may vary.

First of all, I compared the performance of a std::array with that of a std::vector.

The code for the array:

int main()
{
    std::array<mango::int16, 5000000> v;

    mango::delta_timer timer; // simple timer class

    for (int i = 0; i < 5000000; ++i)
    {
        v[i] = i; // the values wrap past int16's range, but that's no problem in this case
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16& i) -> void { ++i; });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch(); /* cross-platform wrapper for _getch() --> gives me a
    point where I can exit the program without printing the results */

    mango::for_each(v.begin(), v.end(), print); /* print the entire
    array and hope that this prevents the compiler from optimizing it away */

    return 0;
}


The code for the regular vector:

int main()
{
    std::vector<mango::int16> v;
    v.reserve(5000000);

    mango::delta_timer timer;

    for (int i = 0; i < 5000000; ++i)
    {
        v.push_back(i);
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16& i)->void {++i; });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch();

    mango::for_each(v.begin(), v.end(), print);

    return 0;
}


The for_each pass over the array took 0.003 to 0.004 seconds, and the for_each pass over the vector took 0.005 to 0.007 seconds.

After these first tests, I wrote a very thin, minimalistic allocator to see whether I could get the same performance out of heap memory as out of the stack.

The allocator looks like this:

class block_allocator
{
public:
    block_allocator(mango::int32 n, mango::int32 bsize, mango::int32 id)
        : m_Memory(new mango::byte[n * bsize]), m_Capacity(n), m_BlockSize(bsize), m_ID(id)
    {
        m_Blocks.reserve(n); // reserve, don't pre-fill with n null pointers

        for (mango::byte* iterator = (mango::byte*)m_Memory; iterator < (mango::byte*)m_Memory + n * bsize; iterator += bsize)
        {
            m_Blocks.push_back(iterator);
        }
    }

    ~block_allocator()
    {
        delete[](mango::byte*)m_Memory;
        m_Memory = nullptr;
    }

    void* allocate(mango::uint32 n) // n is ignored; every block is m_BlockSize bytes
    {
        if (m_Blocks.empty())
        {
            throw mango::exception::out_of_range(mango::to_string(m_ID) + std::string(" allocator went out of range"), "out_of_range");
        }

        void* block = m_Blocks.back();
        m_Blocks.pop_back();

        return block;
    }

    void deallocate(void* target)
    {
        if (m_Blocks.size() == static_cast<std::size_t>(m_Capacity))
        {
            // every block is already free, so target cannot belong to this allocator;
            // deleting it here would be undefined behavior on a pointer into m_Memory
            throw mango::exception::out_of_range(mango::to_string(m_ID) + std::string(" allocator freed a foreign block"), "out_of_range");
        }

        m_Blocks.push_back(target);
    }

private:
    void*                m_Memory;

    mango::int32         m_Capacity;
    mango::int32         m_BlockSize;
    mango::int32         m_ID;

    std::vector<void*>   m_Blocks;
};


This is a very minimalistic piece for testing only and is not suitable for production use!

And this is my test case using the allocator:

int main()
{
    std::array<mango::int16*, 5000000> v;

    mango::delta_timer timer;

    for (int i = 0; i < 5000000; ++i)
    {
        v[i] = allocate_int(i); //allocates an int with the allocator
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16* i)->void {++(*i); });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch();

    mango::for_each(v.begin(), v.end(), print);

    return 0;
}


In this example, the for_each pass again took between 0.003 and 0.004 seconds, matching the first array example.

There is no cleanup in any of these examples, I know.

So here's the question: given that I had to increase the stack size in Visual Studio 2015 just to run this sample (otherwise a stack overflow occurs), and that the stack gets slower as its size grows, what is the preferred way of writing cache-friendly code?

Using a caching allocator that keeps objects close together on the heap achieves performance equal to the stack (this may vary across examples, but even coming close to peak stack performance should be sufficient for most programs, I think).

Wouldn't it be more efficient to build a proper allocator, keep most things on the heap, and hold only a small number of "real" allocations, instead of overusing the stack? I ask because I read "use the stack as often as you can" all over the internet, and I suspect it's not as simple as many people think.

Thanks.


1 answer


Don't overestimate the cache value of keeping everything on the stack. Yes, it is nice for freshly allocated objects to land in cache lines that are already hot. But on Haswell, for example, cache lines are only 64 bytes, so you run out of contiguity as far as the cache is concerned pretty quickly. (There is some benefit to how allocations distribute across cache sets, but it is secondary.) And if you write code that can actually exploit extra cache locality, you are usually working with large arrays of numbers that are contiguous no matter where they live.

The "use the stack, not the heap" advice is, I think, really about avoiding indirection.



With all that said, there is a minor benefit to a separate allocator that assumes and exploits a LIFO allocation pattern. But that benefit comes from reduced bookkeeping costs, not from cache friendliness.
