Cache access with a lock-free spsc_queue

I need to be very concerned about speed and latency in my current multi-threaded project.

Cache access is what I am trying to understand better, and I don't understand how lock-free queues (like boost::lockfree::spsc_queue) use memory at the cache level.

I've seen queues in which a pointer to an object (a LOB, say) that is supposed to be processed on the consumer core is put into the queue.

If the consumer core pops an item from the queue, I assume this means that the item (a pointer, in this case) has already been loaded into the consumer core's L2 and L1 caches. But to access the element itself, wouldn't it have to dereference the pointer, loading the element either from the L3 cache or over the interconnect (if the other thread is on a different CPU socket)? If so, would it perhaps be better to just send a copy of the object, which the consumer can then destroy?

Thanks.

+3




1 answer


C++ is basically a pay-for-what-you-use ecosystem.

Any regular queue will let you choose the storage semantics (by value or by reference).

However, this time you ordered something special: you asked for a lock-free queue. To be lock-free, it must be able to perform all observable modifying operations as atomic operations. This naturally limits the types that can be used directly in these operations.

You might doubt that it is even possible to have a lock-free queue for a value type larger than the native register size of the host (say, int64_t).

Good question.

Enter Ringbuffers

In fact, any node-based container just needs to swap pointers for all its modifying operations, which can be done atomically on all modern architectures. But is anything that involves copying multiple distinct regions of memory, in a non-atomic sequence, really an intractable problem?

No. Imagine a flat array of POD items. Now, if you treat the array as a circular buffer, you only need to maintain the front and back indices of the buffer atomically. The container could, in its own time, update an internal "dirty" front index while it copies ahead of the externally visible front. (The copy can use relaxed memory ordering.) Only once the entire copy is known to be complete is the external front index updated. That update must be in acq_rel/cst memory order [1].

As long as the container is able to protect the invariant that front never fully wraps around and reaches back, this is a sweet deal. I believe this idea was popularized in the Disruptor library (of LMAX fame). You get mechanical sympathy from:

  • linear memory access patterns when reading / writing
  • even better if you can make the record size a (multiple of the) physical cache line size
  • all data is local, unless the POD contains raw pointers to data outside of the record

How does Boost's spsc_queue actually work?



  • Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (for example, in compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied at compile time):

    typedef typename boost::aligned_storage<max_size * sizeof(T),
                                            boost::alignment_of<T>::value
                                           >::type storage_type;
    
    storage_type storage_;
    
    T * data()
    {
        return static_cast<T*>(storage_.address());
    }
    

    (The element type T does not have to be POD, but it must be both default-constructible and copy-constructible.)

  • Yes, the read and write indices are atomic integral values. Note that the Boost developers have taken care to apply enough padding to avoid False Sharing between the read and write indices' cache lines (from ringbuffer_base):

    static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
    atomic<size_t> write_index_;
    char padding1[padding_size]; /* force read_index and write_index to different cache lines */
    atomic<size_t> read_index_;
    

  • In fact, as you can see, there is only an "internal" index on either the read or the write side. This is possible because there is only a single writer thread and, likewise, only a single reader thread, which means that at worst there can only be more free space at the end of a write operation than anticipated.

  • Several other optimizations are present:

    • branch-prediction hints on supporting platforms ( unlikely() )
    • you can push/pop a range of items at once; this can improve throughput when you need to siphon items from one buffer/ringbuffer into another, especially if the raw element size is not a (multiple of a) cache line
    • std::uninitialized_copy is used where possible
    • calls to trivial constructors/destructors are optimized out at instantiation time
    • uninitialized_copy will be optimized into memcpy on all major standard-library implementations (meaning that, e.g., SSE instructions will be employed if your architecture supports them)

Overall, we see a close to best-in-class implementation of the ringbuffer idea.

What to use

Boost has given you all the options. You can choose to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and may not be optimal.

On the other hand, storing the complete message type in the element type can become costly if copying is expensive. At the very least, try to make the element size fit a cache line nicely (typically 64 bytes on Intel).

So in practice you might consider storing frequently used data directly in the value, and referencing the less-used data via a pointer (the cost of the pointer is low unless it is traversed).

If you need this "attachment" model, consider using a custom allocator for the referenced data so you can achieve favorable memory access patterns there too.

Let your profiler guide you.


[1] I suppose acq_rel should work for SPSC, but I'm a little rusty on the details. As a rule of thumb, I make it a point not to write lock-free code myself. I recommend everyone else follow my example :)

+8

