Why isn't updating an array of structs faster than updating an array of classes?

To prepare for an optimization in an existing codebase, I ran a standalone performance test so I could estimate the potential gain before spending a lot of time on it.

Situation

There are N different component types, some of which implement the IUpdatable interface - those are the interesting ones. The components are grouped into M objects, each of which maintains a list of its components. Updating them currently works like this:

foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable != null)
            updatable.Update();
    }
}
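
(For reference, the types involved, reduced to the essentials - this is a sketch, not the actual classes:)

public interface IUpdatable
{
    void Update();
}

// Placeholder base type; the real component classes are not shown here.
public abstract class Component { }

public class GroupObject
{
    public List<Component> Components = new List<Component>();
}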


Optimization

My goal was to optimize these updates for large numbers of objects and components. The first step: make sure that all components of the same type are updated in a row, by caching them in one array per type. Essentially:

foreach (IUpdatable[] compOfType in typeSortedComponents)
{
    foreach (IUpdatable updatable in compOfType)
    {
        updatable.Update();
    }
}
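
For reference, the per-type caches are built along these lines (a simplified sketch, not my actual code; names like byType are placeholders):

Dictionary<Type, List<IUpdatable>> byType = new Dictionary<Type, List<IUpdatable>>();
foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable == null)
            continue;

        List<IUpdatable> list;
        if (!byType.TryGetValue(comp.GetType(), out list))
            byType[comp.GetType()] = list = new List<IUpdatable>();
        list.Add(updatable);
    }
}
// One array per component type; these are then updated in a row as shown above
IUpdatable[][] typeSortedComponents = byType.Values.Select(c => c.ToArray()).ToArray();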


The idea is that the JIT and the CPU have an easier time working on the same type of object over and over than on the shuffled version.

As a next step, I wanted to improve on this further by making sure that all the data for one component type lies contiguously in memory - by storing it in an array of structs, something like this:

foreach (ComponentDataStruct[] compDataOfType in typeSortedComponentData)
{
    for (int i = 0; i < compDataOfType.Length; i++)
    {
        // Indexing the array updates the element in place (no struct copy, no boxing)
        compDataOfType[i].Update();
    }
}
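
(ComponentDataStruct is not shown above; think of it as something along these lines - a simplified sketch, not the exact struct from the benchmark:)

public struct ComponentDataStruct
{
    // All per-component data lives inside the struct, so an array of these
    // is a single contiguous block of memory.
    public float A, B, C, D;

    public void Update()
    {
        // Some primitive math work, as in the real tests
        A += B * 0.5f;
        C -= D * 0.5f;
    }
}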


Problem

In my standalone benchmarks, none of these changes produced a significant gain, and I can't figure out why. "No significant gain" means: with 10,000 components and 100 update cycles per batch, all of the basic tests take about 85 ms +/- 2 ms.

(The only difference comes from the as cast and the if check, but that's not what I was testing for.)

  • All tests were run in Release mode with no debugger attached.
  • External noise was reduced with the following code:

        Process currentProc = Process.GetCurrentProcess();
        Thread currentThread = Thread.CurrentThread;
        currentProc.ProcessorAffinity = new IntPtr(2);
        currentProc.PriorityClass = ProcessPriorityClass.High;
        currentThread.Priority = ThreadPriority.Highest;
    

  • Each test does some primitive math work, so it doesn't just measure empty method calls that could be optimized away.

  • Garbage collection was triggered explicitly before each test to rule out GC interference (see the sketch after this list).
  • Full source code (VS Solution, Build and Run) is available here
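
The preparation before each timed run looks roughly like this (a sketch; RunTest stands in for one of the update variants above, and the actual harness is in the linked solution):

// Force a clean heap before timing, so a collection doesn't land inside the measurement
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Stopwatch watch = Stopwatch.StartNew();
RunTest();   // placeholder for one of the update variants shown above
watch.Stop();
Console.WriteLine("{0} ms", watch.ElapsedMilliseconds);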

I was expecting a significant difference due to the memory layout and the repetitive update pattern. So my main question is: Why wasn't I able to measure a significant improvement? Am I missing something important? Is something wrong with my tests?


1 answer


The main reason you would traditionally prefer the latter implementation is locality of reference. If the contents of the array fit in the CPU's cache, your code runs much faster; conversely, if you get a lot of cache misses, it runs much slower.

The flaw, I suspect, is that the objects in your first test probably already have good locality of reference. If you allocate many small objects at once, they are likely to be contiguous in memory even though they live on the heap. (I'm still looking for a better source on this, but I've observed the same thing in my own work.) Even if they aren't contiguous to begin with, the GC may move them so that they are. Since modern processors have large caches, it may well be that the entire data structure fits in the L2 cache, with little else competing for it. And even where the cache is small, modern CPUs have become very good at recognizing usage patterns and prefetching.

It could also be that your code has to box/unbox your structs somewhere. That seems unlikely, though, if the performance really is identical.
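
For illustration (this is not taken from the benchmark code, and it assumes ComponentDataStruct implements IUpdatable): calling the struct through the interface boxes a copy, while calling it directly on the array element does not.

ComponentDataStruct[] items = new ComponentDataStruct[10000];

for (int i = 0; i < items.Length; i++)
{
    items[i].Update();           // direct call: no boxing, mutates the array element in place
}

for (int i = 0; i < items.Length; i++)
{
    IUpdatable boxed = items[i]; // boxes a copy of the struct onto the heap
    boxed.Update();              // updates the boxed copy, not the element in the array
}

Since your struct loop indexes the array directly, it shouldn't be boxing; this is just what to watch out for.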

The big caveat with low-level work like this in C# is that you really need to either a) trust the framework to do its job, or b) profile under real-world conditions after you've identified an actual performance problem. I appreciate that this may be a toy project, or that you may just be playing around with memory optimizations for fun, but a priori optimization like the one in your OP is unlikely to bring noticeable project-wide performance improvements.

I haven't looked at your code in detail yet, but I suspect the problem here is that your test scenario is unrealistic. With more memory pressure, and especially with components being allocated more dynamically, you might see the performance difference you expect. Then again, you might not, which is why profiling is so important.
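
For example (a sketch only, not something I've profiled): in PrepareTest (shown below) you could interleave the component allocations with other short-lived allocations, so the instances don't end up contiguous, and see whether the struct-array version starts to pull ahead.

// Sketch: scatter the Data instances across the heap instead of allocating them back to back.
Random rng = new Random(42);
List<byte[]> filler = new List<byte[]>();
data = new Base[Program.ObjCount];
for (int i = 0; i < data.Length; i++)
{
    data[i] = new Data();
    filler.Add(new byte[rng.Next(64, 512)]);      // junk allocation between components
    if (rng.Next(4) == 0)
        filler.RemoveAt(rng.Next(filler.Count));  // drop some junk to fragment the heap
}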



It's also worth noting that if you know ahead of time that strict, hand-tuned memory locality is critical to your application, you may want to consider whether a managed language is the right tool for the job.

Edit: Yes, the problem is almost certainly here:

public static void PrepareTest()
{
  data = new Base[Program.ObjCount]; // 10000
  for (int i = 0; i < data.Length; i++)
    data[i] = new Data(); // Data consists of four floats
}
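
(Base and Data aren't shown in this excerpt; going by the comment, presumably something along these lines - a sketch, not the actual classes:)

public abstract class Base
{
    public abstract void Update();
}

public class Data : Base
{
    // "Data consists of four floats"
    public float A, B, C, D;

    public override void Update()
    {
        A += B * 0.5f;
        C -= D * 0.5f;
    }
}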


Those 10,000 instances of Data are probably contiguous in memory. On top of that, they probably all fit in your cache anyway, so I doubt you'd see the impact of cache misses in this benchmark.
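
As a rough estimate, assuming Data really is just four 4-byte floats: the payload is 10,000 × 16 B ≈ 160 KB, and even with per-object headers and the references in the array (roughly 24-32 extra bytes per instance on a 64-bit runtime) the whole working set is only a few hundred kilobytes - small enough to sit comfortably in a typical L2/L3 cache.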
