Why unboxing is 100 times faster than boxing

Why is there such a large speed difference between boxing and unboxing operations? In my test the difference is about 10 times. When should we care about this? Last week, Azure support informed us that there is a problem with our application's heap memory. I'm curious whether this might be related to boxing and unboxing.

using System;
using System.Diagnostics;

namespace ConsoleBoxing
{
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Program started");
        var elapsed = Boxing();
        Unboxing(elapsed);
        Console.WriteLine("Program ended");
        Console.Read();
    }

    private static void Unboxing(double boxingtime)
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;      // value lives on the stack
            object b = a;    // boxing: the value is copied into a new object on the heap
            int c = (int)b;  // unboxing happens only here: the value is copied from the heap back to the stack
        }
        s.Stop();

        var UnBoxing =  s.Elapsed.TotalMilliseconds- boxingtime;
        Console.WriteLine("UnBoxing time : " + UnBoxing);
    }

    private static double Boxing()
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;
            object b = a;
        }
        s.Stop();
        var elapsed = s.Elapsed.TotalMilliseconds;
        Console.WriteLine("Boxing time : " + elapsed);
        return elapsed;
    }
}
}

      

+3


8 answers


Think of unboxing as a single instruction that loads memory from the boxed object into a register, perhaps with a little surrounding address calculation and cast-validation logic. A boxed object is like a class with one field of the boxed type. How expensive can these operations be? Not very, especially since the L1 cache hit rate in your test is ~100%.

Boxing involves allocating a new object and garbage-collecting it later. In your code, nearly all GC activity is probably triggered by these allocations.



However, your test is invalid because the loops have no side effects. It is probably just luck that the current JIT cannot optimize them away. Make each loop compute a result and pass it to GC.KeepAlive so that the result counts as used. Alternatively, you could run in Debug mode, where these optimizations are disabled, although the timings then no longer reflect optimized code.
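The advice about giving the loops a visible result can be sketched roughly as follows. This is only an illustration: the class name KeepAliveBenchmark and the methods BoxInLoop and UnboxInLoop are hypothetical names, not part of the question's code, and nothing here guarantees that a given JIT version will not still hoist or simplify part of the work.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper class, used for illustration only.
public static class KeepAliveBenchmark
{
    // Boxes the loop counter on every iteration and returns the last box,
    // so the loop produces a result that the caller can observe.
    public static object BoxInLoop(int iterations)
    {
        object boxed = null;
        for (int i = 0; i < iterations; i++)
        {
            boxed = i; // boxing: allocates a new object holding a copy of i
        }
        return boxed;
    }

    // Unboxes the same object on every iteration and accumulates the values,
    // so the loop produces a result that the caller can observe.
    public static long UnboxInLoop(object boxed, int iterations)
    {
        long sum = 0;
        for (int i = 0; i < iterations; i++)
        {
            sum += (int)boxed; // unboxing: type check plus a copy of the value
        }
        return sum;
    }

    public static void Main()
    {
        const int Iterations = 1000000;

        Stopwatch s = Stopwatch.StartNew();
        object lastBox = BoxInLoop(Iterations);
        s.Stop();
        Console.WriteLine("Boxing time : " + s.Elapsed.TotalMilliseconds);

        s = Stopwatch.StartNew();
        long sum = UnboxInLoop(lastBox, Iterations);
        s.Stop();
        Console.WriteLine("UnBoxing time : " + s.Elapsed.TotalMilliseconds);

        // Consume both results so that neither loop is dead code.
        GC.KeepAlive(lastBox);
        Console.WriteLine("Checksum : " + sum);
    }
}
```

Build in Release mode and run the measurement several times; the first run includes JIT compilation time.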

+5



People have already offered fantastic explanations for why unboxing is faster than boxing, but I want to say a little more about the methodology you used to test the performance difference.

Did you get your result (a 10x difference in speed) from the code you posted? If I run this program in release mode, here is the output:

Program started
Boxing time : 0.2741
UnBoxing time : 4.5847
Program ended

      

Whenever I do a performance test, I check that I am actually measuring the operation I intended to compare. The compiler can optimize your code. Open the executable file in ILDASM.

Here is the IL for UnBoxing (I've included only the important part):

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0 
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0025
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.2
IL_0014:  box        [mscorlib]System.Int32    //Here is the boxing
IL_0019:  stloc.3
IL_001a:  ldloc.3
IL_001b:  unbox.any  [mscorlib]System.Int32    //Here is the unboxing
IL_0020:  pop
IL_0021:  ldloc.1
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.1
IL_0025:  ldloc.1
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_0010
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

      

And this is the boxing code:

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0017
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.1
IL_0014:  ldc.i4.1
IL_0015:  add
IL_0016:  stloc.1
IL_0017:  ldloc.1
IL_0018:  ldc.i4     0xf4240
IL_001d:  blt.s      IL_0010
IL_001f:  ldloc.0
IL_0020:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

      

There is no boxing instruction whatsoever in the Boxing method. It has been completely removed by the compiler, so the Boxing method does nothing but run an empty loop. The time measured in UnBoxing is therefore the total of boxing and unboxing.

Micro-benchmarking is very vulnerable to compiler tricks. I would advise you to take a look at your IL. This may be different if you are using a different compiler.

I modified your test code a bit:

Boxing method:

private static object Boxing()
{
    Stopwatch s = new Stopwatch();

    int unboxed = 33;
    object boxed = null;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        boxed = unboxed;
    }

    s.Stop();

    var elapsed = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("Boxing time : " + elapsed);

    return boxed;
}

      

And the Unboxing method:

private static int Unboxing()
{
    Stopwatch s = new Stopwatch();

    object boxed = 33;
    int unboxed = 0;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        unboxed = (int)boxed;
    }

    s.Stop();

    var time = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("UnBoxing time : " + time);

    return unboxed;
}

      



This way, both methods translate to similar IL:

For the boxing method:

IL_000c:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0011:  ldc.i4.0
IL_0012:  stloc.3
IL_0013:  br.s       IL_0020
IL_0015:  ldloc.1
IL_0016:  box        [mscorlib]System.Int32  //Here is the boxing
IL_001b:  stloc.2
IL_001c:  ldloc.3
IL_001d:  ldc.i4.1
IL_001e:  add
IL_001f:  stloc.3
IL_0020:  ldloc.3
IL_0021:  ldc.i4     0xf4240
IL_0026:  blt.s      IL_0015
IL_0028:  ldloc.0
IL_0029:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

      

For UnBoxing:

IL_0011:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0016:  ldc.i4.0
IL_0017:  stloc.3
IL_0018:  br.s       IL_0025
IL_001a:  ldloc.1
IL_001b:  unbox.any  [mscorlib]System.Int32  //Here is the unboxing
IL_0020:  stloc.2
IL_0021:  ldloc.3
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.3
IL_0025:  ldloc.3
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_001a
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

      

Run each method several times to remove the cold-start effect:

static void Main(string[] args)
{
    Console.WriteLine("Program started");
    for (int i = 0; i < 10; i++)
    {
        Boxing();
        Unboxing();
    }
    Console.WriteLine("Program ended");
    Console.Read();
}

      

Here's the result:

Program started
Boxing time : 3.4814
UnBoxing time : 0.1712
Boxing time : 2.6294
...
Boxing time : 2.4842
UnBoxing time : 0.1712
Program ended

      

Does this prove that unboxing is 10x faster than boxing? Let's check the assembly code with WinDbg:

0:004> !u 000007fe93b83940
Normal JIT generated code
MicroBenchmarks.Program.Boxing()
...
000007fe`93ca01b3 call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
...
//This is the for loop
000007fe`93ca01c2 mov     eax,21h
000007fe`93ca01c7 mov     dword ptr [rsp+20h],eax
000007fe`93ca01cb lea     rdx,[rsp+20h]
000007fe`93ca01d0 lea     rcx,[mscorlib_ni+0x6e92b0 (000007fe`f18b92b0)]
//here is the boxing
000007fe`93ca01d7 call    clr!JIT_BoxFastMP_InlineGetThread (000007fe`f33126d0)   
000007fe`93ca01dc mov     rsi,rax
//loop unrolling. instead of increment i by 1, we are actually incrementing i by 4
000007fe`93ca01df add     edi,4                 
000007fe`93ca01e2 cmp     edi,0F4240h           // 0F4240h = 1000000
000007fe`93ca01e8 jl      000007fe`93ca01c2     // jumps to the line "mov eax,21h"
//end of the for loop
000007fe`93ca01ea mov     rcx,rbx
000007fe`93ca01ed call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

      

Assembly for UnBoxing:

0:004> !u 000007fe93b83930
Normal JIT generated code
MicroBenchmarks.Program.Unboxing()
Begin 000007fe93ca02c0, size 117
000007fe`93ca02c0 push    rbx
...
000007fe`93ca030a call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
000007fe`93ca030f mov     qword ptr [rbx+10h],rax
000007fe`93ca0313 mov     byte ptr [rbx+18h],1
000007fe`93ca0317 xor     eax,eax
000007fe`93ca0319 mov     edi,dword ptr [rdi+8]
000007fe`93ca031c nop     dword ptr [rax]
//This is the for loop
//again, loop unrolling
000007fe`93ca0320 add     eax,4
000007fe`93ca0323 cmp     eax,0F4240h    // 0F4240h = 1000000
000007fe`93ca0328 jl      000007fe`93ca0320  //jumps to "add eax,4"
//end of the for loop
000007fe`93ca032a mov     rcx,rbx
000007fe`93ca032d call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

      

You can see that even though the comparison seems reasonable at the IL level, the JIT can still perform further optimizations at run time. The UnBoxing method again runs an empty loop. Until you verify that the code actually executed for the two methods is comparable, you cannot simply conclude that "unboxing is 10 times faster than boxing".

+3



Because boxing involves objects and unboxing involves primitives. The whole purpose of primitives in an OOP language is to improve performance, so it should come as no surprise that they succeed.

+2



Boxing creates a new object on the heap, much like initializing an array:

int[] arr = {10, 20, 30};

      

Boxing provides a convenient syntax so that you don't need to use the new operator explicitly, but what actually happens is instantiation.

Unboxing is much cheaper: follow the reference to the box and retrieve the value.

Boxing has all the overhead of creating a reference type object on the heap.

Unboxing only has the overhead of an indirection.
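A small sketch can make the "boxing is instantiation" point visible. The class BoxIdentityDemo and its methods are hypothetical names introduced only for this illustration: boxing the same value twice produces two distinct heap objects, and each box holds its own copy of the value.

```csharp
using System;

// Hypothetical demo class, used for illustration only.
public static class BoxIdentityDemo
{
    // Boxes the same value twice and reports whether both
    // references point to the same heap object.
    public static bool SameBox(int value)
    {
        object first = value;  // box #1: new object on the heap
        object second = value; // box #2: another new object
        return object.ReferenceEquals(first, second);
    }

    // The box holds a copy of the value, so changing the original
    // variable afterwards does not change what is unboxed.
    public static int UnboxAfterChange()
    {
        int original = 33;
        object boxed = original; // copy 33 into the box
        original = 99;           // does not touch the box
        return (int)boxed;       // copy the value back out
    }

    public static void Main()
    {
        Console.WriteLine(SameBox(33));        // prints "False"
        Console.WriteLine(UnboxAfterChange()); // prints "33"
    }
}
```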

+2



Consider this: for boxing, you must allocate memory. For unboxing, you don't have to. Given that, unboxing is a trivial operation (especially in your case, where nothing is even done with the result).

+1



Boxing and unboxing are costly processes. When a value type is boxed, a completely new object must be created. This can take up to 20 times longer than a simple reference assignment. When unboxing, the casting process can take four times as long as an assignment.

+1



Why unboxing is 100 times faster than boxing

      

When you box a value type, a new object must be created and the value must be copied into it. When unboxing, only the value needs to be copied out of the boxed instance. Boxing therefore adds object creation. Object creation is, however, very fast in .NET, so the difference is probably not very big. Try to avoid boxing altogether, especially if you need maximum speed. Remember that boxing creates objects that need to be cleaned up by the garbage collector.
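One common way to avoid boxing is to use generic collections instead of the non-generic ones. The sketch below is illustrative only; the class AvoidBoxingDemo and its method names are hypothetical. ArrayList.Add takes an object, so every int added to it is boxed, while List<int> stores the values directly.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical demo class, used for illustration only.
public static class AvoidBoxingDemo
{
    // Non-generic collection: every int is boxed on Add and unboxed on read.
    public static long SumWithArrayList(int count)
    {
        var list = new ArrayList();
        for (int i = 0; i < count; i++)
        {
            list.Add(i); // Add(object): boxes i
        }
        long sum = 0;
        foreach (object item in list)
        {
            sum += (int)item; // unboxes each element
        }
        return sum;
    }

    // Generic collection: ints are stored directly, with no boxing.
    public static long SumWithGenericList(int count)
    {
        var list = new List<int>();
        for (int i = 0; i < count; i++)
        {
            list.Add(i); // Add(int): no box is created
        }
        long sum = 0;
        foreach (int item in list)
        {
            sum += item;
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumWithArrayList(1000));   // prints "499500"
        Console.WriteLine(SumWithGenericList(1000)); // prints "499500"
    }
}
```

Both methods compute the same sum; the difference is that the ArrayList version allocates one heap object per element, which the garbage collector later has to clean up.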

+1



One of the things that can slow down a program is having to move data in and out of memory. If you want speed, memory access should be avoided unless it is needed.

If you look at what boxing and unboxing do, you can see that the difference is that boxing allocates memory on the heap, while unboxing copies the value into a value-type variable on the stack. Stack access is faster than heap access, and therefore unboxing is faster in your case.

The stack is faster because its access pattern makes it trivial to allocate and deallocate memory (a pointer or integer is simply incremented or decremented), while the heap needs much more complex bookkeeping for each allocation or free. Also, every byte on the stack tends to be reused very frequently, which means it tends to be mapped to the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, generally has to be thread safe, i.e. each allocation and deallocation typically needs to be synchronized with "all" other heap accesses in the program.

I got this information from SwankyLegg's answer to: What and where is the stack and heap?

To see what difference boxing and unboxing make to memory (stack and heap), you can look here: http://msdn.microsoft.com/en-us/library/yz2be5wk.aspx

To keep things simple, use primitive types where you can and avoid referencing memory where you can. If you really want speed, you should look into caching, prefetching, and locking.

+1

