Why unboxing is 100 times faster than boxing
Why is there such a large speed difference between boxing and unboxing operations? The difference is about 10 times. When should we care about this? Last week, Azure support informed us that there is a problem with our application's heap memory, and I'm curious whether it might be related to boxing and unboxing.
using System;
using System.Diagnostics;

namespace ConsoleBoxing
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Program started");
            var elapsed = Boxing();
            Unboxing(elapsed);
            Console.WriteLine("Program ended");
            Console.Read();
        }

        private static void Unboxing(double boxingtime)
        {
            Stopwatch s = new Stopwatch();
            s.Start();
            for (int i = 0; i < 1000000; i++)
            {
                int a = 33;      // value lives on the stack
                object b = a;    // boxing: the value is copied into a heap object
                int c = (int)b;  // unboxing happens only here: heap value copied back to the stack
            }
            s.Stop();
            // Subtract the boxing time to estimate the unboxing cost alone
            var UnBoxing = s.Elapsed.TotalMilliseconds - boxingtime;
            Console.WriteLine("UnBoxing time : " + UnBoxing);
        }

        private static double Boxing()
        {
            Stopwatch s = new Stopwatch();
            s.Start();
            for (int i = 0; i < 1000000; i++)
            {
                int a = 33;
                object b = a;
            }
            s.Stop();
            var elapsed = s.Elapsed.TotalMilliseconds;
            Console.WriteLine("Boxing time : " + elapsed);
            return elapsed;
        }
    }
}
Think of unboxing as a single instruction that loads memory from an object into a register, perhaps with a small amount of surrounding address calculation and cast-validation logic. A boxed object is like a class with one field of the boxed type. How expensive can these operations be? Not very, especially since the L1 cache hit rate in your test is ~100%.
Boxing involves allocating a new object and GC'ing it later. In your code, GC probably fires on allocation 99% of the time.
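To make the "class with one field" picture concrete, here is a rough sketch. IntBox and BoxModel are hypothetical names invented for illustration; the real runtime does not use such a class, it only behaves similarly: boxing allocates and copies, unboxing checks the type and reads a field.

```csharp
using System;

// Hypothetical stand-in for a boxed int; not a real .NET type.
public sealed class IntBox
{
    public readonly int Value;                 // the single field holding the copied value
    public IntBox(int value) { Value = value; }
}

public static class BoxModel
{
    // "Boxing": allocate a heap object and copy the value into it.
    public static IntBox Box(int value) => new IntBox(value);

    // "Unboxing": a type check followed by a field read that copies the value out.
    public static int Unbox(object boxed)
    {
        if (boxed is IntBox box) return box.Value;
        throw new InvalidCastException("Object is not an IntBox");
    }

    public static void Main()
    {
        int a = 33;
        IntBox b = Box(a);      // allocation happens here
        int c = Unbox(b);       // only a check and a copy happen here
        Console.WriteLine(c);   // prints 33
    }
}
```

The allocation in Box is the part that later has to be collected; Unbox never allocates.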
However, your test is invalid because the loops have no side effects. It is probably just luck that the current JIT can't optimize them away. Either way, make the loop compute a result and feed it to GC.KeepAlive so that the result counts as used. Alternatively, you can run in debug mode.
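A minimal sketch of that idea, assuming the same 1,000,000 iterations as the question. KeepAliveBenchmark, BoxLoop and UnboxLoop are names made up for this example; only Stopwatch and GC.KeepAlive come from the framework.

```csharp
using System;
using System.Diagnostics;

public static class KeepAliveBenchmark
{
    const int Iterations = 1000000;

    // Boxes the loop counter on every iteration and returns the last box.
    public static object BoxLoop()
    {
        object last = null;
        for (int i = 0; i < Iterations; i++)
        {
            last = i;              // boxing: allocates a new object each time
        }
        return last;
    }

    // Unboxes the same object on every iteration and accumulates the values.
    public static long UnboxLoop(object boxed)
    {
        long sum = 0;
        for (int i = 0; i < Iterations; i++)
        {
            sum += (int)boxed;     // unboxing: type check plus copy
        }
        return sum;
    }

    public static void Main()
    {
        var sw = Stopwatch.StartNew();
        object lastBox = BoxLoop();
        sw.Stop();
        GC.KeepAlive(lastBox);     // the result is observably used
        Console.WriteLine("Boxing time   : " + sw.Elapsed.TotalMilliseconds);

        sw.Restart();
        long sum = UnboxLoop(33);
        sw.Stop();
        GC.KeepAlive(sum);         // boxes sum, but outside the timed region
        Console.WriteLine("Unboxing time : " + sw.Elapsed.TotalMilliseconds);
    }
}
```

Using the result only makes it harder for the JIT to discard the loop. It may still hoist the unboxing of a loop-invariant object out of the loop, so the generated code is still worth checking.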
Although people have already offered fantastic explanations for why unboxing is faster than boxing, I want to say a little more about the methodology you used to test the performance difference.
Did you get your result (a 10x difference in speed) from the code you posted? If I run this program in release mode, here is the output:
Program started
Boxing time : 0.2741
UnBoxing time : 4.5847
Program ended
Whenever I do a performance test, I check that I am actually comparing the operations that I intended to compare. The compiler may optimize your code, so open the executable file in ILDASM:
Here is the IL for UnBoxing: (I've only included the part that is important)
IL_0000: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005: stloc.0
IL_0006: ldloc.0
IL_0007: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c: ldc.i4.0
IL_000d: stloc.1
IL_000e: br.s IL_0025
IL_0010: ldc.i4.s 33
IL_0012: stloc.2
IL_0013: ldloc.2
IL_0014: box [mscorlib]System.Int32 //Here is the boxing
IL_0019: stloc.3
IL_001a: ldloc.3
IL_001b: unbox.any [mscorlib]System.Int32 //Here is the unboxing
IL_0020: pop
IL_0021: ldloc.1
IL_0022: ldc.i4.1
IL_0023: add
IL_0024: stloc.1
IL_0025: ldloc.1
IL_0026: ldc.i4 0xf4240
IL_002b: blt.s IL_0010
IL_002d: ldloc.0
IL_002e: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
And this is the boxing code:
IL_0000: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005: stloc.0
IL_0006: ldloc.0
IL_0007: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c: ldc.i4.0
IL_000d: stloc.1
IL_000e: br.s IL_0017
IL_0010: ldc.i4.s 33
IL_0012: stloc.2
IL_0013: ldloc.1
IL_0014: ldc.i4.1
IL_0015: add
IL_0016: stloc.1
IL_0017: ldloc.1
IL_0018: ldc.i4 0xf4240
IL_001d: blt.s IL_0010
IL_001f: ldloc.0
IL_0020: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
There is no boxing instruction whatsoever in the Boxing method; the compiler has removed it completely. The Boxing method does nothing but run a loop that never boxes anything, so the time measured in Unboxing is the total time for boxing and unboxing.
Micro-benchmarking is very vulnerable to compiler tricks. I would advise you to take a look at your IL. It may be different if you are using a different compiler.
I modified your test code a bit:
Boxing method:
private static object Boxing()
{
    Stopwatch s = new Stopwatch();
    int unboxed = 33;
    object boxed = null;
    s.Start();
    for (int i = 0; i < 1000000; i++)
    {
        boxed = unboxed;
    }
    s.Stop();
    var elapsed = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("Boxing time : " + elapsed);
    return boxed;
}
And the Unboxing method:
private static int Unboxing()
{
    Stopwatch s = new Stopwatch();
    object boxed = 33;
    int unboxed = 0;
    s.Start();
    for (int i = 0; i < 1000000; i++)
    {
        unboxed = (int)boxed;
    }
    s.Stop();
    var time = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("UnBoxing time : " + time);
    return unboxed;
}
Now the two methods translate to similar IL:
For the boxing method:
IL_000c: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0011: ldc.i4.0
IL_0012: stloc.3
IL_0013: br.s IL_0020
IL_0015: ldloc.1
IL_0016: box [mscorlib]System.Int32 //Here is the boxing
IL_001b: stloc.2
IL_001c: ldloc.3
IL_001d: ldc.i4.1
IL_001e: add
IL_001f: stloc.3
IL_0020: ldloc.3
IL_0021: ldc.i4 0xf4240
IL_0026: blt.s IL_0015
IL_0028: ldloc.0
IL_0029: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
For UnBoxing:
IL_0011: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0016: ldc.i4.0
IL_0017: stloc.3
IL_0018: br.s IL_0025
IL_001a: ldloc.1
IL_001b: unbox.any [mscorlib]System.Int32 //Here is the unboxing
IL_0020: stloc.2
IL_0021: ldloc.3
IL_0022: ldc.i4.1
IL_0023: add
IL_0024: stloc.3
IL_0025: ldloc.3
IL_0026: ldc.i4 0xf4240
IL_002b: blt.s IL_001a
IL_002d: ldloc.0
IL_002e: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
Run several loops to remove the cold start effect:
static void Main(string[] args)
{
    Console.WriteLine("Program started");
    for (int i = 0; i < 10; i++)
    {
        Boxing();
        Unboxing();
    }
    Console.WriteLine("Program ended");
    Console.Read();
}
Here's the result:
Program started
Boxing time : 3.4814
UnBoxing time : 0.1712
Boxing time : 2.6294
...
Boxing time : 2.4842
UnBoxing time : 0.1712
Program ended
Does this prove that unboxing is 10x faster than boxing? Let's check the assembly code with WinDbg:
0:004> !u 000007fe93b83940
Normal JIT generated code
MicroBenchmarks.Program.Boxing()
...
000007fe`93ca01b3 call System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
...
//This is the for loop
000007fe`93ca01c2 mov eax,21h
000007fe`93ca01c7 mov dword ptr [rsp+20h],eax
000007fe`93ca01cb lea rdx,[rsp+20h]
000007fe`93ca01d0 lea rcx,[mscorlib_ni+0x6e92b0 (000007fe`f18b92b0)]
//here is the boxing
000007fe`93ca01d7 call clr!JIT_BoxFastMP_InlineGetThread (000007fe`f33126d0)
000007fe`93ca01dc mov rsi,rax
//loop unrolling. instead of increment i by 1, we are actually incrementing i by 4
000007fe`93ca01df add edi,4
000007fe`93ca01e2 cmp edi,0F4240h // 0F4240h = 1000000
000007fe`93ca01e8 jl 000007fe`93ca01c2 // jumps to the line "mov eax,21h"
//end of the for loop
000007fe`93ca01ea mov rcx,rbx
000007fe`93ca01ed call System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)
Assembly for UnBoxing:
0:004> !u 000007fe93b83930
Normal JIT generated code
MicroBenchmarks.Program.Unboxing()
Begin 000007fe93ca02c0, size 117
000007fe`93ca02c0 push rbx
...
000007fe`93ca030a call System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
000007fe`93ca030f mov qword ptr [rbx+10h],rax
000007fe`93ca0313 mov byte ptr [rbx+18h],1
000007fe`93ca0317 xor eax,eax
000007fe`93ca0319 mov edi,dword ptr [rdi+8]
000007fe`93ca031c nop dword ptr [rax]
//This is the for loop
//again, loop unrolling
000007fe`93ca0320 add eax,4
000007fe`93ca0323 cmp eax,0F4240h // 0F4240h = 1000000
000007fe`93ca0328 jl 000007fe`93ca0320 //jumps to "add eax,4"
//end of the for loop
000007fe`93ca032a mov rcx,rbx
000007fe`93ca032d call System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)
You can see that even if the comparison seems reasonable at the IL level, the JIT can still apply further optimizations at runtime. The UnBoxing method is running an empty loop again. Until you verify that the code actually executed for the two methods is comparable, you cannot simply conclude that "unboxing is 10 times faster than boxing".
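One way to make it harder for the JIT to reduce the unboxing loop to nothing is to unbox a different object on every iteration and return the accumulated result. The following is only a sketch under that assumption: UnboxArrayBenchmark, PreBox and SumUnboxed are names invented here, and the measured time also includes the array reads, so it is an upper bound on the unboxing cost rather than a clean measurement.

```csharp
using System;
using System.Diagnostics;

public static class UnboxArrayBenchmark
{
    // Pre-boxes n distinct values so that the boxing cost stays outside the timed region.
    public static object[] PreBox(int n)
    {
        var boxes = new object[n];
        for (int i = 0; i < n; i++)
        {
            boxes[i] = i;          // boxing happens here, before timing starts
        }
        return boxes;
    }

    // Unboxes a different object on every iteration and returns the sum,
    // so the work has a result the caller can observe.
    public static long SumUnboxed(object[] boxes)
    {
        long sum = 0;
        for (int i = 0; i < boxes.Length; i++)
        {
            sum += (int)boxes[i];  // unboxing
        }
        return sum;
    }

    public static void Main()
    {
        object[] boxes = PreBox(1000000);
        SumUnboxed(boxes);         // warm-up call so JIT compilation is not timed

        var sw = Stopwatch.StartNew();
        long sum = SumUnboxed(boxes);
        sw.Stop();
        Console.WriteLine("Sum = " + sum + ", unboxing time : " + sw.Elapsed.TotalMilliseconds);
    }
}
```

As with any variant, the disassembly should be checked again before trusting the numbers.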
Boxing creates a new object on the heap. It is like initializing an array:
int[] arr = {10, 20, 30};
Boxing provides a convenient initialization syntax so you don't need to use the new operator explicitly, but what actually happens is instantiation.
Unboxing is much cheaper: follow the reference to the box and read the value.
Boxing has all the overhead of creating a reference type object on the heap.
Unboxing only has the overhead of an indirection.
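A small sketch that makes the hidden instantiation visible. BoxCopyDemo and BoxesAreDistinct are hypothetical names used only for this illustration.

```csharp
using System;

public static class BoxCopyDemo
{
    // Boxes the same value twice and reports whether the two boxes are separate objects.
    public static bool BoxesAreDistinct(int value)
    {
        object first = value;      // first allocation
        object second = value;     // second, independent allocation
        return !object.ReferenceEquals(first, second);
    }

    public static void Main()
    {
        int a = 10;
        object boxed = a;          // a copy of a now lives in a new heap object
        a = 20;                    // changing the original does not touch the box

        Console.WriteLine((int)boxed);            // prints 10
        Console.WriteLine(BoxesAreDistinct(10));  // prints True
    }
}
```

Each assignment of an int to an object variable produces its own object, which is why a loop that boxes a million times also allocates a million times.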
Why unboxing is 100 times faster than boxing
When you box a value type, a new object must be created and the value must be copied into the new object. When unboxing a boxed instance, only the value needs to be copied. So boxing adds the cost of object creation. This is, however, very fast in .NET, so the difference is probably not very big. Try to avoid boxing altogether if you need maximum speed, and remember that boxing creates objects that have to be cleaned up by the garbage collector.
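One common way to avoid boxing entirely is to use generic collections instead of the non-generic ones. A rough sketch, with AvoidBoxingDemo and its method names invented for this example:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class AvoidBoxingDemo
{
    // Non-generic collection: every Add boxes the int and every read unboxes it.
    public static long SumArrayList(int n)
    {
        var list = new ArrayList();
        for (int i = 0; i < n; i++)
        {
            list.Add(i);           // boxing: Add takes an object
        }
        long sum = 0;
        foreach (object item in list)
        {
            sum += (int)item;      // unboxing
        }
        return sum;
    }

    // Generic collection: the ints are stored directly in the backing array, so nothing is boxed.
    public static long SumGenericList(int n)
    {
        var list = new List<int>();
        for (int i = 0; i < n; i++)
        {
            list.Add(i);
        }
        long sum = 0;
        foreach (int item in list)
        {
            sum += item;
        }
        return sum;
    }

    public static void Main()
    {
        // Both versions compute the same result; only the allocation behaviour differs.
        Console.WriteLine(SumArrayList(1000) == SumGenericList(1000)); // prints True
    }
}
```

The generic version produces no per-element garbage, which is the part that matters if heap pressure is the actual problem.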
One of the things that can slow down a program is having to move data in and out of memory. If you want speed, avoid memory access unless it is needed.
If you look at what boxing and unboxing do, you can see that the difference is that boxing allocates memory on the heap, while unboxing copies a value-type variable onto the stack. Stack access is faster than heap access, and therefore unboxing is faster in your case.
The stack is faster because its access pattern makes it trivial to allocate and deallocate memory (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping for each allocation or free. Also, each byte on the stack tends to be reused very often, which means it tends to stay mapped in the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be thread safe, i.e. each allocation and deallocation typically needs to be synchronized with "all" other heap accesses in the program.
I got this information here from SwankyLegg: What and where is the stack and heap?
To see what difference boxing and unboxing make in memory (stack and heap), you can look here: http://msdn.microsoft.com/en-us/library/yz2be5wk.aspx
To keep things simple, use primitive types where you can and avoid referencing memory where you can. If you really want speed, you should look into caching, prefetching, locking.