Why are floating point operations much faster with a warm-up phase?

I originally wanted to test something else about floating point performance in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without a warm-up, but faster with one, by a factor of about 1.5).

After examining the results, I realized I had forgotten to add a warm-up phase, as is often suggested when benchmarking, so I added one. And to my complete surprise, the operations turned out to be about 25 times faster, averaged over several test runs.

I tested it with the following code:

import java.util.Random;

public static void main(String[] args)
{
    float[] test = new float[10000];
    float[] test_copy;

    // warm-up: exercise both methods so the JIT compiler kicks in
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);

        test_copy = test.clone();

        divideByFive(test);
        multiplyWithPointTwo(test_copy);
    }

    long divisionTime = 0L;
    long multiplicationTime = 0L;

    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);

        test_copy = test.clone();

        divisionTime += divideByFive(test);
        multiplicationTime += multiplyWithPointTwo(test_copy);
    }

    System.out.println("Divide by 5.0f: " + divisionTime);
    System.out.println("Multiply with 0.2f: " + multiplicationTime);
}

public static long divideByFive(float[] data)
{
    long before = System.nanoTime();

    // index-based loop so the result is written back into the array
    // (a for-each loop would only overwrite the loop variable)
    for (int i = 0; i < data.length; i++)
    {
        data[i] /= 5.0f;
    }

    return System.nanoTime() - before;
}

public static long multiplyWithPointTwo(float[] data)
{
    long before = System.nanoTime();

    for (int i = 0; i < data.length; i++)
    {
        data[i] *= 0.2f;
    }

    return System.nanoTime() - before;
}

public static void fillRandom(float[] data)
{
    Random random = new Random();

    for (int i = 0; i < data.length; i++)
    {
        data[i] = random.nextInt() * random.nextFloat();
    }
}


Results without warm-up phase:

Divide by 5.0f: 382224
Multiply with 0.2f: 490765


Results with a warm-up phase:

Divide by 5.0f: 22081
Multiply with 0.2f: 10885


Another interesting change that I cannot explain is the flip in which operation is faster (division versus multiplication). As mentioned earlier, without a warm-up the division seems to be a little faster, while with a warm-up it is about twice as slow.

I tried setting the values to something random in an initialization block, and I also tried adding multiple warm-up phases, but neither affected the results. The numbers the methods operate on are the same either way, so that cannot be the reason.

What is the reason for this behavior? What does the warm-up phase do, how does it affect performance, why are the operations so much faster with it, and why does the faster operation flip between the two runs?

+3




2 answers


Before the warm-up, Java runs the bytecodes through an interpreter; think of it as a program written to execute Java bytecodes in Java. After the warm-up, Hotspot generates native assembly for the processor you are running on, using that processor's instruction set. There is a significant performance difference between the two: the interpreter executes many CPU instructions for a single bytecode, whereas Hotspot generates native assembly code, much like gcc does when compiling C code. Once that happens, the difference between divide and multiply comes down to the processor it is running on, where each is just one CPU instruction.
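If you want to watch that handover happen, HotSpot's -XX:+PrintCompilation flag prints a line every time a method is compiled. Assuming the test code above lives in a class called Main (a name made up for this example), run:

java -XX:+PrintCompilation Main

and you should see divideByFive and multiplyWithPointTwo appear in the log partway through the warm-up loop, at the moment the interpreter hands over to compiled code.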

The second piece of the puzzle is that Hotspot also records statistics about how your code behaves at runtime; when it decides to optimize the code, it uses those statistics to perform optimizations that are not necessarily possible at compile time. For example, it can reduce the cost of null checks, branch mispredictions, and polymorphic method calls.

In short, results measured before the warm-up should be discarded.

Brian Goetz wrote a very good article on the subject.

========

APPENDIX: An overview of what "JVM warm-up" means



The JVM "warm-up" is a free phrase and is no longer strictly one or JVM stage. People tend to use it to refer to an idea of ​​where JVM performance stabilizes after compiling JVM bytecodes into native bytecodes. Truth be told, when someone starts scratching beneath the surface and digs deeper into the internals of the JVM, it's hard not to be surprised at how much Hotspot is doing for us. My goal here is just to give you a better idea of ​​what Hotspot can do in the name of productivity, for more details I recommend reading articles by Brian Goetz, Doug Lee, John Rose, Cliff-Click and Gil Tene (among many others ).

As mentioned, the JVM starts out by running your Java through its interpreter. While not strictly 100% accurate, you can think of the interpreter as a big switch statement inside a loop that iterates over every JVM bytecode (instruction). Each case in the switch statement handles one JVM bytecode, such as adding two values together, calling a method, calling a constructor, and so on. The overhead of the looping and the jumping between instructions is very large. Executing a single bytecode will typically use more than 10x as many assembly instructions, which means more than 10x slower, as the hardware has to execute so many extra instructions and the caches get polluted by this interpreter code that we would ideally prefer to spend on our actual program. Think back to the early days of Java, when Java earned a reputation for being very slow; this is because it was originally a fully interpreted language.
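To make the overhead concrete, here is a deliberately tiny stack-machine interpreter written in the same switch-in-a-loop style. Everything about it (the opcodes, the program encoding) is invented for illustration, and the real JVM interpreter is far more sophisticated, but the dispatch cost per instruction is the point:

public class ToyInterpreter
{
    // invented opcodes, purely for illustration
    static final int PUSH = 0, ADD = 1, MUL = 2, PRINT = 3, HALT = 4;

    public static void run(int[] code)
    {
        int[] stack = new int[16];
        int sp = 0;  // stack pointer
        int pc = 0;  // program counter

        // every simulated instruction pays for the array load, the
        // switch dispatch and the stack bookkeeping below
        while (true)
        {
            switch (code[pc++])
            {
                case PUSH:  stack[sp++] = code[pc++];             break;
                case ADD:   stack[sp - 2] += stack[sp - 1]; sp--; break;
                case MUL:   stack[sp - 2] *= stack[sp - 1]; sp--; break;
                case PRINT: System.out.println(stack[sp - 1]);    break;
                case HALT:  return;
            }
        }
    }

    public static void main(String[] args)
    {
        // (2 + 3) * 4 => prints 20
        run(new int[] { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT });
    }
}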

Later, JIT compilers were added to Java; these compilers compile Java methods to native processor instructions just before the methods are called. This removes all the interpreter overhead and lets the code execute at the hardware level. Although execution on the hardware is much faster, the extra compilation created a startup lag for Java, and this is partly where the term "warm-up phase" comes from.

The introduction of Hotspot to the JVM was a game changer. The JVM now starts up fast because it begins running Java programs with its interpreter, while individual Java methods are compiled in a background thread and swapped in on the fly at runtime. Native code generation can also be done at different levels of optimization, sometimes using very aggressive optimizations that are, strictly speaking, incorrect, followed by de-optimization and re-optimization on the fly when necessary to ensure correct behavior. For example, class hierarchies make it costly to figure out which method will actually be called, since Hotspot has to search the hierarchy and locate the target method. Hotspot can get very smart here: if it notices that only one class has been loaded, it can assume that will always be the case and optimize and inline methods accordingly. Should another class be loaded that tells Hotspot there is now actually a choice between two methods, it will remove its previous assumptions and recompile on the fly. The full list of optimizations that can be made under different circumstances is very impressive and constantly changing. Hotspot's ability to record information and statistics about the environment it is running in, and the workload it is currently experiencing, makes its optimizations very flexible and dynamic. In fact, it is quite possible that over the lifetime of a single Java process the code for that program will be regenerated many times as the nature of its workload changes. This arguably gives Hotspot a big advantage over more traditional static compilation, and is in large part why a lot of Java code can be considered just as fast as written C code. It also makes understanding micro-benchmarks much harder; in fact, it makes the JVM code itself much harder for the developers at Oracle to understand, work with, and diagnose problems in. Take a minute to raise a pint to those guys: Hotspot and the JVM as a whole are a fantastic engineering triumph that rose to the fore at a time when people said it could not be done. It is worth remembering that, because ten years on it is a rather complicated beast ;)
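As a sketch of that monomorphic-call optimization (the class names here are invented for illustration):

interface Shape
{
    double area();
}

class Circle implements Shape
{
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

class Shapes
{
    // While Circle is the only Shape implementation that has been loaded,
    // Hotspot can assume s.area() always means Circle.area() and inline
    // it here, eliminating the virtual dispatch entirely.
    static double totalArea(Shape[] shapes)
    {
        double sum = 0.0;
        for (Shape s : shapes)
        {
            sum += s.area();
        }
        return sum;
    }
}

If a second implementation of Shape is loaded later, that assumption becomes invalid: Hotspot de-optimizes the compiled totalArea and recompiles it with a real virtual call.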

So, with that context in mind, in summary: we warm up the JVM in micro-benchmarks by running the target code more than 10k times and throwing away the results, giving the JVM a chance to collect statistics and optimize the hot regions of the code. 10k is a magic number because the Hotspot server implementation waits for that many method invocations or loop iterations before it starts considering optimizations. I would also advise having method calls between the main test runs, because although Hotspot can do on-stack replacement (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole method implementations.
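A minimal sketch of that advice; benchmarkOnce() is a hypothetical stand-in for one run of the code under test:

static float[] data = new float[10000];

// hypothetical stand-in for one run of the code under test
static void benchmarkOnce()
{
    for (int i = 0; i < data.length; i++)
    {
        data[i] /= 5.0f;
    }
}

static void runBenchmark()
{
    // warm-up: invoke the method itself well past the ~10k invocation
    // threshold, so the whole method gets compiled rather than only
    // OSR-compiled from inside one long-running loop
    for (int i = 0; i < 20000; i++)
    {
        benchmarkOnce();
    }

    // measured runs: by now the compiled version should be installed
    long before = System.nanoTime();
    for (int i = 0; i < 20000; i++)
    {
        benchmarkOnce();
    }
    System.out.println("ns: " + (System.nanoTime() - before));
}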

+12




Without the warm-up phase you are not measuring anything useful: you are measuring the speed of the interpreted code, mixed with the time it takes for the on-stack replacement version to be generated. Maybe the divisions cause compilation to kick in earlier.



There are many tutorials on, and ready-made harnesses for, writing micro-benchmarks that do not suffer from these problems. I would advise you to read the guidelines and use a ready-made harness (for example, JMH) if you intend to keep doing this kind of thing.
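For illustration, here is a minimal sketch of the same division-versus-multiplication comparison written for JMH; it assumes the jmh-core and JMH annotation-processor dependencies are on the build path, and the harness takes care of warm-up and measurement iterations itself:

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class FloatOpsBenchmark
{
    float[] data;

    @Setup
    public void setup()
    {
        // fixed seed so both benchmarks see identical data
        Random random = new Random(42);
        data = new float[10000];
        for (int i = 0; i < data.length; i++)
        {
            data[i] = random.nextInt() * random.nextFloat();
        }
    }

    @Benchmark
    public float divideByFive()
    {
        float sum = 0f;
        for (float f : data)
        {
            sum += f / 5.0f;  // returning the sum keeps the work from being optimized away
        }
        return sum;
    }

    @Benchmark
    public float multiplyWithPointTwo()
    {
        float sum = 0f;
        for (float f : data)
        {
            sum += f * 0.2f;
        }
        return sum;
    }
}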

+4








