FixedThreadPool is not parallel enough

I am creating a fixed thread pool using forPool = Executors.newFixedThreadPool(poolSize);

where poolSize is initialized with the number of CPU cores (say 4). On some runs it works fine and CPU load is constantly at 400%.

But sometimes usage drops to 100% and never returns to 400%. I have 1000 tasks scheduled, so that is not the problem. I catch every exception, but no exception is thrown. The problem is random and not reproducible, but very much present. These are parallel data operations. At the end of each operation there is a synchronized access to update a single variable. It is unlikely that I have a deadlock there. In fact, as soon as I see this problem, if I destroy the pool and create a new one of size 4, it is still only 100% utilized. There is no I/O.
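To make the setup concrete, here is a minimal sketch of the pattern just described; the class and method names and the work inside doDataOperation are hypothetical, and this is not the actual code linked below:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDriver {
    // Shared result, updated under a lock at the end of each task (hypothetical).
    private static double result;
    private static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        int poolSize = Runtime.getRuntime().availableProcessors(); // e.g. 4
        ExecutorService forPool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < 1000; i++) {
            forPool.submit(() -> {
                double partial = doDataOperation();   // CPU-bound work, no I/O
                synchronized (lock) {
                    result += partial;                // brief synchronized update
                }
            });
        }
        forPool.shutdown();
        forPool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("result = " + result);
    }

    private static double doDataOperation() {
        double x = 0;
        for (int i = 0; i < 10_000_000; i++) {
            x += Math.sqrt(i);
        }
        return x;
    }
}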

This seems counter-intuitive to what Java's FixedThreadPool is supposed to provide. Am I reading the guarantee incorrectly? Is only concurrency guaranteed, and not parallelism?

And to the question: have you encountered this problem and solved it? If I want parallelism, am I doing the right thing?

Thanks!

Update with a thread dump: I find that there are 4 threads, all doing their parallel operations. But CPU usage is still ~100%. Here are thread dumps at 400% utilization and at 100% utilization. I set the number of threads to 16 for that run. It works at 400% for a while and then drops to 100%. When I use 4 threads, it works at 400% and rarely drops to 100%. This is the parallelization code.

****** [MAIN UPDATE] ******

It turns out that if I give the JVM a huge amount of memory to play with, the problem goes away and performance is not degraded. But I don't know how to use this information to solve the problem. Help!

+3




8 answers


Given the fact that increasing your heap size makes the problem go away (though perhaps not permanently), the problem is probably GC related.

Is it possible that the Operation implementation accumulates some state that is kept on the heap between calls to

pOperation.perform(...);

      



? If so, then you may have a memory usage problem, possibly a leak. As more tasks complete, more data accumulates on the heap. The garbage collector has to work harder and harder to reclaim as much as it can, gradually consuming 75% of your available CPU resources. Even destroying the ThreadPool won't help, because that is not where the references are held; they are held by the Operation.

The 16-thread case might hit this problem sooner because it generates more state faster (without seeing how the Operation is implemented, it's hard to tell).

And increasing the heap size while keeping the workload the same makes the problem go away because you have more room for all this state.
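To illustrate the kind of state accumulation I mean, here is a deliberately leaky sketch; the class name, the results list, and the work inside perform are all made up, not the asker's actual Operation:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical Operation that accidentally retains state between calls.
public class LeakyOperation {
    // Grows with every task and is never cleared, so completed tasks keep
    // feeding the heap and the GC has ever more live data to trace.
    private static final List<double[]> history =
            Collections.synchronizedList(new ArrayList<double[]>());

    public double perform(double[] input) {
        double[] intermediate = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            intermediate[i] = input[i] * input[i];
        }
        history.add(intermediate); // the leak: the reference outlives the task
        double sum = 0;
        for (double v : intermediate) {
            sum += v;
        }
        return sum;
    }
}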

+5




I suggest you use Yourkit's thread analysis feature to understand the real behavior. It will tell you exactly which threads are running, blocked, or waiting, and why.



If you can't or don't want to purchase it, your best option is to use VisualVM, which is bundled with the JDK, to do this analysis. It won't give you information as detailed as Yourkit's. The following blog post can help you get started with VisualVM: http://marxsoftware.blogspot.in/2009/06/thread-analysis-with-visualvm.html
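If you want a quick programmatic check in the meantime, the standard java.lang.management ThreadMXBean API can report every thread's state and blocked count without any external tool; a minimal sketch (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadStateDump {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Dump every live thread, including lock owner information.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            System.out.printf("%s: %s (blocked %d times)%n",
                    info.getThreadName(),
                    info.getThreadState(),
                    info.getBlockedCount());
        }
    }
}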

+2




My answer is based on a combination of knowledge about JVM memory management and some guesses about facts for which I could not find accurate information. I believe your problem is with the Thread Local Allocation Buffers (TLABs) that Java uses:

A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread-local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.

Let's say you have an Eden size of 2M and use 4 threads: the JVM can choose a TLAB size of (eden / 64) = 32K, and each thread gets a TLAB of that size. Once a thread's 32K TLAB is exhausted, it needs to acquire a new one, which requires global synchronization. Global synchronization is also required to allocate objects larger than the TLAB.

But to be honest, things are not as simple as I described: the JVM adaptively sizes a thread's TLAB based on its estimated allocation rate, determined at minor GCs [1], which makes TLAB-related behavior even less predictable. However, I can imagine that the JVM scales TLAB sizes down when more threads are running. This seems to make sense, since the sum of all TLABs must be less than the available Eden space (and in practice only a fraction of the Eden space, so that TLABs can be refilled).

Let's assume a fixed TLAB size per thread of (eden size / (16 * number of threads)):

  • for 4 threads this results in 32K TLABs
  • for 16 threads this results in 8K TLABs

You can imagine that 16 threads, which exhaust their TLABs faster because the TLABs are smaller, cause much more lock contention in the TLAB allocator than 4 threads with 32K TLABs.

In conclusion, when you decrease the number of worker threads or increase the memory available to the JVM, more TLAB space can be given to each thread and the problem is resolved.

https://blogs.oracle.com/daviddetlefs/entry/tlab_sizing_an_annoying_little
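If you want to test this hypothesis, HotSpot has flags for observing and pinning TLAB behavior; the exact names below are from memory, so verify them against your JVM version:

-XX:+PrintTLAB -XX:TLABSize=512k -XX:-ResizeTLAB

-XX:+PrintTLAB prints per-thread TLAB statistics at each minor GC, -XX:TLABSize sets an explicit initial TLAB size, and -XX:-ResizeTLAB disables adaptive resizing, which lets you compare the 4-thread and 16-thread runs under identical TLAB sizes.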

+2




This is almost certainly GC related.

If you want to be sure, add the following launch flags to your Java program:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

and check the standard output.

You will see lines containing "Full GC", including the time they took: during that time you will see 100% CPU usage.

The default garbage collector on multiprocessor or multicore machines is the throughput collector, which collects the young generation in parallel but uses serial (single-threaded) collection for the old generation.

So what is probably happening is that in your 100% CPU case an old-generation GC is running; it runs on a single thread and therefore keeps only one core busy.

Suggested solution: use the concurrent mark-and-sweep collector by passing the flag -XX:+UseConcMarkSweepGC when starting the JVM.
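As a cross-check, GC counts and accumulated GC time can also be read from inside the process through the standard GarbageCollectorMXBean API; a minimal sketch (polling it in a loop is just one way to use it):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMonitor {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Collection count and cumulative time (ms) since JVM start.
                System.out.printf("%s: count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(5000); // poll every 5 seconds
        }
    }
}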

+1




Configure JVM

The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside the JVM. The JVM takes many startup parameters as command-line flags, and some of them matter a great deal for application performance. So let's look at some of the important JVM options for server applications.

First, you should allocate as much memory as possible to the JVM using the -Xms (minimum memory) and -Xmx (maximum memory) flags. For example, the flags -Xms1g -Xmx1g allocate 1 GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM limits heap memory to 64 MB (512 MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent user sessions and to cache more data to mitigate slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM does not need to resize the heap dynamically at runtime, which is a major cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system in order to use all the RAM on the server; otherwise, the JVM will only be able to use 2 GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0.
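As a quick sanity check that the -Xms/-Xmx settings actually took effect, you can print the heap limits and core count from inside the application using the standard Runtime API (a small illustrative snippet, not from the article):

public class JvmSettingsCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; totalMemory() is the currently committed heap.
        System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("committed:  %d MB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free:       %d MB%n", rt.freeMemory() / (1024 * 1024));
        System.out.printf("CPU cores:  %d%n", rt.availableProcessors());
    }
}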

With a large heap, garbage collection (GC) can become a major performance bottleneck. It can take more than ten seconds for the GC to go through a gigabyte-sized heap. In JDK 1.3 and earlier, GC is a single-threaded operation that stops all other tasks in the JVM. This not only causes long and unpredictable pauses in the application, but also results in very poor performance on multiprocessor computers, as all the other processors have to sit idle while one processor runs at 100% to free up heap space. It is essential to choose a JDK 1.4+ JVM that supports parallel and concurrent GC operations. In fact, the parallel GC implementation in the JDK 1.4 series of JVMs is not very stable. Therefore, we strongly recommend that you upgrade to JDK 5.0. Using command-line flags, you can choose from the following GC algorithms; both are optimized for multiprocessor computers.

  • If your priority is to maximize overall application throughput and you can tolerate occasional GC pauses, you should use -XX:+UseParallelGC and -XX:+UseParallelOldGC (the latter only available in JDK 5.0) to enable parallel GC. Parallel GC uses all available CPUs to perform the GC operation and is therefore much faster than the default single-threaded GC. It still stops all other activity in the JVM during GC, however.
  • If you need to minimize GC pauses, you can use the -XX:+UseConcMarkSweepGC flag to enable concurrent GC. Concurrent GC still stops the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with the other JVM threads. Concurrent GC drastically reduces GC pauses, but managing the background thread adds overhead and reduces overall throughput.

In addition, there are several JVM parameters you can tune to optimize GC operations.

  • On 64-bit systems, each thread's call stack is allocated 1 MB of memory. Most threads do not use that much space. Using -XX:ThreadStackSize=256k you can reduce the stack size to 256 KB to allow for more threads.
  • Use the -XX:+DisableExplicitGC flag to ignore explicit System.gc() calls from the application. If the application calls this method frequently, it could trigger a lot of unnecessary GC work.
  • The -Xmn flag allows you to set the size of the "young generation" (for short-lived objects) manually. If your application generates many new objects, you can significantly improve GC by increasing this value. The size of the young generation should almost never exceed 50% of the heap.

Since GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your particular server and application. Discussing GC algorithms and tuning tips in detail is beyond the scope of this article, but we would like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize the GC algorithm parameters based on the underlying hardware, the application itself, and user-specified goals (such as maximum pause time and desired throughput). This saves you the time of trying different combinations of GC parameters yourself. Ergonomics is another good reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In the next section, we will discuss several ways to diagnose GC problems in the JVM.

Finally, make sure you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags that we have not discussed; for more information, check the JVM options documentation page.

Link: http://onjava.com/onjava/2006/11/01/scaling-enterprise-java-on-64-bit-multi-core.html

+1




A total CPU utilization of 100% suggests that you have effectively written single-threaded code; that is, you may have any number of concurrent tasks, but due to locking, only one can run at a time.

If you have a lot of IO you can get less than 400%, but then you are unlikely to get such a round CPU usage figure; for example, you might see 38%, 259%, 72%, 9%, etc. (and probably jumping around).

A common problem is locking shared data too often. You need to think about how the code can be rewritten so that the lock is held for the shortest possible period and for the smallest portion of the overall work. Ideally, you want to avoid locking altogether.
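For example (an illustrative sketch, not the asker's code), compute outside the lock and only touch the shared total inside a tiny synchronized block, or replace the lock entirely with an atomic accumulator:

import java.util.concurrent.atomic.AtomicLong;

public class Aggregator {
    private final Object lock = new Object();
    private long total;                        // guarded by lock
    private final AtomicLong atomicTotal = new AtomicLong();

    // Bad: holds the lock for the whole computation, serializing the workers.
    public void addSlow(long[] data) {
        synchronized (lock) {
            long sum = 0;
            for (long v : data) {
                sum += v;
            }
            total += sum;
        }
    }

    // Better: compute outside the lock, hold it only for the single update.
    public void addFast(long[] data) {
        long sum = 0;
        for (long v : data) {
            sum += v;
        }
        synchronized (lock) {
            total += sum;
        }
    }

    // Lock-free alternative for a simple running total.
    public void addAtomic(long[] data) {
        long sum = 0;
        for (long v : data) {
            sum += v;
        }
        atomicTotal.addAndGet(sum);
    }
}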

Using multiple threads means you can use more processors, but if your code prevents that, you are probably better off (i.e. faster) writing the code single-threaded, since that avoids the overhead of locking.

0




Since you are using a lock, it is possible that one of your four threads acquires the lock but is then context-switched out, perhaps so that a GC thread can run. The other threads cannot make progress because they cannot acquire the lock. When the thread is switched back in, it finishes the critical section and releases the lock, allowing only one other thread to acquire it. So now you have two threads active. It is possible that while the second thread executes the critical section, the first thread does the next piece of data-parallel work, but generates enough garbage to trigger the GC, and we are back where we started :)

P.S. This is just a best guess, as it's hard to figure out what's going on without some code snippets.

0




Increasing the Java heap size usually increases throughput until the heap no longer fits in physical memory. Once the heap size exceeds physical memory, the heap starts to swap to disk, resulting in a dramatic drop in Java performance. It is therefore important to set the maximum heap size to a value that allows the heap to stay within physical memory.

Since you are giving the JVM ~90% of the physical memory on the machine, the problem could be I/O happening due to paging and swapping when the JVM tries to allocate memory for more objects. Note that physical memory is also used by other running processes as well as the OS. In addition, the fact that the symptoms appear only after a while also points to a memory leak.

Try to find out how much physical memory is actually available (i.e. not already in use) and allocate ~90% of that available physical memory to the JVM heap.
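One way to read the physical-memory numbers from Java itself is the com.sun.management extension of OperatingSystemMXBean; this is HotSpot/JDK-specific, so treat the sketch below as an assumption to verify on your JVM:

import java.lang.management.ManagementFactory;

public class PhysicalMemoryCheck {
    public static void main(String[] args) {
        // JDK-specific extension of the standard OperatingSystemMXBean.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        long totalMb = os.getTotalPhysicalMemorySize() / (1024 * 1024);
        long freeMb = os.getFreePhysicalMemorySize() / (1024 * 1024);
        System.out.printf("total physical: %d MB, free: %d MB%n", totalMb, freeMb);
        System.out.printf("suggested -Xmx: about %d MB (~90%% of free)%n",
                (long) (freeMb * 0.9));
    }
}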

  • What happens if you leave the system alone for a long period of time?
  • Does it ever get back to 400% CPU usage?
  • Do you notice disk activity when the CPU is at 100% utilization?
  • Can you monitor which threads are running, which are blocked, and when?

Take a look at the following link for tuning tips: http://java.sun.com/performance/reference/whitepapers/tuning.html#section4

0








