Parallelism behavior of stream processing engines

I am studying Storm and Samza to understand how stream processing engines work, and I realized that they are both standalone applications: in order to handle an event, I have to add it to a queue that the processing engine is also connected to. That means I need to put the event on a queue (the queue is also a standalone application, say Kafka), and Storm will fetch the event from the queue and process it in a worker process. And if I have multiple bolts, each bolt may be executed by a different worker. (This is one of the things I don't really understand; I have seen a company that uses more than 20 bolts in production, and each event travels between the bolts along a specific path.)
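To make sure I have the flow right, here is a minimal sketch of the kind of topology I mean, written against the Storm 2.x Java API. `EventSpout` and `CountBolt` are made-up placeholders: the spout just generates synthetic events where a real deployment would consume from Kafka.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class EventTopology {

    // Placeholder spout: stands in for a real queue consumer
    // (e.g., a Kafka spout) and just emits synthetic events.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100); // throttle the synthetic stream
            collector.emit(new Values("user-" + (System.nanoTime() % 3)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("userId"));
        }
    }

    // Placeholder bolt: one processing step; a real topology would
    // chain many of these along a defined path.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            counts.merge(tuple.getStringByField("userId"), 1L, Long::sum);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 1);
        // fieldsGrouping routes every tuple with the same userId to the
        // same bolt task; groupings like this define the path an event
        // takes through the topology.
        builder.setBolt("count", new CountBolt(), 2)
               .fieldsGrouping("events", new Fields("userId"));

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("event-topology", new Config(),
                                   builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```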

However, I don't understand why I need such complex systems. The processing involves too many I/O operations (my program -> queue -> Storm -> bolts), which makes it much harder to monitor and debug.

Instead, if I'm collecting data from web servers, why not just handle the events on the same nodes? The work will already be distributed across nodes by the load balancers I use for the web servers. I can create executors on the same JVM instances and dispatch events asynchronously from the web server to an executor without any additional I/O requests. I can also watch the executors on the web servers and make sure each event is handled (at least once, or exactly once). This way it will be much easier to manage my application, and since it doesn't require much I/O it will be faster than the alternative, which involves sending data to another node over the (not entirely reliable) network and processing it there.
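For comparison, this is roughly the in-process alternative I have in mind, as a minimal sketch using a plain JDK thread pool; the event type and the processing logic are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of handling events on the web server's own JVM: request
// threads hand events to a pool of executor threads and return
// immediately, with no extra network hop.
public class InProcessEventHandler {

    private final ExecutorService executor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Called from the request-handling thread.
    public void submitEvent(String event) {
        executor.submit(() -> process(event));
    }

    private void process(String event) {
        // Placeholder for the real event-handling logic.
        System.out.println("processed " + event);
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```

(In this sketch the executor's work queue lives only in process memory, so I would still need to handle server crashes to really guarantee at-least-once delivery.)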

Chances are I'm missing something, because I know that many companies actively use Storm, and many people I know recommend Storm or other stream processing engines for real-time event handling, but I just don't get it.

+3




2 answers


I understand that the purpose of using a framework like Storm is to offload heavy processing (whether it's CPU bound, I/O bound, or both) from the application/web servers and keep them responsive.

Please note that each application server may have to serve a large number of concurrent requests, and not all of them involve stream processing. If an application server is already handling a significant event-processing load, it can become a bottleneck for the lighter requests, since server resources (e.g., CPU, memory, disk I/O) will already be tied up by the heavier processing requests.



If the actual load you need to handle isn't that heavy, or if it can simply be absorbed by adding application server instances, then it certainly doesn't make sense to complicate your architecture/topology, which could actually slow the whole thing down. It depends on your performance and load requirements, and on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help you decide where to go.

+2




You are correct in thinking that sending data over the network adds to the total processing time. However, these frameworks (Storm, Spark, Samza, Flink) were created to handle volumes of data that potentially do not fit in the memory of a single machine. By using more than one machine to process the data, we can achieve parallelism. And, regarding your question about network latency: yes, it is a trade-off to consider. The developer should be aware that they are writing programs for deployment in a parallel environment, and the way they build the application will also affect the amount of data transferred over the network.
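As an illustration of that last point, here is a hedged sketch reusing the hypothetical `EventSpout` and `CountBolt` from the question's example: in Storm, merely choosing a different stream grouping changes how much data crosses the network.

```java
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 2);

// shuffleGrouping may route every tuple to a task in another worker
// process, i.e., across the network:
builder.setBolt("count", new CountBolt(), 4)
       .shuffleGrouping("events");

// localOrShuffleGrouping prefers a task in the same worker process
// when one is available, so the same logic moves far less data:
// builder.setBolt("count", new CountBolt(), 4)
//        .localOrShuffleGrouping("events");
```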



0

