Why use Apache Kafka in real-time processing?
Lately I have been looking at real-time data processing with Storm, Flink, etc. Every architecture I've used has Kafka as a layer between the data sources and the stream processor. Why should this layer exist?
I think there are three main reasons for using Apache Kafka for real-time processing:
- Distribution
- Performance
- Reliability
In real-time processing, data has to be delivered quickly and reliably from the data sources to the stream processor. If you don't do this well, delivery can easily become the bottleneck of your real-time processing system. This is where Kafka can help.
In the past, traditional message brokers such as ActiveMQ and RabbitMQ weren't particularly good at handling massive amounts of data in real time. For this reason, LinkedIn engineers developed their own messaging system, Apache Kafka, to deal with this problem.
Distribution: Kafka is natively distributed, which matches the distributed nature of stream processing. Kafka divides the incoming data into offset-ordered partitions that are physically distributed across the cluster. These partitions can then feed the stream processor in a distributed manner.
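To make the idea of offset-ordered partitions concrete, here is a minimal in-memory sketch. It is only a toy model: the class `PartitionedTopic`, its methods, and the CRC32-based key hashing are assumptions made for illustration, not Kafka's actual partitioner or storage code.

```python
import zlib


class PartitionedTopic:
    """Toy model of a topic split into offset-ordered partitions (hypothetical)."""

    def __init__(self, num_partitions):
        # One append-only list per partition; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash, chosen only because it is stable across runs.
        # Kafka's own default partitioner uses a different hash function.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is simply the record's position within its partition.
        return partition, len(log) - 1


topic = PartitionedTopic(num_partitions=3)
for i in range(9):
    topic.append(f"sensor-{i % 4}", f"reading-{i}")
```

Records with the same key always land in the same partition, so ordering is preserved per key, while different partitions can live on different machines and be read in parallel.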
Performance: Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a wide margin, as you can see in this document. The main reasons are listed below:
- The Kafka producer does not have to wait for acknowledgements from the broker and can send data as fast as the broker can handle.
- Kafka has a more efficient storage format with less metadata.
- The Kafka broker is stateless with respect to consumers: it does not need to track how far each consumer has read.
- Kafka uses the UNIX sendfile API to efficiently deliver data from the broker to the consumer by reducing the number of data copies and system calls.
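The sendfile point can be illustrated with Python's standard library. This is a rough sketch of the technique, not Kafka's code: the function name `send_log_segment`, the temporary file standing in for a log segment, and the local socket pair standing in for a broker-to-consumer connection are all assumptions. `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to ordinary `send()` calls otherwise.

```python
import os
import socket
import tempfile


def send_log_segment(path):
    """Send a file over a local socket pair; return (bytes_sent, bytes_received)."""
    sender, receiver = socket.socketpair()
    try:
        with open(path, "rb") as segment:
            # Where os.sendfile() is available, the kernel moves the bytes from
            # the file to the socket without copying them into this process.
            sent = sender.sendfile(segment)
        sender.close()  # signals end of stream to the receiver

        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks)
    finally:
        sender.close()
        receiver.close()


# A small payload, so it fits in the socket buffer without a reader thread.
payload = b"offset-ordered log data\n" * 32
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    segment_path = f.name
try:
    sent, received = send_log_segment(segment_path)
finally:
    os.remove(segment_path)
```

The contrast is with a read-then-write loop, where every chunk is copied from the kernel into the application and back again before it reaches the network.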
Reliability: Kafka serves as a buffer between the data sources and the stream processor to absorb large data loads. Kafka simply stores all the incoming data, and consumers are responsible for deciding how much and how quickly they want to process it. This provides reliable load balancing, so that the stream processor is not overwhelmed by too much data.
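The buffering behaviour described above can be sketched as a pull-based log. This is a simplified in-memory model, assuming a hypothetical `BufferedLog` class with `append` and `poll` methods; it is not the real client API.

```python
class BufferedLog:
    """Toy append-only log: stores everything, readers pull at their own pace."""

    def __init__(self):
        self.records = []

    def append(self, record):
        # The log never pushes data to readers; it only stores it.
        self.records.append(record)

    def poll(self, offset, max_records):
        # The reader chooses where to start and how much to take.
        return self.records[offset:offset + max_records]


log = BufferedLog()
for i in range(1000):  # a sudden burst from the data source
    log.append(f"event-{i}")

position = 0
processed = []
while position < len(log.records):
    batch = log.poll(position, max_records=100)
    processed.extend(batch)  # stand-in for the real stream processing step
    position += len(batch)
```

Because the consumer asks for at most 100 records per call, a burst of 1000 events only increases how far behind it is, not the amount of work it has to do at once.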
Kafka's retention policy also makes it easy to recover from failures during processing (Kafka stores all data for 7 days by default). Each consumer keeps track of the offset of the last processed message. For this reason, if any consumer fails, it is easy to roll back to the point right before the failure and re-process from there, without losing information or having to replay the entire stream from the beginning.
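Offset-based recovery can be sketched in a few lines. The names `OffsetStore` and `consume`, and the `crash_at` parameter used to simulate a failure, are hypothetical; a plain Python list stands in for a retained Kafka partition.

```python
class OffsetStore:
    """Holds the offset of the next record to process (hypothetical helper)."""

    def __init__(self):
        self.committed = 0


def consume(log, store, sink, crash_at=None):
    """Process records from the committed offset onward."""
    offset = store.committed
    while offset < len(log):
        if offset == crash_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        sink.append(log[offset].upper())  # stand-in for real processing
        offset += 1
        store.committed = offset  # commit only after processing succeeded


log = [f"msg-{i}" for i in range(10)]  # retained data, still available after a crash
store = OffsetStore()
results = []

try:
    consume(log, store, results, crash_at=6)
except RuntimeError:
    pass  # the consumer died, but the committed offset survives

consume(log, store, results)  # the restarted consumer resumes at offset 6
```

This sketch commits after every record, so nothing is processed twice. Real consumers usually commit per batch, which means a restart may re-process a few records (at-least-once delivery).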