How do I move the stream of events to cold storage?

I have a stream of events (we could also call them "messages" or even just "data") coming from an event broker. The event broker could be Kafka, Amazon Kinesis, or Microsoft Event Hubs, but let's say it's Kafka.

My goal is to take this stream of events and put it into cold storage; that is, store the data for future analysis via Hadoop/Spark. This means I would like to take this "chatty" stream of small events and convert it into "chunky" files in HDFS. In a cloud environment, I would probably use S3 or Azure Storage instead of HDFS.

I would also like my solution to be cost-effective, for example by using serialization formats like Avro/ORC to reduce disk-space costs. I also require that each event end up in cold storage at least once (bonus points if each is stored once and only once).

My main questions are:

  • How do people solve this problem?
  • Are there components that already handle this scenario?
  • Do I need to develop a solution myself?
  • At the very least, are there recommended patterns?


2 answers


OK, we are using Kafka with Camus to pull data from Kafka into HDFS. Camus supports Avro serialization. You can find more about Camus and Avro here.
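For reference, here is a minimal sketch of what a Camus run can look like. Camus runs as a Hadoop MapReduce job driven by a properties file; the broker address, topic name, and HDFS paths below are hypothetical placeholders, with property names following Camus's example configuration:

    # camus.properties -- illustrative values, adapt to your cluster
    camus.job.name=events-to-hdfs

    # Kafka brokers and topics to pull (hypothetical host and topic)
    kafka.brokers=broker1:9092
    kafka.whitelist.topics=events

    # Decode messages as Avro and write Avro files to HDFS
    camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
    etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider

    # Output location plus Camus bookkeeping paths (consumed offsets, run history)
    etl.destination.path=/data/events
    etl.execution.base.path=/camus/exec
    etl.execution.history.path=/camus/exec/history

The job is then launched periodically (for example from cron or Oozie):

    hadoop jar camus-example-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties

Each run resumes from the offsets recorded by the previous run, which is what gives you at-least-once delivery into HDFS.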





Another option is to use Flume with a Kafka source (or a Kafka channel) and an HDFS sink. The HDFS sink can be configured to roll files at specific sizes or time intervals.
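To illustrate, a minimal agent configuration along those lines; the agent name, ZooKeeper address, topic, and HDFS path are hypothetical, and the roll thresholds are just examples (roll a new file at 128 MB or after one hour, whichever comes first):

    # flume.conf -- illustrative values
    agent.sources  = kafka-src
    agent.channels = mem-ch
    agent.sinks    = hdfs-sink

    # Kafka source (Flume 1.6-style properties)
    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.zookeeperConnect = zk1:2181
    agent.sources.kafka-src.topic = events
    agent.sources.kafka-src.channels = mem-ch

    agent.channels.mem-ch.type = memory
    agent.channels.mem-ch.capacity = 10000

    # HDFS sink: time-bucketed path, rolling by size or time
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem-ch
    agent.sinks.hdfs-sink.hdfs.path = /data/events/%Y/%m/%d
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
    agent.sinks.hdfs-sink.hdfs.rollInterval = 3600
    agent.sinks.hdfs-sink.hdfs.rollCount = 0

Using a Kafka channel instead of the memory channel removes the in-memory buffer as a point of data loss, since events stay in Kafka until the sink has written them out.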


