How do I move the stream of events to cold storage?
I have a stream of events (we could also call them "messages" or even just "data") coming from an event broker. The broker could be Kafka, Amazon Kinesis, or Microsoft Event Hubs, but let's say it's Kafka.
My goal is to take this stream of events and put it into cold storage; that is, store the data for future analysis with Hadoop/Spark. This means I would like to turn this "chatty" stream of events into "chunky" files in HDFS. In a cloud environment, I would probably use S3 or Azure Storage instead of HDFS.
I would also like my solution to be cost-effective, for example by using serialization formats like Avro/ORC to reduce disk-space costs. I also need to make sure each event lands in cold storage exactly once (bonus points for at-least-once-and-only-once delivery).
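For context on the cost angle: Avro keeps files small because the schema is written once per file and each record is encoded in a compact binary form. A minimal sketch of what a schema for such an event might look like (the record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example.events",
  "fields": [
    {"name": "id",        "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "payload",   "type": "string"}
  ]
}
```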
My main questions are:
- How do people solve this problem?
- Are there components that already handle this scenario?
- Do I need to develop a solution myself?
- At the very least, are there recommended design patterns?
We are using Kafka with Camus to pull data from Kafka into HDFS. Camus supports Avro serialization out of the box. You can find more about Camus and Avro here.
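A minimal sketch of what a Camus job configuration might look like, assuming an Avro-encoded topic named `events`; the broker address and HDFS paths are placeholders, and property names may vary between Camus versions:

```properties
# Decode Avro messages from Kafka and write them out as Avro files
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider

# Where finished files land, and where Camus keeps its execution state/offsets
etl.destination.path=/data/events
etl.execution.base.path=/camus/exec
etl.execution.history.path=/camus/exec/history

# Which cluster and topics to pull from
kafka.brokers=kafka1:9092
kafka.whitelist.topics=events
```

Camus runs as a MapReduce job, so you schedule it (e.g., hourly with cron or Oozie); each run drains whatever accumulated in Kafka since the last run and records its offsets in HDFS, so events are not fetched twice.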
Another option is to use Flume with a Kafka source (or Kafka channel) and an HDFS sink. The HDFS sink can be configured to roll files by size or time interval.
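As a rough sketch, a Flume agent for this pipeline might be configured as follows; the broker address, topic, and HDFS path are placeholders, and the roll settings show both time- and size-based rolling:

```properties
# One agent: Kafka source -> memory channel -> HDFS sink
agent.sources = kafka-src
agent.channels = mem
agent.sinks = hdfs-snk

agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.kafka.bootstrap.servers = kafka1:9092
agent.sources.kafka-src.kafka.topics = events
agent.sources.kafka-src.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs-snk.type = hdfs
agent.sinks.hdfs-snk.channel = mem
# Partition output by date; use the agent's clock for the %Y/%m/%d escapes
agent.sinks.hdfs-snk.hdfs.path = hdfs://namenode/data/events/%Y/%m/%d
agent.sinks.hdfs-snk.hdfs.useLocalTimeStamp = true
# Roll a new file every hour or at ~128 MB, whichever comes first
agent.sinks.hdfs-snk.hdfs.rollInterval = 3600
agent.sinks.hdfs-snk.hdfs.rollSize = 134217728
agent.sinks.hdfs-snk.hdfs.rollCount = 0
agent.sinks.hdfs-snk.hdfs.fileType = DataStream
```

Swapping the memory channel for a Kafka channel buffers events in Kafka itself, which avoids data loss if the agent dies.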