HDFS data availability event notification?

What would be the best approach to implementing a notification system for Hadoop that signals when data becomes available? Whenever new data comes in, a notification should be generated that the job management framework can use to start the work that depends on that data. The main requirement is that the job starts as soon as the data is available, instead of polling the NameNode for data availability.



1 answer


What I would do is use a producer/consumer model in which the two sides communicate through a queue such as Amazon SQS.

The producer maintains a list of watched directories and runs hadoop fs -test -e /path/to/watched/dir every x seconds (where x should be a parameter). If the command exits with status 0 (check $?), the directory exists and the producer posts a message to the queue. The content of the message can be just the name of the directory that just appeared, or you can add some metadata and send it as a JSON object with additional fields.
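The producer side could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `send` callable, and the JSON fields `path` and `detected_at` are hypothetical and not part of any Hadoop or SQS API. The only external command used is the `hadoop fs -test -e` check mentioned above.

```python
import json
import subprocess
import time


def hdfs_path_exists(path, run=subprocess.run):
    """Return True if `hadoop fs -test -e PATH` exits with status 0."""
    result = run(["hadoop", "fs", "-test", "-e", path])
    return result.returncode == 0


def poll_once(watched, announced, send, exists=hdfs_path_exists):
    """Check every watched directory once and announce new arrivals."""
    for path in watched:
        if path in announced:
            continue  # already reported, so do not notify twice
        if exists(path):
            # Hypothetical message layout: the directory name plus a timestamp.
            send(json.dumps({"path": path, "detected_at": time.time()}))
            announced.add(path)


def run_producer(watched, send, interval_seconds, exists=hdfs_path_exists):
    """Poll forever; interval_seconds is the "x" parameter."""
    announced = set()
    while True:
        poll_once(watched, announced, send, exists)
        time.sleep(interval_seconds)
```

Here `send` is any callable that takes the message body as a string. With Amazon SQS it could be a small wrapper around the boto3 client's `send_message(QueueUrl=..., MessageBody=...)` call. Passing it in as a parameter keeps the polling logic independent of the queue that is actually used.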



On the other side, the consumer polls the queue every y seconds (where y should be a parameter), and as soon as a new message arrives it can start the job that works on that directory.
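A matching consumer loop might look like the sketch below. It assumes each message body is either a plain directory name or a JSON object with a `path` field. That field name, the `receive` and `start_job` callables, and the function names are all hypothetical and chosen only for illustration.

```python
import json
import time


def extract_path(body):
    """Accept either a bare directory name or a JSON object with "path"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return body  # not JSON, so treat the body as the directory name
    if isinstance(payload, dict) and "path" in payload:
        return payload["path"]
    return body


def drain_once(receive, start_job):
    """Fetch pending messages once and start one job per directory."""
    started = []
    for body in receive():
        path = extract_path(body)
        start_job(path)
        started.append(path)
    return started


def run_consumer(receive, start_job, interval_seconds):
    """Poll the queue forever; interval_seconds is the "y" parameter."""
    while True:
        drain_once(receive, start_job)
        time.sleep(interval_seconds)
```

Here `receive` is expected to return a list of message bodies (an empty list when nothing is pending), and `start_job` is whatever hook the job management framework provides. With SQS, `receive` could wrap the boto3 client's `receive_message` and `delete_message` calls.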
