How do you know that new data has been added to HDFS?

I am implementing a notification system based on the publish-subscribe model that announces data availability as soon as data arrives in (is loaded into) HDFS. I haven't found a way to detect this. Is there an HDFS API that can be used for this, or what approach should I take to learn about new data written to HDFS? I am using Hadoop v2.0.2 and I do not want to use HCatalog; I want to implement my own tool for this.



2 answers


What you are looking for is the Oozie Coordinator.

HDFS is a filesystem, so something has to be built on top of HDFS to check for files. HBase has coprocessors that can run procedures, but they are available only for HBase tables, so they cannot be used to detect data availability in HDFS.

Oozie is a workflow scheduler system for managing Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it:



Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (e.g. Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (e.g. Java programs and shell scripts) ...

This way you can use a file availability trigger for your notification system.



If you are using HDFS, you could look at HBase, which has the functionality you need. In HBase you can create a pre-put (or post-put) coprocessor, which acts much like a MySQL trigger: it runs a piece of code every time data is written to the table.

If HBase doesn't fit your use case and you must use HDFS, AFAIK there are no similar triggers. You can try wrapping the HDFS API with your own code so that it sends a notification whenever data is written to the filesystem under the appropriate conditions. Alternatively, you can poll HDFS for changes (which sounds like an ugly alternative) ...
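To make the polling option more concrete, here is a minimal sketch of the comparison step, under some assumptions. The class `NewDataDetector` and its methods `subscribe` and `poll` are hypothetical names made up for this illustration. The listing is passed in as a plain map from path to modification time so that the sketch runs without a cluster; in a real tool that map would presumably be filled from `FileSystem.listStatus(Path)` and `FileStatus.getModificationTime()` in the Hadoop Java API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical helper: reports paths that are new or modified between two listings. */
class NewDataDetector {
    // Modification time (ms) recorded for each path in the previous listing.
    private final Map<String, Long> seen = new HashMap<>();
    // Callbacks to invoke for every new or modified path.
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    /** Registers a subscriber that receives each changed path. */
    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    /**
     * Compares one listing (path -> modification time) with the previous one.
     * Returns the new or modified paths in sorted order and notifies subscribers.
     */
    List<String> poll(Map<String, Long> listing) {
        List<String> changed = new ArrayList<>();
        // TreeMap gives a deterministic (sorted) iteration order.
        for (Map.Entry<String, Long> entry : new TreeMap<>(listing).entrySet()) {
            Long previous = seen.get(entry.getKey());
            if (previous == null || previous < entry.getValue()) {
                changed.add(entry.getKey());
            }
        }
        // Remember only the current listing, so deleted paths are forgotten.
        seen.clear();
        seen.putAll(listing);
        for (String path : changed) {
            for (Consumer<String> subscriber : subscribers) {
                subscriber.accept(path);
            }
        }
        return changed;
    }
}
```

A scheduler (for example a `ScheduledExecutorService`) could call `poll` at a fixed interval with a fresh listing. Note that a path appearing in a listing does not guarantee the writer has finished, so a real tool would probably also need some convention for marking data as complete.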



Hope it helps
