Hadoop tools to move files from local file system to HDFS
I am doing a POC on ways to import data from a shared network drive into HDFS. The data will sit in different folders on the shared drive, and each folder will correspond to a different directory in HDFS. I've looked at some popular tools that do this, but most are designed to move small pieces of data rather than whole files. These are the tools I found; are there others?
Apache Flume: If there are only a few production servers producing data and the data does not need to be written in real time, it may also make sense to just move the data to HDFS over WebHDFS or NFS, especially if the volume being written is relatively small: a few files of a few GB every few hours won't hurt HDFS. In that case, planning, configuring and deploying Flume may not be worth it. Flume is really designed to push events in real time, where the data stream is continuous and the volume is reasonably large. [from the Flume book on Safari Books Online]
Apache Kafka: producer-consumer model. Messages are persisted to disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without impacting performance.
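If you did go the Kafka route, the ingest side would be a producer pushing file contents as messages, with a consumer (for example a custom consumer or an HDFS sink connector) writing them out on the other side. A minimal producer sketch follows; the broker address, topic name, and file path are assumptions, and note that Kafka is message-oriented rather than file-oriented:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileLineProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish each line of a file as a message; a consumer on the other
            // side would be responsible for writing these into HDFS.
            for (String line : java.nio.file.Files.readAllLines(
                    java.nio.file.Paths.get("/mnt/share/folder1/file.csv"))) {
                producer.send(new ProducerRecord<>("ingest-topic", line));
            }
        }
    }
}
```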
Amazon Kinesis: a paid, managed service for real-time data, similar to Flume.
WebHDFS: Send an HTTP PUT request without following the redirect automatically and without sending the file data, then send a second HTTP PUT request to the URL returned in the Location header, this time with the file data to be written. [ http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE]
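A minimal sketch of that two-step CREATE call in Java; the NameNode host/port, user name, and file paths are placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsPut {
    public static void main(String[] args) throws Exception {
        // Step 1: PUT to the NameNode, with redirects disabled and no body.
        URL createUrl = new URL("http://namenode:50070/webhdfs/v1/data/folder1/file.csv"
                + "?op=CREATE&user.name=hdfs&overwrite=true");
        HttpURLConnection nn = (HttpURLConnection) createUrl.openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);   // do not follow the redirect automatically
        String dataNodeUrl = nn.getHeaderField("Location");  // DataNode URL to write to
        nn.disconnect();

        // Step 2: PUT the file contents to the URL returned in the Location header.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream()) {
            Files.copy(Paths.get("/mnt/share/folder1/file.csv"), out);
        }
        System.out.println("HTTP status: " + dn.getResponseCode());  // expect 201 Created
        dn.disconnect();
    }
}
```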
Open source projects: https://github.com/alexholmes/hdfs-file-slurper
My requirements are simple:
- Poll a directory for files; when a file arrives, copy it to HDFS and then move the file to a "processed" directory.
- I need to do this for multiple directories (a rough sketch of that loop is below).
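To make the requirement concrete, here is a minimal hand-rolled sketch of that loop against the Hadoop FileSystem Java API. The local and HDFS paths, NameNode address, and polling interval are made-up placeholders, and a real version would also need to skip files that are still being written:

```java
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectoryPoller {
    public static void main(String[] args) throws Exception {
        // One such source/target pair would be needed per shared-drive folder.
        File sourceDir = new File("/mnt/share/folder1");
        File processedDir = new File("/mnt/share/folder1/processed");
        Path hdfsDir = new Path("hdfs://namenode:8020/data/folder1");

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(hdfsDir.toUri(), conf)) {
            while (true) {
                File[] files = sourceDir.listFiles(File::isFile);
                if (files != null) {
                    for (File f : files) {
                        // Copy the file into HDFS, then move it to "processed" locally.
                        fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                                new Path(hdfsDir, f.getName()));
                        java.nio.file.Files.move(f.toPath(),
                                processedDir.toPath().resolve(f.getName()));
                    }
                }
                Thread.sleep(60_000);   // poll once a minute
            }
        }
    }
}
```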
Try Flume with a spooling directory source. You didn't specify your data size or rate, but I did a similar POC, from a local Linux filesystem to a Kerberized HDFS cluster, with good results using a single Flume agent running on an edge node.
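A minimal sketch of such an agent configuration; the agent name, directories, and HDFS path are hypothetical, and one source/sink pair (or a separate agent) would be needed per shared-drive folder:

```properties
# Hypothetical agent: watch a spooling directory and write its files to HDFS.
a1.sources  = src1
a1.channels = ch1
a1.sinks    = sink1

a1.sources.src1.type     = spooldir
a1.sources.src1.spoolDir = /mnt/share/folder1
a1.sources.src1.channels = ch1

a1.channels.ch1.type          = file
a1.channels.ch1.checkpointDir = /var/flume/checkpoint
a1.channels.ch1.dataDirs      = /var/flume/data

a1.sinks.sink1.type          = hdfs
a1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/data/folder1
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.channel       = ch1
```

Note that by default the spooling directory source marks ingested files as done by renaming them with a .COMPLETED suffix rather than moving them, and it expects files to be complete and immutable once they land in the spool directory.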
Try dtingest; it supports ingesting data from different sources such as shared drives, NFS, and FTP into HDFS. It also supports polling directories periodically. It should be available as a free trial download. It is built on the Apache Apex platform.
Check out Toad for Hadoop 1.5. The latest release introduces an FTP-client-style interface for Local to HDFS Sync, with many options to help users keep their local and HDFS environments in sync. Link to blog post here.