Distributed file processing in Hadoop?

I have a large number of compressed tar files, where each tar contains several files. I want to extract those files, and I would like to use Hadoop or a similar technique to speed up the processing. Are there any tools for this kind of problem? As far as I know, Hadoop and similar frameworks like Spark or Flink don't operate on files directly and don't give you direct access to the filesystem. I also want to do a basic rename of the extracted files and move them to their respective directories.

I can think of a solution where I create a list of all the tar files, pass that list to the mappers, and have each mapper extract one file from the list. Is this a sane approach?



2 answers


You can tell MapReduce to use an input format where the input to each mapper is one whole file (adapted from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3 ):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Emits each input file as a single record: the key is unused and the
// value holds the file's entire contents.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  // Never split a file across mappers; each tar must stay in one piece.
  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit inputSplit, TaskAttemptContext context) throws IOException,
      InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(inputSplit, context);
    return reader;
  }
}
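
The class above relies on a companion WholeFileRecordReader, which the linked source also provides. A minimal sketch of that reader, assuming the new mapreduce API used above, looks like this: it reads the whole split into a single BytesWritable on the first call to nextKeyValue().

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit split;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit inputSplit, TaskAttemptContext context) {
    this.split = (FileSplit) inputSplit;
    this.conf = context.getConfiguration();
  }

  // Emit exactly one record: the entire file as a byte array.
  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.readFully(in, contents, 0, contents.length);
    }
    value.set(contents, 0, contents.length);
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() {
    return NullWritable.get();
  }

  @Override
  public BytesWritable getCurrentValue() {
    return value;
  }

  @Override
  public float getProgress() {
    return processed ? 1.0f : 0.0f;
  }

  @Override
  public void close() {
    // nothing to close; the stream is opened and closed in nextKeyValue()
  }
}

Note that this buffers the whole file in memory, so it only works for archives that comfortably fit in a mapper's heap.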




Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
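
For example, a mapper along these lines unpacks each archive and writes the entries back to HDFS. This is only a sketch: it assumes the tars are gzip-compressed, the /extracted output directory is a made-up example, and the renaming rule is a placeholder you would replace with your own.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UntarMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {

  @Override
  protected void map(NullWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // The whole .tar.gz arrives as one BytesWritable from WholeFileInputFormat.
    try (TarArchiveInputStream tar = new TarArchiveInputStream(
        new GzipCompressorInputStream(
            new ByteArrayInputStream(value.getBytes(), 0, value.getLength())))) {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (!entry.isFile()) {
          continue;
        }
        // Placeholder rename: keep the entry's own name under /extracted.
        Path out = new Path("/extracted", entry.getName());
        try (FSDataOutputStream os = fs.create(out)) {
          IOUtils.copy(tar, os);
        }
        context.write(new Text(out.toString()), NullWritable.get());
      }
    }
  }
}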

You don't need to pass the list of files to Hadoop; just put all the files in one HDFS directory and use that directory as your input path.
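
Wiring it together, a driver for a map-only job could look like the following. The paths /input/tars and /output/log are hypothetical; the output directory only receives the small log records the mapper emits, since the extracted files are written directly to HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UntarJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "untar");
    job.setJarByClass(UntarJob.class);
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setMapperClass(UntarMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle or reduce needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input/tars"));   // directory holding the tars
    FileOutputFormat.setOutputPath(job, new Path("/output/log")); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}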



DistCp moves files from one location to another; you can look at its docs, but I don't think it offers an option to untar or decompress them. Also, if a file is larger than main memory, you are likely to get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?


