Hadoop MapReduce: whole-file input format

I'm trying to use Hadoop MapReduce, but instead of receiving one line at a time in my Mapper, I would like to receive the whole file at once.

So I found these two classes ( https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3 ) that should help me do this.

But I got a compilation error:

The method setInputFormat(Class) in the type JobConf is not applicable for the arguments (Class)    Driver.java    /ex2/src    line 33    Java Problem

I changed my driver class to:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

import forma.WholeFileInputFormat;

/*
 * Driver
 * The Driver class is responsible for creating the job and submitting it.
 */
public class Driver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Driver.class);
        conf.setJobName("Get minimum for each month");

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // previously it was:
        // conf.setInputFormat(TextInputFormat.class);
        // and it was changed to:
        conf.setInputFormat(WholeFileInputFormat.class);

        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf,new Path("input"));
        FileOutputFormat.setOutputPath(conf,new Path("output"));

        System.out.println("Starting Job...");
        JobClient.runJob(conf);
        System.out.println("Job Done!");
    }

}

      

What am I doing wrong?

+3




3 answers


The easiest way to do this is to gzip your input file. This will cause FileInputFormat.isSplitable() to return false.
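
If gzip is not an option, the same effect can be had by subclassing the old-API TextInputFormat and overriding isSplitable() yourself. This is only a minimal sketch (the class name NonSplittableTextInputFormat is made up here); note the mapper still gets one line per call, it just processes all of a file's lines in a single task:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Disables splitting so each input file is handled by exactly one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split, regardless of file size or compression
    }
}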



+2




Make sure your WholeFileInputFormat class has the correct imports. You are using the old MapReduce API in your driver. I think you have imported the new-API FileInputFormat in your WholeFileInputFormat class. If I am correct, you should import org.apache.hadoop.mapred.FileInputFormat in your WholeFileInputFormat class instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
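
For reference, here is a minimal sketch of what a whole-file input format can look like when built entirely on the old org.apache.hadoop.mapred API, so it is compatible with JobConf.setInputFormat(). The linked sample may differ in detail; the point is which package everything comes from:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Whole-file input format built on the OLD mapred API, so it matches JobConf.setInputFormat().
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one split per file, so one mapper sees the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

// Emits exactly one record per file: the key is empty, the value is the file's bytes.
class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable createKey() { return NullWritable.get(); }

    @Override
    public BytesWritable createValue() { return new BytesWritable(); }

    @Override
    public long getPos() throws IOException { return processed ? split.getLength() : 0; }

    @Override
    public float getProgress() throws IOException { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() throws IOException { }
}

With both classes on the old API, the FileInputFormat that the driver and the input format see is the same mapred one that JobConf expects, and the compilation error goes away.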



Hope it helps.

+1




We also encountered something similar and used an alternative approach that worked.

Let's say you need to process 100 large files (f1, f2, ..., f100), each of which must be read in its entirety inside a single map function. Instead of using the "WholeFileInputFormat" read approach, we created 10 equivalent text files (p1, p2, ..., p10), each containing the HDFS URLs or web addresses of some of the f1-f100 files.

So p1 will contain the URLs for f1-f10, p2 will refer to f11-f20, and so on.

These new files p1 through p10 are then used as input to the mappers. So the mapper processing file p1 will open files f1 through f10 one by one and process each of them completely.

This approach allowed us to control the number of mappers and write more comprehensive and complex logic in the MapReduce application. For example, we could run NLP on PDFs using this approach. A rough sketch of such a mapper is shown below.
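
As an illustration only, here is what such a mapper could look like with the old mapred API from the question. The class name PathListMapper and the placeholder output (file name and size in bytes) are assumptions; the real logic would do the actual per-file processing instead:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// The job input is a small text file listing HDFS paths, one per line.
// Each map() call opens that path and reads the whole file itself.
public class PathListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private JobConf conf;

    @Override
    public void configure(JobConf conf) {
        this.conf = conf;
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Path file = new Path(value.toString().trim()); // one HDFS path per input line
        FileSystem fs = file.getFileSystem(conf);
        long length = fs.getFileStatus(file).getLen();
        byte[] contents = new byte[(int) length];

        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }

        // Process the whole file here; as a placeholder, emit its size in bytes.
        output.collect(new Text(file.getName()), new IntWritable(contents.length));
    }
}

Since the p files are tiny, each of them normally becomes a single split, so the number of p files directly controls the number of map tasks.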

+1








