How to store and read data from Spark PairRDD

A Spark PairRDD can be saved to a file:

JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));

JavaPairRDD<String, Integer> myPairRDD =
    baseRDD.mapToPair(new PairFunction<String, String, Integer>() {

      @Override
      public Tuple2<String, Integer> call(String input) throws Exception {
        // Pair each word with its length
        return new Tuple2<String, Integer>(input, input.length());
      }
    });

myPairRDD.saveAsTextFile("path");

      

However, the Spark context's textFile method only reads the data back as a JavaRDD<String>.

How can I restore the PairRDD directly from the saved file?

Note:

  • A possible approach is to read the data into a JavaRDD<String> and build a JavaPairRDD from it. But with huge data, this takes up a significant amount of resources.

  • Saving this intermediate file in non-text format is fine too.

  • Runtime - JRE 1.7
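For reference, the first approach in the note comes down to parsing each saved line back into a key and a value. saveAsTextFile writes the toString() of each element, so a Tuple2 ends up as a line like "(This,4)". Below is a minimal sketch of that parsing step in plain Java 7, without Spark. The class PairLineParser and its parse method are hypothetical names chosen for illustration, and the sketch assumes the value is an integer and the line has no other parentheses wrapping it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: parses lines written by saveAsTextFile.
final class PairLineParser {

    // Parses one line such as "(This,4)" into a (String, Integer) pair.
    // Splits on the LAST comma, so keys that contain commas still parse.
    static Map.Entry<String, Integer> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.length() < 2
                || trimmed.charAt(0) != '('
                || trimmed.charAt(trimmed.length() - 1) != ')') {
            throw new IllegalArgumentException("Not a tuple line: " + line);
        }
        String body = trimmed.substring(1, trimmed.length() - 1);
        int comma = body.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("No key/value separator: " + line);
        }
        String key = body.substring(0, comma);
        int value = Integer.parseInt(body.substring(comma + 1).trim());
        return new SimpleImmutableEntry<String, Integer>(key, value);
    }

    public static void main(String[] args) {
        // Sample lines in the format the question's myPairRDD would be saved in.
        List<String> lines = Arrays.asList("(This,4)", "(is,2)", "(dummy,5)", "(data,4)");
        for (String line : lines) {
            Map.Entry<String, Integer> pair = parse(line);
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

In Spark, the same logic would sit inside the call method of a PairFunction<String, String, Integer> applied to the JavaRDD<String> returned by context.textFile("path"). Every record is parsed again on each read, which is the overhead the note is worried about.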



3 answers


Saving the Spark PairRDD as a sequence file works well in this scenario.

JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));

JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {

  @Override
  public Tuple2<Text, IntWritable> call(String input) throws Exception {
    // Wrap the word and its length in Hadoop Writable types
    return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
  }
});

myPairRDD.saveAsHadoopFile(path, Text.class, IntWritable.class,
    SequenceFileOutputFormat.class);

JavaPairRDD<Text, IntWritable> newbaseRDD =
    context.sequenceFile(path, Text.class, IntWritable.class);

// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
  @Override
  public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
    System.out.println(arg0);
  }
});

      



As suggested by user52045, the following code works with Java 8.

myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, Integer> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);

      



You can save them as an object file if you don't mind the result file not being human-readable.

Save the file:

myPairRDD.saveAsObjectFile(path);

      

Then you can read the pairs back like this:



JavaPairRDD.fromJavaRDD(sc.objectFile(path))

      

EDIT: a working example:

JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD 
    = JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
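
As far as I know, saveAsObjectFile relies on standard Java serialization, which is why the pairs come back with their types intact and need no parsing, and also why keys and values have to be Serializable. Here is a minimal sketch of that kind of round trip using only java.io, with no Spark involved. The class ObjectRoundTrip and its nested Pair are hypothetical stand-ins (Pair plays the role of Tuple2<String, String>), and the byte array stands in for the file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain java.io sketch (no Spark): serialized pairs keep their field types.
final class ObjectRoundTrip {

    // Hypothetical stand-in for scala.Tuple2<String, String>.
    static final class Pair implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final String value;

        Pair(String key, String value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + key + "," + value + ")";
        }
    }

    // Serializes the whole list to bytes with Java serialization.
    static byte[] write(List<Pair> pairs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<Pair>(pairs));
        } catch (IOException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
        return bytes.toByteArray();
    }

    // Restores the list; no string parsing is needed to get typed fields back.
    @SuppressWarnings("unchecked")
    static List<Pair> read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<Pair>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Deserialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Same sample data as the working example: ("1", "1") and ("2", "2").
        List<Pair> restored = read(write(Arrays.asList(new Pair("1", "1"), new Pair("2", "2"))));
        for (Pair pair : restored) {
            System.out.println(pair);
        }
    }
}
```

The trade-off is the one mentioned at the top of this answer: the stored bytes are not human-readable, and they can only be read back by code that has the same classes available.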

      



An example using Scala:

Reading a text file and saving it in object file format:

val ordersRDD = sc.textFile("/home/cloudera/orders.txt")
ordersRDD.count()
ordersRDD.saveAsObjectFile("orders_save_obj")

      

Reading an object file and saving it as a text file:

// Read from the object file written in the previous step, not the original text file
val ordersRDD = sc.objectFile[String]("orders_save_obj")
ordersRDD.count()
ordersRDD.saveAsTextFile("orders_save_text")

      
