Persist an object to HDFS using Spark

I have a Person object, as shown below:

    Person person = new Person();
    person.setPersonId("10");
    person.setAge(20);

I want to save it to HDFS using Spark. This can be done using the save method of the DataFrame class, as shown below:

    dataFrame.save("hdfs://localhost:9000/sample.json");

but I haven't found a way to convert the Person object to an RDD or a DataFrame.

Is there a way to convert the object to an RDD or DataFrame?

1 answer


I suggest you put your Person objects into a List. SparkContext's parallelize API can then transform the List into an RDD, and the RDD's saveAsObjectFile API saves it to HDFS as a sequence file of serialized objects. I assume you are coding in Java. Sample code is shown below.

    SparkConf sparkConf = new SparkConf().setAppName("SparkSaveToHDFS");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    // Build the objects to persist; Person must implement
    // java.io.Serializable for saveAsObjectFile to work
    Person peter = new Person();
    peter.setName("Peter");
    peter.setAge(30);
    Person kevin = new Person();
    kevin.setName("Kevin");
    kevin.setAge(40);

    List<Person> personList = new ArrayList<Person>();
    personList.add(0, peter);
    personList.add(1, kevin);
    System.out.println("list contains Peter : " + personList.contains(peter) + " " + peter.getAge());
    System.out.println("list contains Kevin : " + personList.contains(kevin) + " " + kevin.getAge());

    // Turn the List into an RDD and save it to HDFS as an object file
    JavaRDD<Person> personRdd = ctx.parallelize(personList);
    personRdd.saveAsObjectFile("hdfs://hadoop-master:8020/Peter/test");
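Note that the Person class itself is not shown in the question. For saveAsObjectFile it must implement java.io.Serializable, and for createDataFrame below it should be a JavaBean. Here is a minimal sketch of such a class, assuming fields that match the setters used above:

    import java.io.Serializable;

    // Hypothetical Person bean matching the setters used above. Java
    // serialization (used by saveAsObjectFile) requires Serializable.
    public class Person implements Serializable {
        private String name;
        private int age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }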
Finally, use the SparkContext objectFile API to read the objects back from HDFS into an RDD, and build a DataFrame from it with SQLContext. Sample code below:

    // A SQLContext is needed to turn the RDD back into a DataFrame
    SQLContext sqlContext = new SQLContext(ctx);

    JavaRDD<Person> getPersonRdd = ctx.objectFile("hdfs://hadoop-master:8020/Peter/test");
    DataFrame schemaPeople = sqlContext.createDataFrame(getPersonRdd, Person.class);

    // Register the DataFrame as a temp table so it can be queried with SQL
    schemaPeople.registerTempTable("people");
    schemaPeople.printSchema();
    DataFrame people = sqlContext.sql("select * from people");
    people.show();
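As a side note, since the question's original goal was a JSON file on HDFS: once you have a DataFrame you can skip the object file entirely and write the data out as JSON. A minimal sketch, reusing personRdd and sqlContext from the snippets above and assuming Spark 1.4+ (which provides DataFrame.write()):

    DataFrame df = sqlContext.createDataFrame(personRdd, Person.class);
    // Writes a directory of part files, one JSON object per line
    df.write().format("json").save("hdfs://localhost:9000/sample.json");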