Does Apache Spark's JdbcRDD use HDFS?

Question

Does Apache Spark's JdbcRDD use HDFS to store database records and distribute them across worker nodes? We are using JdbcRDD to interact with a database from Apache Spark. We are wondering whether Spark uses HDFS to store and distribute the records from the database table, or whether the worker nodes interact with the database directly.

hadoop hdfs apache-spark rdd spark-streaming

Jessica smith 05 Aug '15 at 9:21

1 answer

mattinbits · Answer 1 · 2015-08-05T09:30:33+0000

JdbcRDD does not use HDFS; it reads data over a JDBC connection directly into an RDD in the workers' memory. If you want the results on HDFS, you have to save the RDD to HDFS explicitly.
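To illustrate the point, here is a minimal sketch of reading a table with JdbcRDD and then writing it to HDFS as a separate, explicit step. The JDBC URL, credentials, table and column names, id bounds, and HDFS path are all hypothetical placeholders, and the sketch assumes a matching JDBC driver is on the executors' classpath.

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-to-hdfs"))

    // Hypothetical connection string; replace with your own database.
    val url = "jdbc:postgresql://db-host:5432/shop"

    val rows = new JdbcRDD[(Long, String)](
      sc,
      // Called on the worker for each partition, so every task opens its own connection.
      () => DriverManager.getConnection(url, "user", "password"),
      // The query needs two '?' placeholders; each partition binds its own id range to them.
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      1L,      // lowerBound (assumed smallest id)
      100000L, // upperBound (assumed largest id)
      4,       // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

    // Nothing touches HDFS until you write the RDD out yourself.
    rows
      .map { case (id, name) => s"$id,$name" }
      .saveAsTextFile("hdfs://namenode:8020/data/customers")

    sc.stop()
  }
}
```

In this sketch the workers talk to the database directly, and the only HDFS interaction is the final saveAsTextFile call.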

You can see how JdbcRDD works in its source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala

RDDs must implement a compute method that returns an iterator over the values of each partition in the RDD. The JdbcRDD implementation simply wraps the JDBC result set in an iterator:

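override def getNext(): T = {
  if (rs.next()) {
    // Another row is available: convert it with the user-supplied mapRow function.
    mapRow(rs)
  } else {
    // Result set exhausted: mark the iterator as finished; the returned value is ignored.
    finished = true
    null.asInstanceOf[T]
  }
}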
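
The getNext pattern above can be tried without Spark or a database. The following is a simplified sketch, not Spark's actual code: FakeCursor is a hypothetical stand-in for java.sql.ResultSet, and LookaheadIterator is a cut-down imitation of Spark's internal NextIterator helper (no close handling). It shows how a forward-only cursor is turned into a lazy Scala Iterator that pulls one row at a time.

```scala
object CursorDemo {
  // Hypothetical stand-in for java.sql.ResultSet: a forward-only cursor over rows.
  final class FakeCursor(rows: Vector[String]) {
    private var pos = -1
    def next(): Boolean = { pos += 1; pos < rows.length }
    def current: String = rows(pos)
  }

  // Simplified lookahead iterator in the spirit of Spark's NextIterator.
  abstract class LookaheadIterator[T] extends Iterator[T] {
    private var fetched = false
    private var nextValue: T = null.asInstanceOf[T]
    protected var finished = false

    // Return the next element, or set `finished = true` when the source is exhausted.
    protected def getNext(): T

    override def hasNext: Boolean = {
      if (!finished && !fetched) {
        nextValue = getNext()
        fetched = true
      }
      !finished
    }

    override def next(): T = {
      if (!hasNext) throw new NoSuchElementException("end of cursor")
      fetched = false
      nextValue
    }
  }

  // Wrap a cursor so that rows are fetched only when the consumer asks for them.
  def cursorIterator[T](cursor: FakeCursor)(mapRow: FakeCursor => T): Iterator[T] =
    new LookaheadIterator[T] {
      override protected def getNext(): T =
        if (cursor.next()) mapRow(cursor)
        else {
          finished = true
          null.asInstanceOf[T]
        }
    }

  // Map each row to its length, the way mapRow converts a ResultSet row.
  def demo(): List[Int] =
    cursorIterator(new FakeCursor(Vector("a", "bb", "ccc")))(_.current.length).toList

  def main(args: Array[String]): Unit =
    println(demo()) // prints List(1, 2, 3)
}
```

Because the iterator is lazy, rows stream through the task as they are consumed rather than being staged in an intermediate store first.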
