Does Apache Spark's JdbcRDD use HDFS?

Does Apache Spark's JdbcRDD use HDFS to store and distribute database records across worker nodes? We are using JdbcRDD to read from a database in Apache Spark. We are wondering whether Spark uses HDFS to store and distribute the records of the database table, or whether the worker nodes interact with the database directly.


1 answer


JdbcRDD does not use HDFS; it reads data over a JDBC connection directly into an RDD in memory. Each partition is computed on a worker node, which opens its own connection to the database. If you want the results on HDFS, you have to save the RDD to HDFS explicitly.
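As a rough illustration of that explicit step, here is a minimal sketch that reads a table through JdbcRDD and then writes it to HDFS with saveAsTextFile. The JDBC URL, credentials, table and column names, bounds, and output path are all hypothetical placeholders, and a suitable JDBC driver is assumed to be on the executors' classpath.

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-to-hdfs-sketch"))

    // Hypothetical connection details; replace with your own.
    val jdbcUrl = "jdbc:postgresql://db-host:5432/shop"

    val rows = new JdbcRDD(
      sc,
      // Invoked when a partition is computed, so each task opens its own connection.
      () => DriverManager.getConnection(jdbcUrl, "user", "password"),
      // JdbcRDD expects exactly two '?' placeholders for the per-partition bounds.
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      1L,      // lowerBound (assumed smallest id)
      100000L, // upperBound (assumed largest id)
      4,       // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

    // Nothing is written to HDFS unless you ask for it explicitly.
    rows
      .map { case (id, name) => s"$id\t$name" }
      .saveAsTextFile("hdfs:///tmp/customers_export") // hypothetical output path

    sc.stop()
  }
}
```

Without the saveAsTextFile call, the rows would only exist as RDD partitions on the workers for the lifetime of the job.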

You can see how JdbcRDD works in its source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala



RDDs must implement a compute method that returns an iterator over the values of each partition in the RDD. The JdbcRDD implementation simply wraps the JDBC result set in an iterator:

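override def getNext(): T = {
  if (rs.next()) {
    // Another row is available: convert it with the user-supplied mapRow function.
    mapRow(rs)
  } else {
    // Result set exhausted: tell the enclosing iterator to stop.
    finished = true
    null.asInstanceOf[T]
  }
}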
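To see the wrapping pattern in isolation, here is a self-contained sketch that runs without Spark or a database. LookaheadIterator is a simplified stand-in for the internal iterator base class that the snippet above plugs into, and FakeCursor is a hypothetical substitute for java.sql.ResultSet; neither is Spark's real code.

```scala
object CursorIteratorSketch {
  // Simplified stand-in for the iterator base class that calls getNext().
  abstract class LookaheadIterator[T] extends Iterator[T] {
    private var fetched = false
    private var nextValue: T = null.asInstanceOf[T]
    protected var finished = false

    // Returns the next element, or sets `finished = true` when there is none.
    protected def getNext(): T

    override def hasNext: Boolean = {
      if (!finished && !fetched) {
        nextValue = getNext()
        fetched = !finished
      }
      !finished
    }

    override def next(): T = {
      if (!hasNext) throw new NoSuchElementException("end of stream")
      fetched = false
      nextValue
    }
  }

  // Hypothetical stand-in for a JDBC ResultSet: the cursor starts before the first row.
  final class FakeCursor(rows: Vector[String]) {
    private var pos = -1
    def next(): Boolean = { pos += 1; pos < rows.length }
    def current: String = rows(pos)
  }

  // Same shape as the getNext() above: advance the cursor, map the row, or finish.
  def cursorIterator[T](rs: FakeCursor)(mapRow: FakeCursor => T): Iterator[T] =
    new LookaheadIterator[T] {
      override protected def getNext(): T =
        if (rs.next()) mapRow(rs)
        else {
          finished = true
          null.asInstanceOf[T]
        }
    }

  // Rows are pulled lazily, one at a time, as the iterator is consumed.
  val lengths: List[Int] =
    cursorIterator(new FakeCursor(Vector("alice", "bob")))(_.current.length).toList
  // lengths == List(5, 3)
}
```

The point of the design is that rows stream from the cursor as the consumer asks for them, so nothing has to be staged in an intermediate store first.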

      
