Number of Spark dataframe partitions created when reading data from a Hive table

I have a question about the number of partitions for Spark dataframes.

Suppose I have a Hive table (employee) with the columns (name, age, id, location):

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

If the employee table has 10 different locations, the data will be split into 10 partition directories in HDFS.
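
For instance, each distinct location value gets its own directory under the table path in HDFS (the warehouse path and location values below are just hypothetical examples, assuming the default Hive warehouse location):

/user/hive/warehouse/employee/location=NY/...
/user/hive/warehouse/employee/location=LA/...
...
/user/hive/warehouse/employee/location=SF/...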

If I create a Spark dataframe (df) by reading all the data of the Hive employee table,

How many Spark partitions will be created for the dataframe (df)?

df.rdd.partitions.size = ??
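
For reference, a minimal sketch of how df could be created and the partition count checked (assuming Spark 2.x with Hive support enabled; the app name is arbitrary):

import org.apache.spark.sql.SparkSession

// Session with Hive support so the metastore table can be read
val spark = SparkSession.builder()
  .appName("employee-partition-count")   // arbitrary name
  .enableHiveSupport()
  .getOrCreate()

// Read the whole Hive employee table into a dataframe
val df = spark.table("employee")

// Number of Spark partitions backing the dataframe
println(df.rdd.partitions.size)   // equivalently: df.rdd.getNumPartitions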



1 answer


Partitions are created based on the HDFS block size.

Imagine you read the 10 Hive partitions as a single RDD. If the HDFS block size is 128 MB, then

no of partitions = (total size of the 10 partitions in MB) / 128 MB

because that is how the data is stored in HDFS blocks.
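
A quick worked example (the per-partition size is assumed purely for illustration; it is not given in the question): if each of the 10 location partitions holds about 1 GB of data, then

total size = 10 * 1024 MB = 10240 MB
no of partitions = 10240 MB / 128 MB = 80

so df.rdd.partitions.size would come out to roughly 80.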

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output
