Number of Spark dataframe partitions created when reading data from a Hive table

I have a question about the number of partitions for Spark dataframes.

Suppose I have a Hive table (employee) with the columns (name, age, id, location):

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

If the employee table has 10 different locations, the data will be split into 10 partition directories in HDFS.
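
For instance, each distinct location value gets its own directory under the table path in HDFS (the warehouse path and location values below are just hypothetical examples, assuming the default Hive warehouse location):

/user/hive/warehouse/employee/location=NY/...
/user/hive/warehouse/employee/location=LA/...
...
/user/hive/warehouse/employee/location=SF/...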

If I create a Spark dataframe (df) by reading all the data of the Hive employee table,

How many Spark partitions will be created for the dataframe (df)?

df.rdd.partitions.size = ??
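
For reference, a minimal sketch of how df could be created and the partition count checked (assuming Spark 2.x with Hive support enabled; the app name is arbitrary):

import org.apache.spark.sql.SparkSession

// Session with Hive support so the metastore table can be read
val spark = SparkSession.builder()
  .appName("employee-partition-count")   // arbitrary name
  .enableHiveSupport()
  .getOrCreate()

// Read the whole Hive employee table into a dataframe
val df = spark.table("employee")

// Number of Spark partitions backing the dataframe
println(df.rdd.partitions.size)   // equivalently: df.rdd.getNumPartitions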



1 answer


Partitions are created based on the HDFS block size.

Imagine you read the 10 Hive partitions as a single RDD. If the HDFS block size is 128 MB, then

no of partitions = (total size of the 10 partitions in MB) / 128 MB

because that is how the data is stored in HDFS blocks.
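
A quick worked example (the per-partition size is assumed purely for illustration; it is not given in the question): if each of the 10 location partitions holds about 1 GB of data, then

total size = 10 * 1024 MB = 10240 MB
no of partitions = 10240 MB / 128 MB = 80

so df.rdd.partitions.size would come out to roughly 80.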

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output
