Why is .show() on a 20-row PySpark DataFrame so slow?

I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is fine.

toydf = df.select("column_A").limit(20)

However, the next step, show(), takes 2-3 minutes. The result only contains 20 rows of lists of integers, and each list contains at most 60 elements. Why does it take so long?

toydf.show()

df is generated like this:

from pyspark.sql import SparkSession

# conf is a SparkConf object defined earlier in the notebook
spark = SparkSession.builder\
    .config(conf=conf)\
    .enableHiveSupport()\
    .getOrCreate()
df = spark.sql("""SELECT column_A
                  FROM datascience.email_aac1_pid_enl_pid_1702""")



1 answer


There are two main concepts in Spark:

1. Transformations: whenever you apply withColumn, drop, join, or groupBy, nothing is actually evaluated; these operations just lazily produce a new DataFrame or RDD.

2. Actions: in contrast, actions such as count, show, collect, or write actually perform all of the transformation work. Internally, each action calls Spark's runJob API to execute the accumulated transformations as a job (see the timing sketch below).
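
To make this concrete, here is a minimal, self-contained sketch of the difference. It uses a local session and generated data rather than your Hive table, so the setup is an assumption for illustration only:

import time
from pyspark.sql import SparkSession

# Local session and synthetic data, purely for illustration.
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000).toDF("column_A")

# Transformation: returns immediately; no data is read or computed.
start = time.time()
toydf = df.select("column_A").limit(20)
print(f"transformation took {time.time() - start:.3f}s")  # near-instant

# Action: this is where the whole plan actually runs.
start = time.time()
toydf.show()
print(f"action took {time.time() - start:.3f}s")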



In your case, when you run toydf = df.select("column_A").limit(20), nothing happens.

But when you call the show() method, which is an action, Spark collects data from the cluster to the driver node, and only at that point does it actually evaluate toydf = df.select("column_A").limit(20), which includes running the underlying Hive query.
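
One way to see this for yourself (a sketch, assuming the toydf from the question) is to print the physical plan with explain(); it will include the scan of the Hive table, which is the work show() has to execute. If you need to run several actions on the same small result, caching it avoids repeating that work:

# Prints the plan that show() executes, including the Hive table scan.
toydf.explain()

# Cache the 20 rows so the expensive plan runs only once.
toydf.cache()
toydf.show()   # first action: runs the full plan and populates the cache
toydf.count()  # later actions reuse the cached rows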







