Writing SQL vs. using DataFrame APIs in Spark SQL
I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw and application tiers in HDFS and performs CDC (change data capture); today it is written as Hive queries and orchestrated through Oozie. This needs to be migrated to a Spark application (current version 1.6). Another section of the code will be carried over later.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (for example sqlContext.sql("my hive hql")). Another way would be to use the DataFrame APIs and rewrite the HQL that way.
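For reference, here is a rough sketch of the two styles I mean (Spark 1.6, Scala; the database, table and column names are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("IngestionCDC"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Approach 1: run the existing Hive HQL as-is
    val viaSql = sqlContext.sql(
      "SELECT order_id, status FROM raw_db.orders WHERE load_date = '2016-01-01'")

    // Approach 2: express the same logic with the DataFrame API
    val viaDf = sqlContext.table("raw_db.orders")
      .filter($"load_date" === "2016-01-01")
      .select("order_id", "status")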
What is the difference between the two approaches?
Are there any performance gains when using the Dataframe API?
Some people have suggested that there is an additional SQL layer the Spark core has to go through when using "SQL" queries directly, which may have some performance impact, but I have not found any material to support that claim. I know the code would be much more compact with the DataFrame API, but when I already have all the handy HQL queries, is it really worth rewriting everything with the DataFrame API?
Thank you.
Question: What is the difference in these two approaches? Are there any performance gains when using the Dataframe API?
Answer:
There is a comparative study done by Hortonworks (source).
The gist is that which one to use depends on the situation / scenario; each can be the right choice, and there is no hard and fast rule to decide this. Please see below.
RDDs, DataFrames and SparkSQL (in fact, all three are relevant, not only the two you asked about):
At its core, Spark uses the concept of Resilient Distributed Datasets, or RDDs (a short sketch follows this list):
- Resilient - if data in memory is lost, it can be recreated
- Distributed - an immutable, distributed collection of objects in memory, partitioned across many data nodes in a cluster
- Dataset - the initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
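A minimal sketch of working with an RDD (assuming sc is an existing SparkContext; the file path and record layout are hypothetical):

    // Build an RDD from a file on HDFS; its partitions are spread across the cluster
    val ordersRdd = sc.textFile("hdfs:///data/raw/orders")

    // Transformations are lazy; the recorded lineage lets lost partitions be recomputed
    val completedCount = ordersRdd
      .map(_.split(","))
      .filter(fields => fields(3) == "COMPLETE")
      .count()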
The DataFrame API is a data abstraction framework that organizes your data into named columns (a short sketch follows this list):
- Create schema for data
- Conceptually equivalent to a table in a relational database
- Can be generated from many sources including structured data files, tables in Hive, external databases, or existing RDDs
- Provides a relational data view for simple SQL operations such as data manipulation and aggregation
- Under the hood, it is built on top of RDDs
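For example, a DataFrame can be created from several kinds of sources (a sketch reusing the sqlContext and ordersRdd from the snippets above; paths, table and case class names are assumptions):

    import sqlContext.implicits._

    // From a Hive table
    val hiveDf = sqlContext.table("raw_db.orders")

    // From a structured data file (JSON here)
    val jsonDf = sqlContext.read.json("hdfs:///data/stage/orders.json")

    // From an existing RDD, by giving it a schema via a case class
    case class Order(orderId: Long, productName: String)
    val rddDf = ordersRdd
      .map(_.split(","))
      .map(f => Order(f(0).toLong, f(1)))
      .toDF()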
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL via:
- SQL
- DataFrames API
- Datasets API (sketched briefly after this list)
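The SQL and DataFrame entry points are sketched above; the Dataset API (experimental in Spark 1.6) adds compile-time types on top of a DataFrame, roughly like this (reusing the hypothetical Order case class from the previous snippet):

    // Map the DataFrame columns onto the typed Order case class
    val ordersDs = sqlContext.table("raw_db.orders")
      .select($"order_id".cast("long").as("orderId"), $"product_name".as("productName"))
      .as[Order]

    // Operate on typed objects instead of untyped Rows
    val bigOrderProducts = ordersDs.filter(o => o.orderId > 1000000L).map(_.productName)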
Test results:
- RDDs outperformed DataFrames and SparkSQL for certain types of data processing
- DataFrames and SparkSQL performed almost identically, although SparkSQL had a slight advantage for the analyses involving aggregation and sorting
- Syntactically, DataFrames and SparkSQL are much more intuitive than using RDDs
- The best of 3 runs was taken for each test
- Timings were consistent, with no big differences between runs
- Jobs were run individually, with no other jobs running
- The workloads were a random lookup of 1 order ID out of 9 million unique order IDs, and grouping all the different products with their total counts, sorted descending by product name (both are sketched below)
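Expressed in both styles, those two workloads look roughly like this (a sketch; the orders table and column names are assumptions):

    // Random lookup of a single order ID
    val lookupSql = sqlContext.sql("SELECT * FROM raw_db.orders WHERE order_id = 3568555")
    val lookupDf  = sqlContext.table("raw_db.orders").filter($"order_id" === 3568555)

    // Group all products with their total counts, sorted descending by product name
    val countsSql = sqlContext.sql(
      "SELECT product_name, COUNT(*) AS cnt FROM raw_db.orders " +
      "GROUP BY product_name ORDER BY product_name DESC")
    val countsDf = sqlContext.table("raw_db.orders")
      .groupBy("product_name")
      .count()
      .orderBy($"product_name".desc)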
If a query is lengthy, writing and running it efficiently as one long SQL string becomes impractical. The DataFrame API, together with the Column API, on the other hand, helps the developer write compact code, which is ideal for ETL applications.
In addition, all operations (e.g. greater than, less than, select, where, etc.) performed through the DataFrame API build an "Abstract Syntax Tree (AST)", which is then passed to Catalyst for further optimization. (Source: Spark SQL documentation, Section 3.3)
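You can observe this yourself: however the query is written, explain(true) prints the logical plan Catalyst built and optimized along with the chosen physical plan (a sketch, continuing with the hypothetical orders table):

    val filtered = sqlContext.table("raw_db.orders")
      .where($"order_amount" > 100 && $"order_amount" < 500)
      .select("order_id", "product_name")

    // Prints the parsed, analyzed and optimized logical plans plus the physical plan;
    // an equivalent sqlContext.sql(...) query produces the same optimized plan
    filtered.explain(true)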