Writing SQL vs. using DataFrame APIs in Spark SQL
I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw and application tiers in HDFS and performs CDC (change data capture); today it is written as Hive queries and orchestrated through Oozie. This needs to be migrated to a Spark application (current version 1.6). Another section of the code will be carried over later.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (for example sqlContext.sql("my hive hql")). Another way would be to use the DataFrame APIs and rewrite the HQL that way.
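For reference, here is a rough sketch of the two styles I mean (Spark 1.6, Scala; the database, table and column names are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("IngestionCDC"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Approach 1: run the existing Hive HQL as-is
    val viaSql = sqlContext.sql(
      "SELECT order_id, status FROM raw_db.orders WHERE load_date = '2016-01-01'")

    // Approach 2: express the same logic with the DataFrame API
    val viaDf = sqlContext.table("raw_db.orders")
      .filter($"load_date" === "2016-01-01")
      .select("order_id", "status")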
What is the difference between the two approaches?
Are there any performance gains when using the Dataframe API?
Some people have suggested that there is an additional SQL layer the Spark core has to go through when using "SQL" queries directly, which may have some performance impact, but I have not found any material to support that claim. I know the code would be much more compact with the DataFrame API, but when I already have all the handy HQL queries, is it really worth rewriting everything with the DataFrame API?
Thank you.
Question: What is the difference in these two approaches? Are there any performance gains when using the Dataframe API?
Answer:
There is a comparative study done by Hortonworks (source).
The gist is that which one to use depends on the situation / scenario; each can be the right choice, and there is no hard and fast rule to decide this. Please see below.
RDDs, DataFrames and SparkSQL (in fact, all three are relevant, not only the two you asked about):
At its core, Spark uses the concept of Resilient Distributed Datasets, or RDDs (a short sketch follows this list):
- Resilient - if data in memory is lost, it can be recreated
- Distributed - an immutable, distributed collection of objects in memory, partitioned across many data nodes in a cluster
- Dataset - the initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
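A minimal sketch of working with an RDD (assuming sc is an existing SparkContext; the file path and record layout are hypothetical):

    // Build an RDD from a file on HDFS; its partitions are spread across the cluster
    val ordersRdd = sc.textFile("hdfs:///data/raw/orders")

    // Transformations are lazy; the recorded lineage lets lost partitions be recomputed
    val completedCount = ordersRdd
      .map(_.split(","))
      .filter(fields => fields(3) == "COMPLETE")
      .count()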
The DataFrame API is a data abstraction framework that organizes your data into named columns (a short sketch follows this list):
- Create schema for data
- Conceptually equivalent to a table in a relational database
- Can be generated from many sources including structured data files, tables in Hive, external databases, or existing RDDs
- Provides a relational data view for simple SQL operations such as data manipulation and aggregation
- Under the hood, it is built on top of RDDs
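For example, a DataFrame can be created from several kinds of sources (a sketch reusing the sqlContext and ordersRdd from the snippets above; paths, table and case class names are assumptions):

    import sqlContext.implicits._

    // From a Hive table
    val hiveDf = sqlContext.table("raw_db.orders")

    // From a structured data file (JSON here)
    val jsonDf = sqlContext.read.json("hdfs:///data/stage/orders.json")

    // From an existing RDD, by giving it a schema via a case class
    case class Order(orderId: Long, productName: String)
    val rddDf = ordersRdd
      .map(_.split(","))
      .map(f => Order(f(0).toLong, f(1)))
      .toDF()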
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL via:
- SQL
- DataFrames API
- Datasets API (sketched briefly after this list)
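The SQL and DataFrame entry points are sketched above; the Dataset API (experimental in Spark 1.6) adds compile-time types on top of a DataFrame, roughly like this (reusing the hypothetical Order case class from the previous snippet):

    // Map the DataFrame columns onto the typed Order case class
    val ordersDs = sqlContext.table("raw_db.orders")
      .select($"order_id".cast("long").as("orderId"), $"product_name".as("productName"))
      .as[Order]

    // Operate on typed objects instead of untyped Rows
    val bigOrderProducts = ordersDs.filter(o => o.orderId > 1000000L).map(_.productName)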
Test results:
- RDDs outperformed DataFrames and SparkSQL for certain types of data processing
- DataFrames and SparkSQL performed almost identically, although SparkSQL had a slight advantage for the analyses involving aggregation and sorting
- Syntactically, DataFrames and SparkSQL are much more intuitive than using RDDs
- The best of 3 runs was taken for each test
- Timings were consistent, with no big differences between runs
- Jobs were run individually, with no other jobs running
- The workloads were a random lookup of 1 order ID out of 9 million unique order IDs, and grouping all the different products with their total counts, sorted descending by product name (both are sketched below)
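Expressed in both styles, those two workloads look roughly like this (a sketch; the orders table and column names are assumptions):

    // Random lookup of a single order ID
    val lookupSql = sqlContext.sql("SELECT * FROM raw_db.orders WHERE order_id = 3568555")
    val lookupDf  = sqlContext.table("raw_db.orders").filter($"order_id" === 3568555)

    // Group all products with their total counts, sorted descending by product name
    val countsSql = sqlContext.sql(
      "SELECT product_name, COUNT(*) AS cnt FROM raw_db.orders " +
      "GROUP BY product_name ORDER BY product_name DESC")
    val countsDf = sqlContext.table("raw_db.orders")
      .groupBy("product_name")
      .count()
      .orderBy($"product_name".desc)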
If a query is lengthy, writing and running it efficiently as one long SQL string becomes impractical. The DataFrame API, together with the Column API, on the other hand, helps the developer write compact code, which is ideal for ETL applications.
In addition, all operations (e.g. greater than, less than, select, where, etc.) performed through the DataFrame API build an "Abstract Syntax Tree (AST)", which is then passed to Catalyst for further optimization. (Source: Spark SQL documentation, Section 3.3)
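You can observe this yourself: however the query is written, explain(true) prints the logical plan Catalyst built and optimized along with the chosen physical plan (a sketch, continuing with the hypothetical orders table):

    val filtered = sqlContext.table("raw_db.orders")
      .where($"order_amount" > 100 && $"order_amount" < 500)
      .select("order_id", "product_name")

    // Prints the parsed, analyzed and optimized logical plans plus the physical plan;
    // an equivalent sqlContext.sql(...) query produces the same optimized plan
    filtered.explain(true)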