Writing SQL vs using DataFrame APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw and application tiers in HDFS and does CDC (change data capture); it is currently written as Hive queries and run through Oozie. This needs to be migrated to a Spark application (currently on version 1.6). Another section of the code will be carried over later.

In Spark SQL, I can create DataFrames directly from tables in Hive and just execute the queries as they are (for example, sqlContext.sql("my hive hql")). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
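
For example, a rough sketch of the two options (Spark 1.6, Scala); the table and column names below are just placeholders, not my actual schema:

    // Spark 1.6, Scala. Table and column names are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(new SparkConf().setAppName("hql-vs-dataframe"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Option 1: run the existing Hive HQL as-is
    val viaSql = sqlContext.sql(
      "SELECT customer_id, SUM(amount) AS total FROM stage.orders GROUP BY customer_id")

    // Option 2: express the same query with the DataFrame API
    val viaDf = sqlContext.table("stage.orders")
      .groupBy($"customer_id")
      .agg(sum($"amount").as("total"))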

What is the difference between the two approaches?

Are there any performance gains when using the Dataframe API?

Some people have suggested that there is an additional SQL layer the Spark core has to go through when "SQL" queries are used directly, which may have some performance impact, but I have not found any material to support this claim. I know the code would be much more compact with the DataFrame API, but when I already have all the handy HQL queries, is it really worth rewriting the complete code with the DataFrame API?

Thank you.



3 answers


  Question: What is the difference in these two approaches? Are there any performance gains when using the Dataframe API?


Answer:

There is a comparative study done by Hortonworks (source).

The gist is that it depends on your situation and scenario; each approach is right in its own context, and there is no hard and fast rule to decide this. Please read below.

RDDs, DataFrames and SparkSQL (in fact, there are 3 approaches to compare, not just 2):

At its core, Spark uses the concept of Resilient Distributed Datasets, or RDDs:

  • Resilient - if data in memory is lost, it can be recreated
  • Distributed - an immutable, distributed collection of objects in memory, partitioned across many data nodes in a cluster
  • Dataset - the initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
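
A minimal sketch of the RDD API (Scala); the file path and the parsing below are illustrative assumptions, not part of the study:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    // An RDD from a file on HDFS and another from an in-memory collection
    val lines = sc.textFile("hdfs:///data/raw/orders.txt")
    val nums  = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Transformations return new immutable RDDs; actions such as count() trigger computation
    val orderIds = lines.map(_.split(",")(0))
    println(orderIds.count())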


The DataFrames API is a data abstraction framework that organizes your data into named columns:

  • Creates a schema for the data
  • Conceptually equivalent to a table in a relational database
  • Can be constructed from many sources, including structured data files, tables in Hive, external databases, or existing RDDs
  • Provides a relational view of the data for easy SQL-like operations such as data manipulation and aggregation
  • Under the hood, it is an RDD of Row objects
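
A minimal sketch of the DataFrame API (Spark 1.6, Scala); the table, file and column names are assumptions for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dataframe-sketch"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // DataFrames from a Hive table, a structured file, and an existing RDD
    val fromHive = sqlContext.table("app.customers")
    val fromJson = sqlContext.read.json("hdfs:///data/stage/events.json")
    val fromRdd  = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

    // Relational-style operations on named columns
    fromHive.filter($"country" === "US").groupBy($"state").count().show()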

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL via:

  • SQL
  • DataFrames API
  • Datasets API
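
A minimal sketch of the three entry points (Spark 1.6, Scala); the table and column names are assumptions, and the Dataset API was still experimental in 1.6:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    case class Order(product: String, amount: Double)

    val sc = new SparkContext(new SparkConf().setAppName("sparksql-sketch"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.table("app.orders")

    // 1. SQL: register the DataFrame and query it with a SQL string
    df.registerTempTable("orders")
    val bySql = sqlContext.sql("SELECT product, COUNT(*) AS cnt FROM orders GROUP BY product")

    // 2. DataFrames API: the same logic as method calls
    val byDf = df.groupBy($"product").count()

    // 3. Datasets API (experimental in 1.6): typed objects instead of generic rows
    val ds = df.select($"product", $"amount").as[Order]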

Test results:

  • RDDs outperform DataFrames and SparkSQL for certain types of data processing
  • DataFrames and SparkSQL performed almost identically, although SparkSQL had a slight advantage in analyses involving aggregation and sorting
  • Syntactically, DataFrames and SparkSQL are much more intuitive than using RDDs
  • The best of 3 runs was taken for each test
  • Times were consistent, with little variation between tests
  • Jobs were run individually, with no other jobs running

The two test queries were: a random lookup of 1 order ID out of 9 million unique order IDs, and grouping all distinct products with their total counts, sorted descending by product name.
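
Roughly what those two test queries would look like in the DataFrame API (Scala); the table and column names are assumptions, not the ones used in the study:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("benchmark-queries-sketch"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val orders = sqlContext.table("app.orders")

    // Random lookup of a single order ID
    orders.filter($"order_id" === "ORD-0012345").show()

    // Group all distinct products with their counts, sorted descending by product name
    orders.groupBy($"product").count().orderBy($"product".desc).show()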




In your Spark SQL string queries, you will not recognize a syntax error until runtime (which can be costly), whereas in DataFrames, syntax errors can be caught at compile time.
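
A small illustration of that point (Scala); the table name orders is hypothetical, and the failing lines are left commented out:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("error-detection-sketch"))
    val sqlContext = new HiveContext(sc)

    // SQL string: the typo "SELEC" is only detected when the line actually runs (parse error at runtime)
    // sqlContext.sql("SELEC product FROM orders")

    // DataFrame API: a misspelled method is rejected by the Scala compiler before anything runs
    // sqlContext.table("orders").selct("product")   // does not compile: value selct is not a member

    // Caveat: a misspelled column name, e.g. .select("produkt"), is still only caught at
    // analysis time when the plan is resolved, not by the compiler.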





If the query is long, it becomes hard to write and run it efficiently as a SQL string. The DataFrame, on the other hand, together with the Column API, helps the developer write compact code, which is ideal for ETL applications.

In addition, all operations (e.g. greater than, less than, select, where, etc.) performed through the DataFrame build an "Abstract Syntax Tree (AST)", which is then passed to "Catalyst" for further optimization. (Source: Spark SQL documentation, section #3.3)
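
A minimal sketch of how to see that plan (Spark 1.6, Scala); the table and column names are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("catalyst-sketch"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val plan = sqlContext.table("app.orders")
      .where($"amount" > 100)              // comparison and filter expressions become plan nodes
      .select($"order_id", $"amount")

    // explain(true) prints the parsed, analyzed and optimized logical plans plus the physical
    // plan, i.e. what Catalyst produced from the chained DataFrame calls.
    plan.explain(true)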







