Check column datatype and execute SQL for Integer and Decimal only in Spark SQL

I am trying to check the data type of each column read from a Parquet input file and, if the type is integer or decimal, run a Spark SQL query for that column.

// get the Array of StructFields
val datatypes = parquetRDD_subset.schema.fields

// check the data type of each column
for (val_datatype <- datatypes) if (val_datatype.dataType.typeName == "integer" || val_datatype.dataType.typeName.contains("decimal")) {
  // get the field names
  val x = parquetRDD_subset.schema.fieldNames

  val dfs = x.map(field => spark.sql(s"select 'DataProfilerStats' as Table_Name, (SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable) as Percentage_Unique_Value from parquetDFTable"))
}

The data type check itself works, but inside the for loop the query is not restricted to the matched columns: after fetching the field names, the map runs the query for every column of the table, including strings. How can I limit it to only the integer and decimal fields?

+3




2 answers


This is how you can filter the columns of integer and double type:

import org.apache.spark.sql.types.{IntegerType, DoubleType}
import org.apache.spark.sql.functions.col

// filter the columns by data type
val columns = df.schema.fields.filter(x => x.dataType == IntegerType || x.dataType == DoubleType)

// use the filtered fields with select
df.select(columns.map(x => col(x.name)): _*)
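If you then need the uniqueness percentage from the question, the filtered fields can be mapped onto the same SQL template. A rough sketch along those lines, reusing the parquetDFTable view and query from the question and adding DecimalType to the filter (untested outline):

import org.apache.spark.sql.types.{IntegerType, DoubleType, DecimalType}

// keep only the names of integer, double and decimal columns
val numericNames = df.schema.fields
  .filter(f => f.dataType == IntegerType || f.dataType == DoubleType || f.dataType.isInstanceOf[DecimalType])
  .map(_.name)

// build one profiling query per numeric column, as in the question
val dfs = numericNames.map { field =>
  spark.sql(s"select 'DataProfilerStats' as Table_Name, (SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable) as Percentage_Unique_Value from parquetDFTable")
}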

Hope this helps!

+5




Please try the following:



import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.approx_count_distinct

// collect an approx_count_distinct expression for every integer or decimal field
val names = df.schema.fields.collect { 
  case StructField(name, _: DecimalType, _, _) => approx_count_distinct(name)
  case StructField(name, IntegerType, _, _)    => approx_count_distinct(name)
}

spark.table("parquetDFTable").select(names: _*)


+1








