Check column datatype and execute SQL for Integer and Decimal only in Spark SQL

I am trying to check the data type of each column read from a Parquet input file and, if the type is integer or decimal, run a Spark SQL query for that column.

// get the Array of StructFields
val datatypes = parquetRDD_subset.schema.fields

// check the data type of each column
for (val_datatype <- datatypes) if (val_datatype.dataType.typeName == "integer" || val_datatype.dataType.typeName.contains("decimal")) {
  // get the field names
  val x = parquetRDD_subset.schema.fieldNames

  val dfs = x.map(field => spark.sql(s"select 'DataProfilerStats' as Table_Name, (SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable) as Percentage_Unique_Value from parquetDFTable"))
}

The data type check itself works, but inside the for loop the query is not restricted to the matched columns: after fetching the field names, the map runs the query for every column of the table, including strings. How can I limit it to only the integer and decimal fields?

+3




2 answers


This is how you can filter the columns of integer and double type:

import org.apache.spark.sql.types.{IntegerType, DoubleType}
import org.apache.spark.sql.functions.col

// filter the columns by data type
val columns = df.schema.fields.filter(x => x.dataType == IntegerType || x.dataType == DoubleType)

// use the filtered fields with select
df.select(columns.map(x => col(x.name)): _*)
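If you then need the uniqueness percentage from the question, the filtered fields can be mapped onto the same SQL template. A rough sketch along those lines, reusing the parquetDFTable view and query from the question and adding DecimalType to the filter (untested outline):

import org.apache.spark.sql.types.{IntegerType, DoubleType, DecimalType}

// keep only the names of integer, double and decimal columns
val numericNames = df.schema.fields
  .filter(f => f.dataType == IntegerType || f.dataType == DoubleType || f.dataType.isInstanceOf[DecimalType])
  .map(_.name)

// build one profiling query per numeric column, as in the question
val dfs = numericNames.map { field =>
  spark.sql(s"select 'DataProfilerStats' as Table_Name, (SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable) as Percentage_Unique_Value from parquetDFTable")
}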

Hope this helps!

+5




Please try the following:



import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.approx_count_distinct

// collect an approx_count_distinct expression for every integer or decimal field
val names = df.schema.fields.collect { 
  case StructField(name, _: DecimalType, _, _) => approx_count_distinct(name)
  case StructField(name, IntegerType, _, _)    => approx_count_distinct(name)
}

spark.table("parquetDFTable").select(names: _*)


+1








