Check column data type and run Spark SQL only for integer and decimal columns
I am trying to check the data type of each column of a Parquet input file. If the type is integer or decimal, a Spark SQL query should run for that column.
// Get the Array[StructField] for the schema
val datatypes = parquetRDD_subset.schema.fields

// Check the data type of each column
for (val_datatype <- datatypes)
  if (val_datatype.dataType.typeName == "integer" || val_datatype.dataType.typeName.contains("decimal")) {
    // Get the field names (this returns the names of ALL columns, not just the matched one)
    val x = parquetRDD_subset.schema.fieldNames
    val dfs = x.map(field => spark.sql(s"select 'DataProfilerStats' as Table_Name, (select 100 * approx_count_distinct($field) / count(1) from parquetDFTable) as Percentage_Unique_Value from parquetDFTable"))
  }
The problem is that although the data type check itself succeeds, the loop body does not actually restrict the columns to integers and decimals: the field names are fetched from the whole schema, so the query runs for every column type, even strings. How can I get only the fields that are integer or decimal?
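For reference, the mismatch can be reproduced with a minimal, hypothetical schema: schema.fieldNames always returns every column name, regardless of any filtering done over schema.fields, whereas the name of the matched field is available on the loop variable itself (val_datatype.name).

import org.apache.spark.sql.types._

// Hypothetical two-column schema: one integer column, one string column.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("label", StringType)))

schema.fieldNames                                            // Array(id, label) -- always all columns
schema.fields.filter(_.dataType == IntegerType).map(_.name)  // Array(id) -- only the integer column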
Answer:
Please try the following:
import org.apache.spark.sql.functions.approx_count_distinct
import org.apache.spark.sql.types._

val df = spark.table("parquetDFTable")

// Collect an approx_count_distinct aggregate for every integer or decimal column;
// columns of any other type simply don't match and are skipped.
val names = df.schema.fields.collect {
  case StructField(name, IntegerType, _, _)    => approx_count_distinct(name)
  case StructField(name, _: DecimalType, _, _) => approx_count_distinct(name)
}

df.select(names: _*)
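If you also need the Percentage_Unique_Value metric from the question, the same pattern match can feed a single aggregation instead of one SQL query per column. A sketch under the question's setup (parquetDFTable is the registered temp view; the output aliases are illustrative):

import org.apache.spark.sql.functions.{approx_count_distinct, count, lit}
import org.apache.spark.sql.types._

val df = spark.table("parquetDFTable")

// Keep only the integer/decimal column names, then build one
// "100 * approx distinct count / row count" expression per column.
val metrics = df.schema.fields.collect {
  case StructField(name, IntegerType, _, _)    => name
  case StructField(name, _: DecimalType, _, _) => name
}.map { name =>
  (approx_count_distinct(name) * 100 / count(lit(1))).as(s"${name}_Percentage_Unique_Value")
}

// One row: the table label plus one percentage column per numeric field.
df.select(lit("DataProfilerStats").as("Table_Name") +: metrics: _*).show()

Computing everything in one select also means Spark scans the table once, rather than once per column as in the original per-field loop.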