How do I split columns into two sets for each type?

I have a CSV input file, which I read with the following:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)


This reads the data cleanly and infers the schema.

The next step is to split the columns into String and Integer columns. How?

Here is the schema of my dataset:

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)


I would like to split this into two variables (StringCols, IntCols) where:

  • StringCols must have "First Name", "Last Name", "Dept"
  • IntCols must have "ID", "Age", "DailyRate", "DistanceFromHome"

Here's what I've tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)


Now I would like to loop over types, find every StringType, and look up the corresponding column name in names, and similarly for IntegerType.
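That loop can be sketched in plain Scala, without a Spark session, for illustration. The (name, type-string) pairs below are hard-coded to mirror the schema above; in real code they would come from names zipped with types, or from rawdata.dtypes:

```scala
// Hard-coded (name, type-string) pairs mirroring the schema above;
// in real code these would come from rawdata.dtypes.
val dtypes = Seq(
  ("ID", "IntegerType"), ("First Name", "StringType"),
  ("Last Name", "StringType"), ("Age", "IntegerType"),
  ("DailyRate", "IntegerType"), ("Dept", "StringType"),
  ("DistanceFromHome", "IntegerType"))

// partition splits the pairs into the two groups in one pass;
// then keep only the names.
val (stringPairs, intPairs) = dtypes.partition { case (_, t) => t == "StringType" }
val stringCols = stringPairs.map(_._1)
val intCols    = intPairs.map(_._1)

println(stringCols) // List(First Name, Last Name, Dept)
println(intCols)    // List(ID, Age, DailyRate, DistanceFromHome)
```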



2 answers


You can filter your columns by type using the schema and dataType:



import org.apache.spark.sql.types.{IntegerType, StringType}

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)




Use the dtypes operator:

dtypes: Array[(String, String)] Returns all column names and their data types as an array.

This will give you a more idiomatic way to work with the dataset schema.

val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)


I want to split this into two variables (StringCols, IntCols)



(I would rather use immutable values instead, if you don't mind.)

val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) {
  case ((strings, ints), (name, "StringType"))  => (name +: strings, ints)
  case ((strings, ints), (name, "IntegerType")) => (strings, name +: ints)
}


StringCols should have "First Name", "Last Name", "Dept", and IntCols should have "ID", "Age", "DailyRate", "DistanceFromHome"

You could reverse the collections to restore schema order, but I would rather not, since it is costly and gives you nothing in return.
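To make the ordering point concrete, here is a small plain-Scala sketch (a hypothetical three-column input, no Spark) comparing prepending with +: against appending with :+ in the fold:

```scala
// Hypothetical (name, type) pairs in schema order.
val typed = Seq(("First Name", "StringType"),
                ("Last Name", "StringType"),
                ("Dept", "StringType"))

// Prepending (+:) is O(1) per step but yields reverse schema order.
val prepended = typed.foldLeft(Seq.empty[String]) { case (acc, (n, _)) => n +: acc }

// Appending (:+) keeps schema order, at the cost of O(n) per step on a List.
val appended = typed.foldLeft(Seq.empty[String]) { case (acc, (n, _)) => acc :+ n }

println(prepended) // List(Dept, Last Name, First Name)
println(appended)  // List(First Name, Last Name, Dept)
```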







