How do I split columns into two sets for each type?

I have a CSV input file, which I read with the following:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)


This reads the data cleanly and infers the schema.

The next step is to split the columns into String and Integer columns. How?

Here is the schema of my dataset:

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)


I would like to split this into two variables (StringCols, IntCols) where:

  • StringCols must have "First Name", "Last Name", "Dept"
  • IntCols must have "ID", "Age", "DailyRate", "DistanceFromHome"

Here's what I've tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)


Now I would like to loop over types, find every StringType, and look up the corresponding column name in names, and similarly for IntegerType.
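That loop can be sketched in plain Scala, without a Spark session, for illustration. The (name, type-string) pairs below are hard-coded to mirror the schema above; in real code they would come from names zipped with types, or from rawdata.dtypes:

```scala
// Hard-coded (name, type-string) pairs mirroring the schema above;
// in real code these would come from rawdata.dtypes.
val dtypes = Seq(
  ("ID", "IntegerType"), ("First Name", "StringType"),
  ("Last Name", "StringType"), ("Age", "IntegerType"),
  ("DailyRate", "IntegerType"), ("Dept", "StringType"),
  ("DistanceFromHome", "IntegerType"))

// partition splits the pairs into the two groups in one pass;
// then keep only the names.
val (stringPairs, intPairs) = dtypes.partition { case (_, t) => t == "StringType" }
val stringCols = stringPairs.map(_._1)
val intCols    = intPairs.map(_._1)

println(stringCols) // List(First Name, Last Name, Dept)
println(intCols)    // List(ID, Age, DailyRate, DistanceFromHome)
```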



2 answers


You can filter your columns by type using the schema and dataType:



import org.apache.spark.sql.types.{IntegerType, StringType}

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)




Use the dtypes operator:

dtypes: Array[(String, String)] Returns all column names and their data types as an array.

This will give you a more idiomatic way to work with the dataset schema.

val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)


I want to split this into two variables (StringCols, IntCols)



(I would rather use immutable values instead, if you don't mind.)

val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) {
  case ((strings, ints), (name, "StringType"))  => (name +: strings, ints)
  case ((strings, ints), (name, "IntegerType")) => (strings, name +: ints)
}


StringCols should have "First Name", "Last Name", "Dept", and IntCols should have "ID", "Age", "DailyRate", "DistanceFromHome"

You could reverse the collections to restore schema order, but I would rather not, since it is costly and gives you nothing in return.
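To make the ordering point concrete, here is a small plain-Scala sketch (a hypothetical three-column input, no Spark) comparing prepending with +: against appending with :+ in the fold:

```scala
// Hypothetical (name, type) pairs in schema order.
val typed = Seq(("First Name", "StringType"),
                ("Last Name", "StringType"),
                ("Dept", "StringType"))

// Prepending (+:) is O(1) per step but yields reverse schema order.
val prepended = typed.foldLeft(Seq.empty[String]) { case (acc, (n, _)) => n +: acc }

// Appending (:+) keeps schema order, at the cost of O(n) per step on a List.
val appended = typed.foldLeft(Seq.empty[String]) { case (acc, (n, _)) => acc :+ n }

println(prepended) // List(Dept, Last Name, First Name)
println(appended)  // List(First Name, Last Name, Dept)
```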







