ML Pipeline for Scala Spark

Question

I have a dataframe (df) with the following structure:

Data

label pa_age pa_gender_category
10000 32.0   male
25000 36.0   female
45000 68.0   female
15000 24.0   male

Purpose

I want to build a RandomForest classifier for the "label" column, using the "pa_age" and "pa_gender_category" columns as features.

Sequential process

// Transform the label column into a label index
val labelIndexer = new StringIndexer().setInputCol("label")
  .setOutputCol("indexedLabel").fit(df)

// Transform the pa_gender_category column into a numeric index
val featureTransformer = new StringIndexer().setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label").fit(df)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

Expected output from above steps:

label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0   male               1.0          1.0
25000 36.0   female             2.0          2.0
45000 68.0   female             3.0          2.0
10000 24.0   male               1.0          1.0

Now I need the data in 'label' and 'features' format:

val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)

Pipeline

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))

Problem

error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
       val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)

Basically, the step I am stuck on is converting the data into label and features format.
Is my process / pipeline correct here?

scala apache-spark

Anubhav Dikshit Apr 27. 17 at 6:15 am

1 answer

Jozef Dúc · Accepted Answer · 2017-04-27T08:07:11+0000

The problem is here:

val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)

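You cannot call fit(df) here because VectorAssembler is a Transformer, not an Estimator, so it has no fit method. Remove .fit(df) from it. You can also remove .fit(df) from the StringIndexer stages, because the pipeline fits every estimator stage itself; IndexToString, like VectorAssembler, is a plain transformer and is never fitted. After creating the pipeline, call fit on the pipeline object: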

val model = pipeline.fit(df)

The pipeline now goes through every algorithm you provide it.
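
Putting the pieces together, the corrected pipeline might look like the sketch below. It is not the asker's exact code: the local SparkSession and the inline DataFrame are assumptions added so the example runs on its own (in spark-shell, `spark` already exists), the assembler reads "pa_gender_category_label" as in the error message, and the classifier's features column is set to "features" because no stage produces the "indexedFeatures" column named in the question. It targets the Spark 2.x API used in the question; on Spark 3.x, `labelsArray(0)` is the preferred accessor over `labels`.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rf-pipeline-sketch")
  .getOrCreate()
import spark.implicits._

// The sample rows from the question.
val df = Seq(
  (10000, 32.0, "male"),
  (25000, 36.0, "female"),
  (45000, 68.0, "female"),
  (15000, 24.0, "male")
).toDF("label", "pa_age", "pa_gender_category")

// Fitted up front so that `labels` is available to IndexToString below.
// A fitted StringIndexerModel is still a valid pipeline stage.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Left unfitted: pipeline.fit fits this estimator.
val featureTransformer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")

// VectorAssembler is a Transformer, so there is no fit call.
// It reads the numeric index column, not the raw string column.
val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category_label"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(10)

// Maps predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(
  Array(labelIndexer, featureTransformer, featureCreater, rf, labelConverter))

// Fits every estimator stage in order and returns a PipelineModel.
val model = pipeline.fit(df)
val predictions = model.transform(df)
predictions.select("label", "features", "predictedLabel").show()
```

With only four rows the predictions themselves are not meaningful; the point is that every stage's input column is produced by an earlier stage.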

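Note that an unfitted StringIndexer has no labels property; only the fitted StringIndexerModel returned by fit has one. getOutputCol returns the name of the output column, not the label values, so it cannot be passed to setLabels. If IndexToString needs the labels, keep .fit(df) on labelIndexer (a fitted model is still a valid pipeline stage) and pass labelIndexer.labels as before.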
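
To make the difference between the estimator and its fitted model concrete, here is a minimal sketch. The local SparkSession and the one-column DataFrame are assumptions for illustration only; the label values are the ones from the question.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("string-indexer-labels-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(10000, 25000, 45000, 15000).toDF("label")

// The unfitted estimator: it knows its column names but not the label values.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// indexer.labels  // would not compile: StringIndexer has no `labels` member

// getOutputCol returns only the output column's name.
val outputColumn: String = indexer.getOutputCol

// The label values exist only on the fitted model.
val indexerModel: StringIndexerModel = indexer.fit(df)
val labels: Array[String] = indexerModel.labels

println(s"output column: $outputColumn")
println(s"labels: ${labels.mkString(", ")}")
```

The order of the labels follows the indexer's frequency ordering, so with equally frequent values it should not be relied on.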
