ML Pipeline for Scala Spark
I have a dataframe (df) with the following structure:
Data
label pa_age pa_gender_category
10000 32.0 male
25000 36.0 female
45000 68.0 female
15000 24.0 male
Purpose
I want to build a RandomForest classifier for the "label" column, using the "pa_age" and "pa_gender_category" columns as features.
Sequential process
// Index the label column
val labelIndexer = new StringIndexer().setInputCol("label")
  .setOutputCol("indexedLabel").fit(df)
// Index the pa_gender_category column
val featureTransformer = new StringIndexer().setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label").fit(df)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
Expected output from above steps:
label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0 male 1.0 1.0
25000 36.0 female 2.0 2.0
45000 68.0 female 3.0 2.0
10000 24.0 male 1.0 1.0
Now I need the data in 'label' and 'features' format:
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features").fit(df)
Pipeline
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
  featureCreater, rf, labelConverter))
Problem
error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)
Basically, the step I am stuck on is converting the data into label and features format.
Is my process / pipeline correct here?
The problem is here:
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features").fit(df)
You cannot call fit(df) here because VectorAssembler has no fit method; it is a Transformer, not an Estimator. You can also remove .fit(df) from the StringIndexer stages, because the pipeline fits every estimator stage itself (IndexToString, like VectorAssembler, has no fit method to begin with). After initializing the pipeline, call fit on the pipeline object:
val model = pipeline.fit(df)
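The pipeline then runs every stage you gave it, in order.
Note that an unfitted StringIndexer has no labels property (only the fitted StringIndexerModel returned by fit has one), so labelIndexer.labels stops compiling once .fit(df) is removed from labelIndexer. Use getOutputCol to refer to the indexer's output column instead.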
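Putting these points together, a corrected pipeline might look like the sketch below. It is one possible arrangement, not the only valid one, and it rests on a few assumptions: it is written as top-level statements for spark-shell or a Scala script, it builds its own local SparkSession, the sample rows are retyped from the question with "label" stored as an integer, and the names genderIndexer, assembler and predictions are made up for illustration. The label indexer is deliberately still fitted up front, because IndexToString needs the fitted model's labels. The gender indexer is left for the pipeline to fit. The assembler reads the indexed gender column, and the classifier reads "features", since no step in the question produces a column called "indexedFeatures".

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rf-pipeline-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows from the question (assumption: label is an integer column).
val df = Seq(
  (10000, 32.0, "male"),
  (25000, 36.0, "female"),
  (45000, 68.0, "female"),
  (15000, 24.0, "male")
).toDF("label", "pa_age", "pa_gender_category")

// Fitted outside the pipeline only so that its labels can be passed to IndexToString.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Left unfitted: the pipeline fits this estimator itself.
val genderIndexer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")

// VectorAssembler is a Transformer, so there is no fit call.
// It combines the numeric age with the indexed (numeric) gender column.
val assembler = new VectorAssembler()
  .setInputCols(Array("pa_age", genderIndexer.getOutputCol))
  .setOutputCol("features")

// The classifier reads the columns produced by the stages above.
val rf = new RandomForestClassifier()
  .setLabelCol(labelIndexer.getOutputCol)
  .setFeaturesCol(assembler.getOutputCol)
  .setNumTrees(10)

// Map predicted indices back to the original label values.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(
  Array[PipelineStage](labelIndexer, genderIndexer, assembler, rf, labelConverter))

// fit is called once, on the pipeline; it fits or passes through each stage in order.
val model = pipeline.fit(df)
val predictions = model.transform(df)
predictions.select("label", "features", "predictedLabel").show()
```

Scoring the training rows, as done here, only confirms that the stages are wired together; with four rows it says nothing about model quality.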