Spark MLlib: How to Ignore Features When Training a Classifier

Question


I would like to train a classifier on an RDD[LabeledPoint] using only a subset of the features in each LabeledPoint (both to set up the model quickly and so that each LabeledPoint can carry elements, such as IDs or scores, that are not features). I have searched the documentation and cannot find a way to specify which columns should be included or ignored. The code is below; I am using Spark and MLlib 1.3.1 with Scala 2.10.4.

If it is not possible to exclude a specific feature, it would still be useful to include an identifier with every data point that is ignored during training. Any help is appreciated!

import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2
// Feature index 5 is categorical with 2 categories
val categoricalFeaturesInfo = Map[Int, Int](5 -> 2)
val numTrees = 100
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 6
val maxBins = 20
// trainingData is the RDD[LabeledPoint] described above
val model = RandomForest.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, numTrees,
  featureSubsetStrategy, impurity, maxDepth, maxBins)


machine-learning apache-spark apache-spark-mllib

user1109152 09 June 15 at 17:49


1 answer

vvladymyrov · Answer 1 · 2015-06-10T02:44:07+0000

Do you want to select a subset of features before building the model, or do you want a custom strategy that the RandomForest classifier applies between iterations?

In the first case, you can transform trainingData with a map transformation before building the model.
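A minimal sketch of that map transformation, assuming dense feature vectors. The helper names selectFeatures and selectFeaturesRDD and the parameter keep are hypothetical, introduced here only for illustration; they are not part of MLlib.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: build a new LabeledPoint that keeps only the
// feature columns listed in `keep`, in that order. The label is unchanged.
def selectFeatures(point: LabeledPoint, keep: Array[Int]): LabeledPoint = {
  val all = point.features.toArray
  LabeledPoint(point.label, Vectors.dense(keep.map(i => all(i))))
}

// Hypothetical helper: apply the projection to every point in the RDD.
def selectFeaturesRDD(data: RDD[LabeledPoint], keep: Array[Int]): RDD[LabeledPoint] =
  data.map(p => selectFeatures(p, keep))
```

With the question's data this might be called as selectFeaturesRDD(trainingData, Array(0, 2, 5)), where the column list is an arbitrary example. Note that feature indices shift after the projection, so categoricalFeaturesInfo has to refer to the new positions: if original column 5 ends up third in the reduced vector, the map becomes Map(2 -> 2).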

See the section on Feature Selection in MLlib - Feature Extraction and Transformation for examples of feature selection.
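For the other part of the question, carrying an identifier that is ignored during training, one possible approach is to keep the ID outside the LabeledPoint in a pair RDD and drop it just before training. This is a sketch under that assumption; the (Long, LabeledPoint) layout and the helper names dropIds and predictWithIds are hypothetical.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Assumed layout: each record is (id, point); the id never enters the feature vector.
// Hypothetical helper: strip the ids so the result can be passed to trainClassifier.
def dropIds(data: RDD[(Long, LabeledPoint)]): RDD[LabeledPoint] =
  data.map(_._2)

// Hypothetical helper: score each point and re-attach its id to the prediction.
def predictWithIds(model: RandomForestModel,
                   data: RDD[(Long, LabeledPoint)]): RDD[(Long, Double)] =
  data.map { case (id, point) => (id, model.predict(point.features)) }
```

The model only ever sees point.features, so the ID cannot influence training, yet each prediction can still be traced back to its source record.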
