Spark MLlib: How to Ignore Features When Training a Classifier
I would like to train a classifier on an RDD[LabeledPoint] using only a subset of the features in each LabeledPoint (both to set up the model quickly and so that each LabeledPoint can carry elements, such as IDs or scores, that are not features). I have searched the documentation and cannot find a way to specify which columns should be included or ignored. The code is below; I am using Spark and MLlib 1.3.1 with Scala 2.10.4.
If it is not possible to exclude a specific feature, it would still be useful to attach an identifier to every data point and have it ignored during training. Any help is appreciated!
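import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2
// Feature index 5 is categorical with 2 categories; all other features are continuous
val categoricalFeaturesInfo = Map[Int, Int](5 -> 2)
val numTrees = 100
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 6
val maxBins = 20
// trainingData is an RDD[LabeledPoint]
val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)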
Do you want to select a subset of features before building the model, or do you want a custom strategy that the RandomForest classifier applies between iterations?
If it is the first case, you can transform trainingData with a map transformation before building the model, so that each LabeledPoint contains only the features you want to train on.
See the section on Feature Selection in MLlib - Feature Extraction and Transformation for examples of feature selection.
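As a minimal sketch of the map approach, assuming trainingData is an RDD[LabeledPoint] with dense-convertible features: the helper names selectFeatures, project and dropIds below are hypothetical and not part of MLlib, and the index list passed as keep is only an example. The last helper assumes the IDs are stored next to each point in a tuple rather than inside the feature vector, which is one way to carry an identifier that training never sees.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: build a new LabeledPoint that keeps only the
// features at the given indices, in the order the indices are listed.
def selectFeatures(point: LabeledPoint, keep: Array[Int]): LabeledPoint = {
  val values = point.features.toArray
  LabeledPoint(point.label, Vectors.dense(keep.map(i => values(i))))
}

// Hypothetical helper: apply the selection to every point in the RDD.
def project(data: RDD[LabeledPoint], keep: Array[Int]): RDD[LabeledPoint] =
  data.map(p => selectFeatures(p, keep))

// Hypothetical helper: if each point is stored as (id, LabeledPoint),
// drop the ID before training and keep the keyed RDD for later lookups.
def dropIds(data: RDD[(Long, LabeledPoint)]): RDD[LabeledPoint] =
  data.map(_._2)
```

For example, project(trainingData, Array(0, 2, 5)) could be passed to RandomForest.trainClassifier in place of trainingData. Note that feature indices shift after the selection: with that keep list the original feature 5 becomes index 2, so categoricalFeaturesInfo would need to be Map(2 -> 2) rather than Map(5 -> 2).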