LDA Cross Validator

I want to cross-validate the LDA algorithm to determine the number of topics (K). My question is about the evaluator, since I want to use the log likelihood as the metric. What should I pass to .setEvaluator(????) when creating the CrossValidator?

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Define a simple LDA
val lda = new LDA()
  .setMaxIter(10)
  .setFeaturesCol("features")

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// Spark's LDA requires k > 1, so the range starts at 2.
val range = 2 to 20
val paramGrid = new ParamGridBuilder()
  .addGrid(lda.k, range.toArray)
  .build()

// Create a CrossValidator
val cv = new CrossValidator()
  .setEstimator(lda)
  .setEvaluator(????)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

      



1 answer


Cross-validation is not easy to apply when you are effectively doing unsupervised learning. If you don't have labeled training data, the interfaces provided by CrossValidator are unlikely to be relevant. The fact that you are trying different values of k, the number of topics generated by LDA, suggests that you may not have this kind of labeled training data.



If you try to repurpose CrossValidator, I don't think there are any suitable Evaluators (at least as of Spark 2.2). If you are exploring models of different dimensions (for example, by changing the number of topics, k), then the log probability of the data is not straightforward to compare between models that have different dimensions. For example, as the number of topics increases, you expect the likelihood of the data to increase, but at the risk of overfitting. One standard approach is to use something like the Akaike information criterion to penalize models that are more complex (e.g. larger k). Again, I don't think this is currently supported in CrossValidator.
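As a workaround, the search over k can be done by hand instead of through CrossValidator. The following is only a minimal sketch under some assumptions: `scoreTopicCounts` and `aic` are hypothetical helper names, the input DataFrame is assumed to have a "features" column of term-count vectors, and the parameter count used for the AIC (k times the vocabulary size) is a rough approximation, not something Spark provides. Note also that `logLikelihood` in Spark returns a variational lower bound, so the AIC value here is approximate.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.DataFrame

// AIC = 2 * (number of free parameters) - 2 * (log likelihood); lower is better.
def aic(logLikelihood: Double, numParams: Long): Double =
  2.0 * numParams - 2.0 * logLikelihood

// Hypothetical helper: fit one LDA model per candidate k and return
// (k, held-out log perplexity, approximate AIC on the training split).
def scoreTopicCounts(
    data: DataFrame,
    ks: Seq[Int],
    seed: Long = 42L): Seq[(Int, Double, Double)] = {
  // Single train / held-out split; repeat with other seeds for a k-fold-like estimate.
  val Array(train, heldOut) = data.randomSplit(Array(0.8, 0.2), seed)
  train.cache()
  heldOut.cache()

  ks.map { k =>
    val model = new LDA()
      .setK(k)
      .setMaxIter(10)
      .setFeaturesCol("features")
      .setSeed(seed)
      .fit(train)

    // Bound on log perplexity of unseen documents; lower is better.
    val heldOutPerplexity = model.logPerplexity(heldOut)

    // ASSUMPTION: roughly k * vocabSize free parameters (the topic-word matrix).
    val approxAic = aic(model.logLikelihood(train), k.toLong * model.vocabSize)

    (k, heldOutPerplexity, approxAic)
  }
}
```

With a DataFrame `df` of vectorized documents, `scoreTopicCounts(df, 2 to 20).minBy(_._2)` would pick the k with the lowest held-out log perplexity. Scoring on held-out data already guards against overfitting to some degree, which is why the AIC is computed separately on the training split as a second opinion.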
