Apply PCA and keep a given percentage of the total variance
I want to perform a principal component analysis on a specific dataset and then pass the principal components to a LogisticRegression classifier.
Specifically, I want to apply PCA and keep 90% of the total variance using the function computePrincipalComponentsAndExplainedVariance.
Here's the code to read the dataset:
// Load the data
val text = sparkSession.sparkContext.textFile("dataset.data")
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (column 57) and features (columns 0 to 56)
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))
I'm not sure how to run the PCA so that it retains 90% of the total variance.
The function computePrincipalComponentsAndExplainedVariance returns a matrix together with a vector whose values indicate the proportion of variance explained by each principal component. From the documentation:
Returns: an n-by-k matrix whose columns are the principal components, and a vector of values indicating how much variance each principal component explains
If you pass a large enough k as input, you can accumulate the values in the vector until the running sum reaches 90% or more, and then use that many leading columns of the matrix.
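As a minimal sketch of that summing step: the helper below takes the explained-variance values as a plain array and returns how many leading components are needed to reach a target fraction. The names PcaVariance and componentsForVariance are hypothetical, and the sketch assumes the values are fractions of total variance in descending order, which is how the returned vector is expected to be laid out.

```scala
object PcaVariance {
  // Smallest number of leading components whose cumulative explained
  // variance reaches `target` (e.g. 0.9 for 90%).
  // Assumption: `explained` holds per-component variance fractions in
  // descending order, e.g. the returned vector converted with .toArray.
  def componentsForVariance(explained: Array[Double], target: Double): Int = {
    // Running totals: cumulative(i) = explained(0) + ... + explained(i)
    val cumulative = explained.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ >= target)
    // If the target is never reached, k was too small: keep all components
    if (idx >= 0) idx + 1 else explained.length
  }
}
```

With Spark this might be used roughly as follows, assuming rows is an RDD[Vector] of the 57 feature columns: call new RowMatrix(rows).computePrincipalComponentsAndExplainedVariance(57) to get the pair (pc, explained), then compute PcaVariance.componentsForVariance(explained.toArray, 0.9). One simple way to project the data afterwards is to call computePrincipalComponents again with the resulting count and multiply the RowMatrix by that smaller matrix.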