Apply PCA and keep a given percentage of the total variance

I want to perform a principal component analysis on a specific dataset and then pass the principal components to a LogisticRegression classifier.

Specifically, I want to apply PCA and keep 90% of the total variance using the function computePrincipalComponentsAndExplainedVariance.


Here's the code to read the dataset:

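// Assumes the RDD-based MLlib API, which RowMatrix also uses
import org.apache.spark.mllib.linalg.Vectors

// Load the data
val text = sparkSession.sparkContext.textFile("")
// Parse each comma-separated line into an array of doubles
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (column 57) and features (the first 57 columns)
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))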


I'm not really sure how to do the PCA to maintain 90% of the total variance.



1 answer

The function computePrincipalComponentsAndExplainedVariance returns a matrix together with a vector whose values indicate the proportion of variance explained by each principal component. From the documentation:

Returns: An n-by-k matrix whose columns are the principal components, and a vector of values indicating how much variance each principal component explains
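
As a rough sketch of what the call might look like, assuming the RDD-based MLlib API: the feature vectors are wrapped in a RowMatrix, which exposes the method. The helper name pcaWithVariance and its parameter names are made up for illustration; with the question's data, features would be something like dataLP.map(_._2).

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper: `features` holds only the feature vectors (no labels).
// Returns the n-by-k principal-component matrix and a length-k vector with
// the share of total variance explained by each component.
def pcaWithVariance(features: RDD[Vector], k: Int): (Matrix, Vector) = {
  val mat = new RowMatrix(features)
  mat.computePrincipalComponentsAndExplainedVariance(k)
}
```

Here k can be at most the number of feature columns (57 in the question), so passing the full column count gives the complete variance breakdown.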

Call it with a sufficiently large k, then add up the values in the vector, in order, until the running total reaches 90% (0.9) or more. The number of values you had to add is the number of leading columns of the matrix to keep.
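
The cutoff step can be sketched in plain Scala. The helper componentsForVariance is a hypothetical name and the sample ratios are invented for illustration; in practice the input would be the explained-variance vector converted with toArray.

```scala
// Hypothetical helper: smallest number of leading components whose
// explained-variance ratios sum to at least `threshold`.
// Falls back to all components if the threshold is never reached.
def componentsForVariance(ratios: Array[Double], threshold: Double): Int = {
  // Running total after each component, e.g. 0.5, 0.8, 0.95, ...
  val cumulative = ratios.scanLeft(0.0)(_ + _).tail
  val idx = cumulative.indexWhere(_ >= threshold)
  if (idx >= 0) idx + 1 else ratios.length
}

// Invented ratios: the first three components are needed to reach 90%
val m = componentsForVariance(Array(0.5, 0.3, 0.15, 0.05), 0.9)
```

With Spark, one way to keep only those m columns relies on MLlib dense matrices being stored column-major: the first m columns of the component matrix pc are the first pc.numRows * m entries of pc.toArray, which can be wrapped with Matrices.dense(pc.numRows, m, ...) and passed to RowMatrix.multiply to project the data before training the classifier.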


