Apply PCA and keep a given percentage of the total variance

I want to perform a principal component analysis on a specific dataset and then pass the principal components to a LogisticRegression classifier.

Specifically, I want to apply PCA and keep 90% of the total variance using the function computePrincipalComponentsAndExplainedVariance.

Here's the code to read the dataset:

import org.apache.spark.mllib.linalg.Vectors

// Load the data
val text = sparkSession.sparkContext.textFile("dataset.data")
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (column 57) and features (columns 0 to 56)
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))

I'm not really sure how to apply PCA so that 90% of the total variance is retained.

1 answer


The function computePrincipalComponentsAndExplainedVariance returns a matrix together with a vector whose values indicate the proportion of variance explained by each principal component. From the documentation:

Returns: a matrix of size n-by-k, whose columns are principal components, and a vector of values which indicate how much variance each principal component explains



If you pass a large enough k (at most n, the number of features), you can add up the values in the vector, in order, until the running total reaches 90% or more, and then keep that many leading columns of the matrix.
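A minimal sketch of that approach, assuming the features are available as an RDD[Vector] of mllib vectors (for the question's data, something like dataLP.map(_._2)). The object name PcaByVariance and the helpers componentsNeeded and project are hypothetical names chosen for illustration; only RowMatrix, computePrincipalComponentsAndExplainedVariance, multiply and Matrices.dense come from Spark MLlib.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object PcaByVariance {

  // Smallest number of leading components whose cumulative explained
  // variance reaches `target` (e.g. 0.9). Falls back to all components
  // if the target is never reached.
  def componentsNeeded(explained: Array[Double], target: Double): Int = {
    val cumulative = explained.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ >= target)
    if (idx >= 0) idx + 1 else explained.length
  }

  // Projects `features` onto the leading components that together explain
  // at least `target` of the total variance.
  def project(features: RDD[Vector], target: Double = 0.9): RDD[Vector] = {
    val mat = new RowMatrix(features)
    val n = mat.numCols().toInt
    // Request all n components so the explained-variance vector is complete.
    val (pc, explained) = mat.computePrincipalComponentsAndExplainedVariance(n)
    val k = componentsNeeded(explained.toArray, target)
    // Matrix.toArray is column-major, so the first n * k values
    // are exactly the first k columns of pc.
    val topK: Matrix = Matrices.dense(n, k, pc.toArray.take(n * k))
    mat.multiply(topK).rows
  }
}
```

With the question's data this could be called as PcaByVariance.project(dataLP.map(_._2)); the labels would then have to be re-attached to the projected vectors before training the classifier. Since the input RDD is traversed more than once, caching it first is probably worthwhile.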
