Principal covariates regression (PCovR) in high-dimensional settings

I would like to use principal covariates regression in high-dimensional settings where I have more explanatory variables (J) than observations (N). I came across the R package "PCovR" (available on CRAN, with an accompanying article in the Journal of Statistical Software). This package works great in low-dimensional settings.

However, the package does not work in high-dimensional settings. To reproduce the problem, you can run the following minimal working example:

# Load package
library(PCovR)    

# Fix random number generator
set.seed(1)

# Generate X: random standard normal matrix with J=200 explanatory variables and N=100 observations
x <- matrix(rnorm(n=20000, mean=0, sd=1), nrow=100, ncol=200); dim(x)

# Generate Y: random standard normal vector with N=100 observations
y <- rnorm(n=100, mean=0, sd=1)  

# Run PCovR
pcovr.fit <- pcovr(X=x, Y=y, modsel="seq")

      

This gives the following error:

R> Error in Vminc[k] = which.min(A[, k]) : replacement has length zero 

      

To tune the number of components (R) and the weighting parameter (alpha), the package offers a fast sequential model selection procedure based on maximum likelihood (modsel="seq") and a computationally demanding simultaneous procedure based on a grid search with cross-validation (modsel="sim").

The source of the problem in high dimensions is that the ratio argument (which is computed by default by the ErrorRatio function when modsel="seq") cannot be determined, because the ErrorRatio function performs a linear regression, which breaks down when J > N. A workable but suboptimal solution is the simultaneous procedure with the ratio argument specified up front (it is not used in the simultaneous procedure anyway):

pcovr.fit <- pcovr(X=x, Y=y, modsel="sim", ratio=1)

      

But this is extremely computationally demanding. Any ideas, hints or suggestions on how I can get PCovR up and running in high-dimensional settings?
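To illustrate why a linear regression is problematic when J > N, here is a minimal sketch in base R. It is not the code of the ErrorRatio function, only an ordinary least-squares fit on data of the same shape as in the example above. With more predictors than observations the fit is saturated: many coefficients cannot be estimated, no residual degrees of freedom remain, and the residuals are numerically zero, so an error variance for Y cannot be estimated from them.

```r
# Illustrative sketch only: NOT the internals of PCovR's ErrorRatio function.
set.seed(1)
N <- 100  # observations
J <- 200  # explanatory variables
x <- matrix(rnorm(N * J), nrow = N, ncol = J)
y <- rnorm(N)

# Ordinary least squares of y on all J predictors (plus an intercept)
fit <- lm(y ~ x)

n_undetermined <- sum(is.na(coef(fit)))     # coefficients lm() could not estimate
resid_df       <- df.residual(fit)          # residual degrees of freedom
max_abs_resid  <- max(abs(residuals(fit)))  # largest residual in absolute value

print(c(undetermined = n_undetermined, resid_df = resid_df))
print(max_abs_resid)
```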

1 answer


A package update (version 2.7) was released on CRAN earlier this week. In this newer version, the "ratio" parameter is set to 1 by default in high-dimensional settings. Of course, you can specify another ratio, but with standardized data, 200 predictors (J) and only 1 criterion (K), this will change the resulting alpha value only slightly, since the maximum likelihood value of alpha is obtained using the following formula (for standardized data):

alpha <- J/(J+K*ratio)
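As a quick check of how little the ratio matters here, the formula can be evaluated for J = 200 and K = 1 over a few ratio values. The ratio values below are hypothetical and chosen only for illustration; they do not come from the package.

```r
# Formula from the answer (standardized data): alpha = J / (J + K * ratio)
J <- 200                  # number of predictors
K <- 1                    # number of criteria
ratio <- c(0.5, 1, 2, 5)  # hypothetical ratio values, for illustration only

alpha <- J / (J + K * ratio)
print(round(alpha, 4))
# 0.9975 0.9950 0.9901 0.9756
```

Even a tenfold change in the ratio moves alpha by only about 0.02 in this setting.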

      



Another option is to look for a different proxy for the error in your data, but in this particular situation you will probably still end up with an alpha value of about .99. However, it would be interesting to investigate the effect of choosing a different alpha value on the resulting solution.
