How to debug the "contrasts can be applied only to factors with 2 or more levels" error?

Here are all the variables I'm working with:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...

      

Here's the glm I'm trying to fit:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))

      

I know there are a lot of variables (the plan is to reduce them via forward variable selection). But even though there are a lot of variables, they are all either int or Factor, which, as I understand it, should just work with glm. However, every time I try to fit this model I get:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

      

Which kind of looks as if R isn't treating my Factor variables as Factor variables for some reason?

Even something simple:

ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))

      

does not work! (same error message)

Whereas:

ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))

      

does work!

Does anyone know what's going on here? Why can't I fit these Factor variables in my glm?

Thanks in advance!

-Troy

+14



2 answers


Introduction

What a "contrasts error" is has been well explained: you have a factor that has only one level (or fewer). But in reality this simple fact can be easily obscured, because the data actually used for model fitting can be very different from what you passed in. This happens when you have NA in your data, you have subsetted your data, a factor has unused levels, or you have transformed your variables and got NaN somewhere. You are rarely in the ideal situation where a single-level factor can be spotted directly from str(your_data_frame).

Many of the questions on StackOverflow regarding this error are not reproducible, so people's suggestions may or may not work. So although there are 118 posts on this issue at the moment, users still can't find an adaptive solution, and the question is raised again and again. This answer is my attempt to settle the matter "once and for all", or at least to provide a reasonable guide.

This answer is rich in information, so let me make a quick summary first.

I defined 3 helper functions for you: debug_contr_error, debug_contr_error2, NA_preproc.

I recommend that you use them as follows.

  1. run NA_preproc to get more complete cases;
  2. run your model, and if you get a "contrasts error", use debug_contr_error2 for debugging.

Most of the answer shows you step by step how and why these functions are defined. There is probably no harm in skipping that development process, but don't skip the sections from "Reproducible case studies and discussions" onward.


Revised answer

The original answer works perfectly for the OP and has successfully helped some others. But it failed elsewhere for lack of adaptiveness. Look at the output of str(ad.train) in the question. The OP's variables are numeric or factors; there are no characters. The original answer was for this situation. If you have character variables, although they will be coerced to factors during lm and glm fitting, they will not be reported by the code, since they were not provided as factors, so is.factor will miss them. In this expansion, I'll make the original answer more adaptive.

Let dat be your dataset passed to lm or glm. If you do not have such a data frame readily available, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way, but it works.

## 'form' is your model formula, here is an example
y <- x1 <- x2 <- x3 <- 1:4
x4 <- matrix(1:8, 4)
form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4

## to gather variables 'model.frame.default(form)' is the easiest way 
## but it does too much: it drops 'NA' and transforms variables
## we want something more primitive

## first get variable names
vn <- all.vars(form)
#[1] "y"  "x1" "x2" "x3" "x4"

## 'get_all_vars(form)' gets you a data frame
## but it is buggy for matrix variables so don't use it
## instead, first use 'mget' to gather variables into a list
lst <- mget(vn)

## don't do 'data.frame(lst)'; it is buggy with matrix variables
## need to first protect matrix variables by 'I()' then do 'data.frame'
lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
dat <- data.frame(lst_protect)
str(dat)
#'data.frame':  4 obs. of  5 variables:
# $ y : int  1 2 3 4
# $ x1: int  1 2 3 4
# $ x2: int  1 2 3 4
# $ x3: int  1 2 3 4
# $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8

## note the 'AsIs' for matrix variable 'x4'
## in comparison, try the following buggy ones yourself
str(get_all_vars(form))
str(data.frame(lst))

      

Step 0: explicit subset

If you used the subset argument of lm or glm, start with an explicit subsetting:

## 'subset_vec' is what you pass to 'lm' via 'subset' argument
## it can either be a logical vector of length 'nrow(dat)'
## or a shorter positive integer vector giving position index
## note however, 'base::subset' expects logical vector for 'subset' argument
## so a rigorous check is necessary here
if (mode(subset_vec) == "logical") {
  if (length(subset_vec) != nrow(dat)) {
    stop("'logical' 'subset_vec' provided but length does not match 'nrow(dat)'")
    }
  subset_log_vec <- subset_vec
  } else if (mode(subset_vec) == "numeric") {
  ## check range
  ran <- range(subset_vec)
  if (ran[1] < 1 || ran[2] > nrow(dat)) {
    stop("'numeric' 'subset_vec' provided but values are out of bound")
    } else {
    subset_log_vec <- logical(nrow(dat))
    subset_log_vec[as.integer(subset_vec)] <- TRUE
    } 
  } else {
  stop("'subset_vec' must be either 'logical' or 'numeric'")
  }
dat <- base::subset(dat, subset = subset_log_vec)

      

Step 1: remove incomplete cases

dat <- na.omit(dat)

      

Do not skip this step even if you went through step 0: subset only drops rows for which the subsetting condition itself is NA; it does not remove rows that have missing values in the variables.
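A minimal sketch, on a made-up data frame toy (not part of the original answer), of how subset and na.omit treat missing values: subset only discards rows where the condition evaluates to NA or FALSE, while na.omit discards every row that has a missing value in any column.

```r
## hypothetical toy data: row 4 has a missing 'x'
toy <- data.frame(y = 1:4, x = c(1, 2, 3, NA))

## 'subset' keeps row 4, because the condition 'y > 1' is TRUE there
kept_by_subset <- subset(toy, subset = y > 1)
nrow(kept_by_subset)       # rows 2, 3 and 4 are kept
anyNA(kept_by_subset$x)    # the missing 'x' is still present

## 'na.omit' is what actually removes incomplete cases
complete <- na.omit(kept_by_subset)
nrow(complete)             # rows 2 and 3 remain
anyNA(complete$x)
```

This is why the two operations are best thought of as separate steps when mimicking what lm and glm do.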

Step 2: check the mode and convert

A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.

"logical",   it depends
"numeric",   nothing to do
"complex",   not allowed by 'model.matrix', though allowed by 'model.frame'
"character", converted to "numeric" with "factor" class by 'model.matrix'
"raw",       not allowed by 'model.matrix', though allowed by 'model.frame'

      

A logical variable is tricky. It can either be treated as a dummy variable (1 for TRUE; 0 for FALSE), hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether model.matrix thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

Some people may wonder why "integer" is not included. That is because an integer vector, like 1:4, has "numeric" mode (try mode(1:4)).

A data frame column can also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.

Our check is to produce an error when

  • a "complex" or "raw" is found;
  • a "logical" or "character" matrix variable is found;

and then proceed to convert "logical" and "character" to "numeric" of "factor" class.

## get mode of all vars
var_mode <- sapply(dat, mode)

## produce error if complex or raw is found
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")

## get class of all vars
var_class <- sapply(dat, class)

## produce error if an "AsIs" object has "logical" or "character" mode
if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
  stop("matrix variables with 'AsIs' class must be 'numeric'")
  }

## identify columns that needs be coerced to factors
ind1 <- which(var_mode %in% c("logical", "character"))

## coerce logical / character to factor with 'as.factor'
dat[ind1] <- lapply(dat[ind1], as.factor)

      

Note that if a data frame column is already a factor variable, it will not be included in ind1, because a factor variable has "numeric" mode (try mode(factor(letters[1:4]))).
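To make the distinction between mode and class concrete, here is a small sketch; the list cols and its entries are invented for illustration only.

```r
## hypothetical columns, one of each kind discussed above
cols <- list(int = 1:4,
             dbl = c(0.5, 1.5),
             lgl = c(TRUE, FALSE),
             chr = c("a", "b"),
             fct = factor(c("a", "b")),
             mat = I(matrix(1:4, 2)))

## mode: integers, doubles, factors and integer matrices are all "numeric"
col_mode <- sapply(cols, mode)
col_mode

## class (first element only) tells factors and 'AsIs' matrices apart
col_class <- sapply(cols, function (x) class(x)[1])
col_class
```

Only the "logical" and "character" entries would be picked up by a mode-based check such as the one used for ind1.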

Step 3: drop unused factor levels

We won't have unused factor levels for the factor variables converted in step 2, i.e., those indexed by ind1. However, factor variables that come with dat may have unused levels (often as a result of step 0 and step 1). We need to drop any possible unused levels from them.

## index of factor columns
fctr <- which(sapply(dat, is.factor))

## factor variables that have skipped explicit conversion in step 2
## don't simply do 'ind2 <- fctr[-ind1]'; buggy if 'ind1' is 'integer(0)'
ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr

## drop unused levels
dat[ind2] <- lapply(dat[ind2], droplevels)

      

Step 4: summarize factor variables

We are now ready to see what and how many factor levels are actually used by lm or glm:

## export factor levels actually used by 'lm' and 'glm'
lev <- lapply(dat[fctr], levels)

## count number of levels
nl <- lengths(lev)
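
Once the level counts are available, the offending variables can be read off directly. A short sketch, where the vector nl is filled with made-up values standing in for the result of step 4:

```r
## 'nl' stands in for the named vector of level counts from step 4
## (the values are made up for illustration)
nl <- c(f1 = 2L, f2 = 1L, f3 = 0L)

## factors with fewer than 2 levels are the ones that trigger the error
offenders <- names(nl)[nl < 2L]
offenders
#[1] "f2" "f3"
```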

      


To make your life easier, I've wrapped these steps into a function debug_contr_error.

Input:

  • dat is your data frame passed to lm or glm via the data argument;
  • subset_vec is the index vector passed to lm or glm via the subset argument.

Output: a list with

  • nlevels (a named integer vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives the levels for all factor variables.

The function produces a warning if there are no complete cases or no factor variables to summarize.

debug_contr_error <- function (dat, subset_vec = NULL) {
  if (!is.null(subset_vec)) {
    ## step 0
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' 'subset_vec' provided but length does not match 'nrow(dat)'")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' 'subset_vec' provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("'subset_vec' must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 1
  ## always needed: 'subset' does not drop rows with missing values
  dat <- stats::na.omit(dat)
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, class)
  if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
    stop("matrix variables with 'AsIs' class must be 'numeric'")
    }
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## step 3
  fctr <- which(sapply(dat, is.factor))
  if (length(fctr) == 0L) warning("no factor variables to summarize")
  ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr
  dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
  ## step 4
  lev <- lapply(dat[fctr], base::levels.default)
  nl <- lengths(lev)
  ## return
  list(nlevels = nl, levels = lev)
  }

      

Here is a tiny constructed example.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

#  y  x f1 f2
#1 1  1  a  A
#2 2  2  a  A
#3 3  3  b  A
#4 4 NA  b  B

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: chr  "A" "A" "A" "B"

lm(y ~ x + f1 + f2, dat)
#Error in 'contrasts<-'('*tmp*', value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

      

Good, we see an error. Now my debug_contr_error exposes that f2 ends up with a single level.

debug_contr_error(dat)
#$nlevels
#f1 f2 
# 2  1 
#
#$levels
#$levels$f1
#[1] "a" "b"
#
#$levels$f2
#[1] "A"

      

Note that the original short answer is hopeless here, as f2 is provided as a character variable and not as a factor variable.

## old answer
tmp <- na.omit(dat)
fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
sapply(fctr, nlevels)
#f1 
# 2 
rm(tmp, fctr)

      

Now let's look at an example with a matrix variable X.

## 'stringsAsFactors = TRUE' makes 'f' a factor also in R >= 4.0.0
dat <- data.frame(X = I(rbind(matrix(1:6, 3), NA)),
                  f = c("a", "a", "a", "b"),
                  y = 1:4,
                  stringsAsFactors = TRUE)

dat
#  X.1 X.2 f y
#1   1   4 a 1
#2   2   5 a 2
#3   3   6 a 3
#4  NA  NA b 4

str(dat)
#'data.frame':  4 obs. of  3 variables:
# $ X: 'AsIs' int [1:4, 1:2] 1 2 3 NA 4 5 6 NA
# $ f: Factor w/ 2 levels "a","b": 1 1 1 2
# $ y: int  1 2 3 4

lm(y ~ X + f, data = dat)
#Error in 'contrasts<-'('*tmp*', value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f 
#1

      

Note that a factor variable with no levels can cause a "contrasts error", too. You may wonder how a 0-level factor is possible. Well, it is legitimate: nlevels(factor(character(0))). Here you will end up with 0-level factors if you have no complete cases.

dat <- data.frame(y = 1:4,
                  x = rep(NA_real_, 4),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

lm(y ~ x + f1 + f2, dat)
#Error in 'contrasts<-'('*tmp*', value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f1 f2 
# 0  0    ## all values are 0
#Warning message:
#In debug_contr_error(dat) : no complete cases

      

Finally, let's look at a situation where f2 is a logical variable.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c(TRUE, TRUE, TRUE, FALSE))

dat
#  y  x f1    f2
#1 1  1  a  TRUE
#2 2  2  a  TRUE
#3 3  3  b  TRUE
#4 4 NA  b FALSE

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: logi  TRUE TRUE TRUE FALSE

      

Our debugger will predict a "contrasts error", but will this really happen?

debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

      

No, at least this one does not fail (the NA coefficient is due to the rank-deficiency of the model; don't worry):

lm(y ~ x + f1 + f2, data = dat)
#Coefficients:
#(Intercept)            x          f1b       f2TRUE  
#          0            1            0           NA

      

It is difficult for me to come up with an example that gives an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error, and in that case the debugger can locate the offending factor variable.

Perhaps some may argue that a logical variable is no different from a dummy variable. But try the simple example below: it does depend on your formula.

u <- c(TRUE, TRUE, FALSE, FALSE)
v <- c(1, 1, 0, 0)  ## "numeric" dummy of 'u'

model.matrix(~ u)
#  (Intercept) uTRUE
#1           1     1
#2           1     1
#3           1     0
#4           1     0

model.matrix(~ v)
#  (Intercept) v
#1           1 1
#2           1 1
#3           1 0
#4           1 0

model.matrix(~ u - 1)
#  uFALSE uTRUE
#1      0     1
#2      0     1
#3      1     0
#4      1     0

model.matrix(~ v - 1)
#  v
#1 1
#2 1
#3 0
#4 0

      




More flexible implementation using the "model.frame" method of lm

You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what lm and glm do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic such an internal process. Remember that the data actually used for model fitting can be very different from what you passed in.

Our steps are not fully consistent with such internal processing. For comparison, you can retrieve the result of the internal processing by using method = "model.frame" in lm and glm. Try this on the previously constructed tiny example dat where f2 is a character variable.

dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")

dat_internal
#  y x f1 f2
#1 1 1  a  A
#2 2 2  a  A
#3 3 3  b  A

str(dat_internal)
#'data.frame':  3 obs. of  4 variables:
# $ y : int  1 2 3
# $ x : int  1 2 3
# $ f1: Factor w/ 2 levels "a","b": 1 1 2
# $ f2: chr  "A" "A" "A"
## [.."terms" attribute is truncated..]

      

In practice, model.frame will only do step 0 and step 1. It also drops variables provided in your dataset but not used in your model formula. So a model frame may have both fewer rows and fewer columns than what you feed to lm and glm. Type coercion, as done in our step 2, is done later by model.matrix, where a "contrasts error" may be produced.
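The division of labor between model.frame and model.matrix can be seen in a small sketch. The data frame toy is made up for illustration, and tryCatch is used only to capture the error message instead of stopping.

```r
## hypothetical toy data: after the NA row is dropped, 'f' has only level "a"
toy <- data.frame(y = 1:4,
                  x = c(1, 2, 3, NA),
                  f = c("a", "a", "a", "b"),
                  stringsAsFactors = FALSE)

## building the model frame works: it only drops incomplete cases here
mf <- model.frame(y ~ x + f, data = toy)
nrow(mf)

## the error only shows up when the model matrix is built
msg <- tryCatch({
  model.matrix(y ~ x + f, data = mf)
  "no error"
}, error = function (e) conditionMessage(e))
msg
```

So inspecting the model frame is a way to look at the data right before the step that fails.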

There are a few advantages to first getting this internal model frame and then passing it to debug_contr_error (so that it essentially only performs steps 2 to 4).

Advantage 1: Variables not used in the model formula are ignored

## no variable 'f1' in formula
dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")

## compare the following
debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

debug_contr_error(dat_internal)$nlevels
#f2 
# 1 

      

Advantage 2: able to cope with transformed variables

It is valid to transform variables in the model formula, and model.frame will record the transformed ones instead of the original ones. Note that even if your original variable has no NA, the transformed one can have.

dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
#  y  x f
#1 1  1 a
#2 2  2 a
#3 3  3 a
#4 4 -1 b

lm(y ~ log(x) + f, data = dat)
#Error in 'contrasts<-'('*tmp*', value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels
#In addition: Warning message:
#In log(x) : NaNs produced

# directly using 'debug_contr_error' is hopeless here
debug_contr_error(dat)$nlevels
#f 
#2 

## this works
dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a

debug_contr_error(dat_internal)$nlevels
#f 
#1

      

With these advantages in mind, I wrote another function that wraps model.frame and debug_contr_error.

Input:

  • form is your model formula;
  • dat is the dataset passed to lm or glm via the data argument;
  • subset_vec is the index vector passed to lm or glm via the subset argument.

Output: a list with

  • mf (a data frame) gives the model frame (with the "terms" attribute dropped);
  • nlevels (a named integer vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives the levels for all factor variables.

## note: this function relies on 'debug_contr_error'
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## step 0
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' 'subset_vec' provided but length does not match 'nrow(dat)'")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' 'subset_vec' provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("'subset_vec' must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 0 and 1
  dat_internal <- stats::lm(form, data = dat, method = "model.frame")
  attr(dat_internal, "terms") <- NULL
  ## rely on 'debug_contr_error' for steps 2 to 4
  c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
  }

      

Try the previous log transform example.

debug_contr_error2(y ~ log(x) + f, dat)
#$mf
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced

      

Try subset_vec as well.

## or: debug_contr_error2(y ~ log(x) + f, dat, c(T, F, T, T))
debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
#$mf
#  y   log(x) f
#1 1 0.000000 a
#3 3 1.098612 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced

      


Model fitting per group and NA as factor levels

If you are fitting a model by group, you are more likely to get a "contrasts error". You need to

  1. split your data frame by the grouping variable (see ?split.data.frame);
  2. work through those data frames one by one, applying debug_contr_error2 (the lapply function can be helpful for this loop).
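The split-then-loop idea can be sketched as follows. To keep the block runnable on its own, count_levels is a simplified, hypothetical stand-in for debug_contr_error2 (it only counts the distinct values of categorical variables in each group's model frame), and the data frame toy is made up.

```r
## hypothetical data: 'g' is the grouping variable, 'f' a factor predictor
toy <- data.frame(y = 1:8,
                  x = c(1, 2, 3, 4, 5, 6, 7, NA),
                  f = factor(c("a", "b", "a", "b", "a", "a", "a", "b")),
                  g = rep(c("g1", "g2"), each = 4))

## count the levels that survive in one group's model frame
## (a simplified stand-in for 'debug_contr_error2')
count_levels <- function (d, form) {
  mf <- model.frame(form, data = d)
  is_cat <- sapply(mf, function (v) is.factor(v) || is.character(v))
  sapply(mf[is_cat], function (v) length(unique(as.character(v))))
}

## one summary per group
per_group <- lapply(split(toy, toy$g), count_levels, form = y ~ x + f)
per_group
```

In this made-up example group "g2" loses its only "b" row to the missing x, so f is left with a single level there while group "g1" keeps both.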

Some also told me that they cannot use na.omit on their data, because they would end up with too few rows to do anything sensible. This can be relaxed. In practice it is NA_integer_ and NA_real_ that have to be omitted, but NA_character_ can be retained: just add NA as a factor level. To achieve this, you need to loop through the variables in your data frame:

  • if a variable x is already a factor and anyNA(x) is TRUE, do x <- addNA(x). The "and" is important. If x has no NA, addNA(x) will add an unused <NA> level.
  • if a variable x is a character, do x <- factor(x, exclude = NULL) to coerce it to a factor. exclude = NULL will retain <NA> as a level.
  • if x is "logical", "numeric", "raw" or "complex", nothing should be changed. NA is just NA.

The <NA> factor level will not be dropped by droplevels or na.omit, and it is valid for building a model matrix. Check the following examples.

## x is a factor with NA

x <- factor(c(letters[1:4], NA))  ## default: 'exclude = NA'
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

na.omit(x)  ## NA is gone
#[1] a b c d
#[.. attributes truncated..]
#Levels: a b c d

x <- addNA(x)  ## now add NA into a valid level
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>  ## it appears here

droplevels(x)    ## it can not be dropped
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

na.omit(x)  ## it is not omitted
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

model.matrix(~ x)   ## and it is valid to be in a design matrix
#  (Intercept) xb xc xd xNA
#1           1  0  0  0   0
#2           1  1  0  0   0
#3           1  0  1  0   0
#4           1  0  0  1   0
#5           1  0  0  0   1

      

## x is a character with NA

x <- c(letters[1:4], NA)
#[1] "a" "b" "c" "d" NA 

as.factor(x)  ## this calls 'factor(x)' with default 'exclude = NA'
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

factor(x, exclude = NULL)      ## we want 'exclude = NULL'
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>          ## now NA is a level

      

Once you add NA as a level in a factor / character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use debug_contr_error2 to see what has happened.

For your convenience, I wrote a function for this NA preprocessing.

Input:

  • dat is your full dataset.

Output:

  • a data frame, with NA added as a level for factor / character.

NA_preproc <- function (dat) {
  for (j in seq_len(ncol(dat))) {
    x <- dat[[j]]
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
    }
  dat
  }
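
A short usage sketch on a made-up data frame toy, in which missing values occur only in categorical columns. The function is repeated (with seq_len in the loop) so that the block runs on its own.

```r
## copy of 'NA_preproc' so that this block is self-contained
NA_preproc <- function (dat) {
  for (j in seq_len(ncol(dat))) {
    x <- dat[[j]]
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
  }
  dat
}

## hypothetical data: NA appears in a factor and in a character column
toy <- data.frame(y  = 1:4,
                  f1 = factor(c("a", NA, "b", "a")),
                  f2 = c("u", "v", NA, "v"),
                  stringsAsFactors = FALSE)

n_before <- sum(complete.cases(toy))   # rows 2 and 3 are incomplete
toy2 <- NA_preproc(toy)
n_after <- sum(complete.cases(toy2))   # NA is now a level, not a missing value
c(before = n_before, after = n_after)

levels(toy2$f1)
levels(toy2$f2)
```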

      


Reproducible case studies and discussions

The following were specially selected as reproducible case studies, as I answered them with the three helper functions created here.

There are also a few other good-quality threads solved by other StackOverflow users:

This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using predict for prediction. Such behavior does not come from predict.lm or predict.glm, but from predict methods in some packages. Here are a few related threads on StackOverflow.

Also note that the philosophy of this answer is based on that of lm and glm. These two functions are a coding standard for many model fitting routines, but maybe not all model fitting routines behave similarly. For example, for the following it is not clear to me whether my helper functions would actually be helpful.

Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, the OP passed the names of their variables rather than their values to lm. Since a name is a single-value character, it is later coerced to a single-level factor and causes the error.


How to resolve this error after debugging?

In practice, people want to know how to resolve this matter, either at the statistical level or at the programming level.

If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. So you may simply turn to a coding solution and drop the offending variable. debug_contr_error2 returns nlevels, which helps you easily locate them. If you don't want to drop them, replace them by a vector of 1 (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let lm or glm deal with the resulting rank-deficiency.
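A minimal sketch of the "replace by a vector of 1" workaround, on a made-up data frame toy. The variable names are illustrative only.

```r
## hypothetical data: 'f2' collapses to one level once the NA row is dropped
toy <- data.frame(y  = c(1.2, 1.9, 3.4, 3.8, 5.1, 6.0),
                  x  = c(1, 2, 3, 4, 5, NA),
                  f1 = factor(c("a", "b", "a", "b", "a", "b")),
                  f2 = factor(c("A", "A", "A", "A", "A", "B")))

## replace the offending factor by a constant numeric vector of 1
toy$f2 <- 1

## the fit now goes through; 'f2' is aliased with the intercept,
## so its coefficient is reported as NA
fit <- lm(y ~ x + f1 + f2, data = toy)
coef(fit)
```

The NA coefficient is how lm reports the rank-deficiency; the remaining coefficients are estimated as if f2 were not in the model.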

If you are fitting models on a subset, there can be statistical solutions.

Fitting models by group does not necessarily require you to split your dataset by group and fit independent models. The following may give you a rough idea:

If you do split your data explicitly, you can easily get a "contrasts error", and thus have to adjust the model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for that group.
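One way to generate the formula per group is sketched below. The helper fit_one_group and the data frame toy are hypothetical; reformulate from the stats package builds the formula from the terms that are still usable in that group.

```r
## hypothetical data: in group "g2" the factor 'f' has a single level
toy <- data.frame(y = c(1.1, 2.3, 2.9, 4.2, 5.0, 5.8, 7.1, 8.3),
                  x = c(1, 2, 3, 4, 5, 6, 7, 8),
                  f = factor(c("a", "b", "a", "b", "a", "a", "a", "a")),
                  g = rep(c("g1", "g2"), each = 4))

fit_one_group <- function (d) {
  d <- droplevels(na.omit(d))
  ## keep 'f' in the formula only if it still has at least 2 levels
  rhs <- if (nlevels(d$f) >= 2L) c("x", "f") else "x"
  lm(reformulate(rhs, response = "y"), data = d)
}

fits <- lapply(split(toy, toy$g), fit_one_group)
lapply(fits, function (m) names(coef(m)))
```

Note that models fitted this way have different terms across groups, so their coefficients are not directly comparable.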

You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this, and you had better do stratified sampling to ensure the success of both model estimation on the training part and prediction on the testing part.
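A base R sketch of stratified sampling on a single factor, under the assumption that every level has at least 2 rows. The data frame toy, the 70% share and the seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

## hypothetical data with an unbalanced factor 'f'
toy <- data.frame(y = rnorm(20),
                  f = factor(rep(c("a", "b", "c"), times = c(12, 5, 3))))

## within each level of 'f', send roughly 70% of the rows to the training set,
## but always leave at least one row for the testing set
idx_by_level <- split(seq_len(nrow(toy)), toy$f)
train_idx <- unlist(lapply(idx_by_level, function (i) {
  n_train <- min(max(1, round(0.7 * length(i))), length(i) - 1)
  i[sample.int(length(i), n_train)]
}), use.names = FALSE)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

## every level shows up in both parts
table(train$f)
table(test$f)
```

With several factors in the model, the same idea would have to be applied to their combinations, which may not always be feasible with small datasets.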

+31




Perhaps, as a very quick first step, verify that each of your factors does indeed have at least 2 levels. The quick way I found was:



df %>% dplyr::mutate_all(as.factor) %>% str
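
A base R sketch of the same check that avoids the dplyr dependency. The data frame df here is made up, and the check only counts distinct non-missing values per column; it does not account for rows that are later dropped because of NA in other columns.

```r
## hypothetical data frame standing in for 'df'
df <- data.frame(a = c("x", "x", "x"),
                 b = c(1, 2, 2),
                 c = c("u", "v", NA))

## number of levels each column would have if coerced to a factor
n_distinct <- sapply(df, function (col) nlevels(as.factor(col)))
n_distinct

## columns that could not be used as a factor in a model
names(n_distinct)[n_distinct < 2L]
```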

      

0

