How to create a function that will split continuous variables only into groups with the same size

I would like to run a function over my data frame that finds only the continuous variables and adds new categorical variables by dividing each continuous variable into 2 groups of equal size. I have some code that splits a variable into groups and adds it as a new categorical variable, but when I tried to use it in a function, it didn't work. Also, how can I skip the non-continuous variables? Here is the toy data frame:

df <- read.table(text = "         birds    wolfs     
                                    9         7    
                                    8         4    
                                    2         8    
                                    2         3    
                                    8         3    
                                    1         2    
                                    7         1    
                                    1         5    
                                    9         7    
                                    8         7     ",header = TRUE)

      

my function:

for (i in names(df)) function (x) { as.factor( as.numeric( cut(df$i,2)))  }

      



1 answer


Here are some possible problems with your function

for (i in names(df)) function (x) { as.factor( as.numeric( cut(df$i,2)))  }

      

  • I would use df[,i] to subset the column instead of df$i, because df$i is not evaluated properly: it looks for a column literally named "i".
  • There is no need for the anonymous function call function(x).
  • The output is not stored in another variable.
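To illustrate the first point, here is a small sketch on a shortened copy of the toy data (the object names by_dollar and by_bracket are made up for the example). It shows that `$` ignores the value stored in i, while `[[` or df[, i] uses it.

```r
# Shortened toy data with the same column names as in the question
df <- data.frame(birds = c(9, 8, 2, 2, 8), wolfs = c(7, 4, 8, 3, 3))

i <- "birds"

# `$` does not evaluate `i`; it looks for a column literally named "i"
by_dollar <- df$i

# `[[` (or df[, i]) uses the value stored in `i`
by_bracket <- df[[i]]

is.null(by_dollar)  # TRUE, because there is no column called "i"
by_bracket          # the values of the 'birds' column
```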

The first two can be fixed easily. For the third, we create an empty list ('lst') with length equal to the number of columns in 'df' (ncol(df)) to store the results.

lst <- vector('list', ncol(df))

      

Now we loop through the columns of 'df' (assuming all columns are numeric) and apply cut to each column (cut(df[,i], ..)).

for(i in seq_along(df)) {
        lst[[i]] <- as.factor(as.numeric(cut(df[,i], 2)))
 }

      

We can assign new columns with the output 'lst'

df[paste0(names(df), 'new')] <- lst

      

Another option, instead of a for loop, would be lapply. The results from lapply can be assigned directly to new columns.



df[paste0(names(df), 'new')] <- lapply(df, function(x)
                   factor(cut(x, 2, labels=FALSE)))
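One caveat: cut(x, 2) splits the range of x into two intervals of equal width, so the two groups will not necessarily contain the same number of rows. If groups of equal size are what is needed, one possible sketch is to break at the median instead. The helper name split_at_median is hypothetical and not part of the code above, and this assumes the median does not coincide with the minimum or maximum (otherwise the breaks are not unique and cut fails); heavy ties around the median can also leave the groups unequal.

```r
# Toy data from the question
df <- data.frame(birds = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8),
                 wolfs = c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7))

# Equal-width split: two intervals of the same width over range(x)
width_split <- cut(df$birds, 2, labels = FALSE)

# Hypothetical helper: break at the median so each group gets half the rows
split_at_median <- function(x) {
  brks <- quantile(x, probs = c(0, 0.5, 1))
  factor(cut(x, breaks = brks, include.lowest = TRUE, labels = FALSE))
}

df[paste0(names(df), 'new')] <- lapply(df, split_at_median)

table(width_split)   # group sizes for the equal-width split
table(df$birdsnew)   # group sizes for the median split
```

On this toy data the equal-width split of 'birds' gives groups of 4 and 6 rows, while the median split gives 5 and 5.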

      


Based on the OP's comments about applying cut only to numeric columns (and also excluding binary columns), we create a logical index with vapply. It goes through the columns of 'df2' and checks whether each one is numeric (is.numeric(x)) and contains values other than 0 and 1 (!all(x %in% 0:1)).

 indx <- vapply(df2, function(x) !all(x %in% 0:1) & is.numeric(x), logical(1L))

      

Using the same code as above, now with the 'indx' vector:

   lst <- vector('list', ncol(df2[indx]))
   for(i in seq_along(df2[indx])) {
       lst[[i]] <- as.factor(as.numeric(cut(df2[indx][,i], 2)))
    }
  df2[paste0(names(df2)[indx], 'new')] <- lst

      

Or using lapply

 df2[paste0(names(df2)[indx], 'new')] <- lapply(df2[indx],
                  function(x) factor(cut(x, 2, labels=FALSE)))
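Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is only a sketch: the name add_cut_columns and its arguments groups and suffix are hypothetical, and the example data frame is a reduced stand-in for 'df2'.

```r
# Hypothetical helper: the name and arguments are illustrative only
add_cut_columns <- function(data, groups = 2, suffix = "new") {
  # TRUE for numeric columns that are not purely 0/1
  indx <- vapply(data, function(x) is.numeric(x) && !all(x %in% 0:1),
                 logical(1L))
  # cut() each selected column and append the result as a new factor column
  data[paste0(names(data)[indx], suffix)] <- lapply(data[indx],
        function(x) factor(cut(x, groups, labels = FALSE)))
  data
}

# Reduced stand-in for 'df2': binary, continuous, character and count columns
set.seed(24)
df2 <- data.frame(col1 = sample(0:1, 10, replace = TRUE),
                  col2 = rnorm(10),
                  col3 = letters[1:10],
                  birds = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8))

res <- add_cut_columns(df2)
names(res)  # original columns plus 'col2new' and 'birdsnew'
```

The binary column 'col1' and the non-numeric column 'col3' are left alone; only 'col2' and 'birds' get a new companion column.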

      

data

set.seed(24)
df1 <- data.frame(col1=sample(0:1, 10, replace=TRUE),
           col2=rnorm(10), col3=letters[1:10])
#df - OP dataset

df2 <- cbind(df1, df)

      
