How to create a function that will split continuous variables only into groups with the same size
I would like to run a function over my data frame that finds only the continuous variables and adds new categorical variables by dividing each continuous variable into 2 groups of equal size. I have some code that splits a variable into groups and adds it as a new categorical variable, but when I tried to use it in a function, it didn't work. Also, how can I skip the non-continuous variables? Here is the toy data frame:
df <- read.table(text = " birds wolfs
9 7
8 4
2 8
2 3
8 3
1 2
7 1
1 5
9 7
8 7 ",header = TRUE)
my function:
for (i in names(df)) function (x) { as.factor( as.numeric( cut(df$i,2))) }
Here are some possible problems with your function
for (i in names(df)) function (x) { as.factor( as.numeric( cut(df$i,2))) }
- I would use df[,i] instead of df$i to subset the column, because df$i is not evaluated as intended: it looks for a column literally named "i" rather than using the value of i.
- There is no need for the anonymous function call function(x).
- The output is not stored in another variable.
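As a small illustration of the first point, here is a sketch on a hypothetical three-row data frame (`toy` is an illustrative name, not part of the question). It shows that `$` does not evaluate the loop variable, while `[, i]` and `[[i]]` do:

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(birds = c(9, 8, 2), wolfs = c(7, 4, 8))
i <- "birds"

by_dollar  <- toy$i      # NULL: `$` looks for a column literally named "i"
by_bracket <- toy[, i]   # the 'birds' column, because i is evaluated
by_double  <- toy[[i]]   # same result as toy[, i]

print(is.null(by_dollar))
print(by_bracket)
```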
The first two can be easily fixed. For the third, we create an empty list with length equal to the number of columns in 'df' (ncol(df)). This list ('lst') can be used to store the results.
lst <- vector('list', ncol(df))
Now we loop through the columns of 'df' (assuming all columns are numeric) and apply cut to each column (cut(df[,i], ..)).
for(i in seq_along(df)) {
lst[[i]] <- as.factor(as.numeric(cut(df[,i], 2)))
}
We can then assign the output 'lst' to new columns
df[paste0(names(df), 'new')] <- lst
Another option, instead of a for loop, would be lapply. The results from lapply can be assigned directly to new columns.
df[paste0(names(df), 'new')] <- lapply(df, function(x)
factor(cut(x, 2, labels=FALSE)))
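Since the question asks for a function, the lapply approach could be wrapped up roughly as follows. This is only a sketch: `add_cut_cols`, `n_groups` and `suffix` are hypothetical names, it assumes every column is numeric, and the question's data frame is rebuilt here so the block runs on its own.

```r
# Hypothetical helper: add one binned factor column per column of `data`
add_cut_cols <- function(data, n_groups = 2, suffix = "new") {
  # cut(..., labels = FALSE) returns integer bin codes 1..n_groups
  data[paste0(names(data), suffix)] <- lapply(data, function(x)
    factor(cut(x, n_groups, labels = FALSE)))
  data
}

# Same values as the question's toy data
df <- data.frame(birds = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8),
                 wolfs = c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7))
res <- add_cut_cols(df)
str(res)
```

The function returns a modified copy, so the result has to be assigned (for example `df <- add_cut_cols(df)`) for the new columns to be kept.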
Based on the OP's comments about applying cut only to numeric columns (and excluding binary columns), we create a logical index with vapply. It loops through the columns of 'df2' and checks whether each column is numeric (is.numeric(x)) and whether it contains values other than 0 and 1 (!all(x %in% 0:1)).
indx <- vapply(df2, function(x) !all(x %in% 0:1) & is.numeric(x), logical(1L))
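To make the index concrete, here is a sketch on a hypothetical mixed data frame (`mixed` and `keep` are illustrative names). It tests is.numeric first and uses `&&`, a small variation on the line above, so the 0/1 check is skipped for non-numeric columns:

```r
# Hypothetical data: one binary, one continuous, one character column
mixed <- data.frame(flag  = c(0, 1, 1, 0),
                    score = c(2.5, 3.1, 0.4, 7.8),
                    label = c("a", "b", "c", "d"))

# TRUE only for numeric columns that hold something other than 0/1
keep <- vapply(mixed, function(x) is.numeric(x) && !all(x %in% 0:1),
               logical(1L))
print(keep)
print(names(mixed)[keep])
```

Only 'score' passes both checks: 'flag' is numeric but binary, and 'label' is not numeric.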
Using the same code as above, but restricted to the columns selected by 'indx'
lst <- vector('list', ncol(df2[indx]))
for(i in seq_along(df2[indx])) {
lst[[i]] <- as.factor(as.numeric(cut(df2[indx][,i], 2)))
}
df2[paste0(names(df2)[indx], 'new')] <- lst
Or using lapply
df2[paste0(names(df2)[indx], 'new')] <- lapply(df2[indx],
function(x) factor(cut(x, 2, labels=FALSE)))
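One caveat on "groups of equal size": cut(x, 2) divides the range of x into two equal-width intervals, so the two groups need not contain the same number of rows. If equal counts are what is wanted, one possible variation (not part of the code above) is to break at the median with quantile. A sketch using the question's 'birds' values:

```r
birds <- c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8)

# Equal-width bins, as in the code above
eq_width <- table(cut(birds, 2, labels = FALSE))

# Equal-count bins: use the minimum, median and maximum as breaks
brks <- quantile(birds, probs = c(0, 0.5, 1))
eq_count <- table(cut(birds, breaks = brks, include.lowest = TRUE,
                      labels = FALSE))

print(eq_width)   # group sizes 4 and 6
print(eq_count)   # group sizes 5 and 5
```

With many tied values the groups can still come out unequal, and cut will raise an error if the quantile breaks are not unique, so this is a starting point rather than a general solution.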
data
set.seed(24)
df1 <- data.frame(col1=sample(0:1, 10, replace=TRUE),
col2=rnorm(10), col3=letters[1:10])
#df - OP dataset
df2 <- cbind(df1, df)