Are loops evil in R?

I've heard that you shouldn't force a procedural style of programming onto R, and I find that quite difficult. I just solved a problem with a for loop. Is that wrong? Is there a better, more "R-style" solution?

Problem: I have two columns, Col1 and Col2. Col1 contains job titles that have been entered in free form. I want to use Col2 to group these job titles into categories (so that "Junior Technician", "Engineering technician" and "Mech. tech." are all listed as "Technician").

I did it like this:

jobcategories<-list(
"Junior Technician|Engineering technician|Mech. tech." = "Technician",
"Manager|Senior Manager|Group manager|Pain in the ****" = "Manager",
"Admin|Administrator|Group secretary" = "Administrator")

# For each regex pattern, assign its category to every row whose title matches
for (currentjob in names(jobcategories)) {
  df$Col2[grep(currentjob, df$Col1)] <- jobcategories[[currentjob]]
}

      

This gives the correct results, but I cannot shake the feeling that (because of my procedural background) I am not using R properly. Can an R expert put me out of my misery?

EDIT

I was asked to provide the original data. Unfortunately, I cannot provide it because it contains confidential information. It's basically two columns. The first column contains over 400 rows of different job titles (and the odd personal name). There are about 20 different categories into which the 400 titles can be grouped. The second column starts out as NA and is populated as the for loop runs.

+3




3 answers


for loops are not "evil" in R, but they are generally slow compared to vectorized methods and are often not the best solution available. However, they are easy to implement and easy to understand, and you shouldn't underestimate the value of either of those.



So in my opinion you should use a for loop if you need to get something done quickly, can't see a better way to do it, and don't need to worry too much about speed.

+1




You are correct that for loops are often discouraged in R, and in my experience this is for two main reasons:

Growing objects

As eloquently described in Circle 2 of The R Inferno, it can be extremely inefficient to grow an object one element at a time, which is often a temptation with for loops. For example, this is a fairly common but inefficient workflow because it reallocates output on each iteration of the loop:

output <- c()
for (idx in indices) {
  scalar <- compute.new.scalar(idx)
  output <- c(output, scalar)
}

      

This inefficiency can be removed by preallocating output to the desired size and using a for loop, or by using a function like sapply.
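As a minimal sketch of both fixes: compute.new.scalar is not defined in the answer, so the squaring function below is only a hypothetical stand-in, and indices is assumed to be a small integer vector.

```r
# Hypothetical stand-in for compute.new.scalar (not defined in the answer)
compute.new.scalar <- function(idx) idx^2
indices <- 1:10

# Option 1: preallocate the full output vector, then fill it in place
output_prealloc <- numeric(length(indices))
for (i in seq_along(indices)) {
  output_prealloc[i] <- compute.new.scalar(indices[i])
}

# Option 2: let sapply collect the results into a vector
output_sapply <- sapply(indices, compute.new.scalar)

# vapply is a stricter variant that checks each result is a single numeric
output_vapply <- vapply(indices, compute.new.scalar, numeric(1))
```

All three variants allocate the result once instead of copying it on every iteration, which is the point the answer is making.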

Not using fast vectorized alternatives



A second source of inefficiency is executing a for loop for a fast operation when a vectorized alternative exists. For example, consider the following code:

s <- 0
for (elt in x) {
  s <- s + elt
}

      

This is a for loop around a very fast operation (adding two numbers), and the overhead of the loop will be significant compared to the vectorized function sum, which sums all the elements in a vector. sum is fast because it is implemented in C, so it will be more efficient to do s <- sum(x) than to use a for loop (not to mention less typing). Sometimes it takes more creativity to figure out how to replace a for loop that has a fast interior with a vectorized alternative (cumsum and diff come up a lot), but this can lead to significant efficiency gains. In cases where you have a fast loop interior but cannot figure out how to use vectorized functions to achieve the same result, I have found that reimplementing the loop with the Rcpp package can provide a faster alternative.
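To illustrate the cumsum and diff remark, here is a small sketch with a made-up vector x. Each loop is paired with the vectorized call that should give the same result.

```r
# Made-up example data
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

# Loop version: running total of x
running_loop <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  running_loop[i] <- total
}

# Vectorized version of the same running total
running_vec <- cumsum(x)

# Loop version: difference between each element and the one before it
steps_loop <- numeric(length(x) - 1)
for (i in seq_len(length(x) - 1)) {
  steps_loop[i] <- x[i + 1] - x[i]
}

# Vectorized version of the same differences
steps_vec <- diff(x)
```

On a vector this short the speed difference is negligible; the gain the answer describes would only be expected to show up on long vectors.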

In short ...

For loops can be slow if you are growing objects incorrectly, or if you have a very fast loop interior and the whole loop can be replaced with a vectorized operation. Otherwise you probably aren't losing much efficiency, since the apply family of functions also loops internally.

+7




Usually you will find that there is a way to do things without a loop.

For example:

If you create a simple table mapping your old jobs to the new ones:

job_map <- data.frame(
  current = c("Junior Technician", "Engineering technician", "Mech. tech.",
              "Manager", "Senior Manager", "Group manager", "Pain in the ****",
              "Admin", "Administrator", "Group secretary"),
  new = c(rep("Technician",3), rep("Manager",4), rep("Administrator",3))
)

      

And you have a table of job titles to reclassify:

my_df <- data.frame(job_name = sample(job_map$current, 50, replace = TRUE))

      

The match function will help you:

my_df$new <- job_map$new[match(my_df$job_name, job_map$current)]    
my_df
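A related sketch, not taken from the answer: the same kind of lookup can be written as a named character vector. The titles vector and the name "John Smith" are made up here to show what might happen to an entry that is not in the lookup, such as the odd personal name mentioned in the question.

```r
# A shortened version of the mapping, stored as a named character vector
job_lookup <- c(
  "Junior Technician"      = "Technician",
  "Engineering technician" = "Technician",
  "Senior Manager"         = "Manager",
  "Admin"                  = "Administrator"
)

# Hypothetical input, including one entry that is not in the lookup
titles <- c("Admin", "Senior Manager", "John Smith", "Junior Technician")

# Index the lookup by name; unname() drops the names carried over from it
category <- unname(job_lookup[titles])

# Entries missing from the lookup come back as NA and can be listed for review
unmatched <- unique(titles[is.na(category)])
```

Like match, this relies on exact string equality, so unlike the grep loop in the question it does not pick up partial matches.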

      

+2

