Replacing strings using a lookup table with dplyr
I am trying to create a lookup table in R to get my data into the same format that the company I work for uses.
It concerns different education categories that I want to merge using dplyr.
library(dplyr)
# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
tbl_df(data)
# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
# Assign lookup table
data$X1 <- lut[data$X1]
But in my output, my old values are replaced with the wrong ones, i.e. not the ones I specified in the lookup table. Rather, it seems like the lookup table is assigned randomly.
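One likely explanation, assuming X1 is stored as a factor (the default for strings in data.frame() before R 4.0.0): indexing a named vector with a factor uses the factor's integer codes, not its labels, so the lookup becomes positional instead of name-based. The sketch below reproduces the effect on a small hypothetical vector x and shows a possible fix with as.character():

```r
# Named lookup vector, as in the question
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

# Assumption: the column is a factor; its levels are sorted alphabetically
# (Economics = 1, Electric Engineering = 2, Political Science = 3)
x <- factor(c("Economics", "Political Science", "Electric Engineering"))

# Indexing by a factor uses the integer codes 1, 3, 2, i.e. positions in lut
wrong <- unname(lut[x])

# Converting to character first makes the lookup go by name
right <- unname(lut[as.character(x)])

print(wrong)  # "Engineering" "Social Science" "Engineering"
print(right)  # "Social Science" "Social Science" "Engineering"
```

If that assumption holds for the real data, data$X1 <- lut[as.character(data$X1)] should give the intended mapping.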
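library(reshape2)  # provides melt()
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
# Lookup table as a named list: original value = new category
lut <- list("Mechanichal Engineering" = "Engineering",
"Electric Engineering" = "Engineering",
"Political Science" = "Social Science",
"Economics" = "Social Science")
# melt() turns the list into a data frame with columns 'value' (new) and 'L1' (old)
lut2 <- melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,replace=TRUE)))
# match() finds the lookup row for each value of X1
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)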
======================= ==============
X1 new
======================= ==============
Political Science Social Science
Political Science Social Science
Mechanichal Engineering Engineering
Mechanichal Engineering Engineering
Political Science Social Science
Political Science Social Science
======================= ==============
I found that the best way to do this is to use recode() from the car package.
# Note that dplyr also has a recode function, so load car after dplyr
require(dplyr)
require(car)
The data are sampled from four education categories.
education <- c("Mechanichal Engineering",
"Electric Engineering","Political Science","Economics")
data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))
Using recode() on the data, I recode the categories:
lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
"Electric Engineering" = "Engineering";
"Political Science" = "Social Science";
"Mechanichal Engineering" = "Engineering"'))
To check that it was done correctly, join the original data and the recoded data:
data <- full_join(data, lut, by = "ID")
head(data)
ID X1 X2
1 1 Political Science Social Science
2 2 Economics Social Science
3 3 Electric Engineering Engineering
4 4 Political Science Social Science
5 5 Economics Social Science
6 6 Mechanichal Engineering Engineering
With recode, you don't need to sort the data before recoding it.
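Since the question asks about dplyr, the same mapping can probably also be written with dplyr's own recode(), splicing the named lookup vector in with !!!. This is only a sketch and not part of the answer above: the small data frame df is a hypothetical stand-in for the real data, dplyr::recode is called explicitly in case car is loaded as well, and the column is converted to character first so the match is made by name.

```r
library(dplyr)

# Named lookup vector: original value = new category
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

# Hypothetical stand-in for the real data
df <- data.frame(X1 = c("Economics", "Mechanichal Engineering", "Political Science"),
                 stringsAsFactors = FALSE)

# !!! splices the named vector into recode() as old = new arguments
df <- df %>%
  mutate(X2 = dplyr::recode(as.character(X1), !!!lut))

print(df)
```

Values of X1 that do not appear in lut are left unchanged by dplyr::recode unless a .default is given.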