Convert comma separated columns to boolean columns
I have the following comma separated data in one of my data.frame columns called services
.
> dput(structure(df$services[1:5]))
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management",
"Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy",
"Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy",
"Global Expense Management, Company Privacy Policy")
I would like to convert this data to separate columns in my dataframe and if the row contains this service set TRUE on that service column. Otherwise, set the value to FALSE.
For example, if I wanted my dataframe to look like this:
GlobalExpenseManagement | CompanyPrivacyPolicy | etc...
TRUE TRUE
TRUE FALSE
FALSE TRUE
I guess I will have to split the values ββwith comma-sep, group them to remove duplicates, and then add them as names(df)
in my framework. However, I don't know how to iterate over the dataset and set true / false if the row contains this service.
Does anyone have any good ideas to do this?
Edit: merging data back
I am now trying to merge a new matrix with an existing file frame to replace the services with my new columns. I tried this based on @plafort's excellent answer below:
names(df) <- headnames rbind(mat, df)
However, I am getting this error:
Name error (df) <- headnames: attribute 'names' [178] must be the same length as vector [7]
I also tried this:
final <- data.frame(cbind(mat, df))
But the columns seem to be missing from df
. How do I concatenate columns from mat
to df
?
source to share
I would consider cSplit_e
from my splitstackshape package. The result is obtained as binary "1" and "0" instead of TRUE
and FALSE
, but this should be easy to convert.
Sample data:
df <- data.frame(services = I(
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management",
"Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy",
"Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy",
"Global Expense Management, Company Privacy Policy")))
Convert the "services" column to vector
instead list
:
df$services <- unlist(df$services)
Now split it up:
library(splitstackshape)
cSplit_e(df, "services", ",", type = "character", fill = 0)
## services
## 1 Global Expense Management, Company Privacy Policy
## 2 Removal Services, Global Expense Management
## 3 Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy
## 4 Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy
## 5 Global Expense Management, Company Privacy Policy
## services_Ancillary Services services_Company Privacy Policy services_Exception & Cost Admin
## 1 0 1 0
## 2 0 0 0
## 3 0 1 1
## 4 1 1 1
## 5 0 1 0
## services_Global Cost Estimate services_Global Expense Management services_Perm Storage
## 1 0 1 0
## 2 0 1 0
## 3 1 0 0
## 4 1 1 1
## 5 0 1 0
## services_Removal Services
## 1 0
## 2 1
## 3 1
## 4 1
## 5 0
source to share
Try:
splitup <- sapply(unlist(lst), strsplit, ', ')
headnames <- unique(unlist(splitup))
(mat <- t(unname(sapply(splitup, function(x) headnames %in% x))))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE TRUE TRUE TRUE FALSE FALSE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
We'll start by splitting the data by comma and using it unlist
to directly access the elements. headnames
does, as you say, searches for unique category headings. The last line first maps the heading categories to each list item, then removes the auto-naming with unname
and passes the data back as we would like with t
.
To add names at the top, we assign unique names that were previously defined as column headers using the function colnames
. The ordering works correctly because it is the same vector headnames
that was used to observe the rows.
colnames(mat) <- headnames
Global Expense Management Company Privacy Policy
[1,] TRUE TRUE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE...
source to share