Convert comma separated columns to boolean columns

Question

Convert comma separated columns to boolean columns

I have the following comma separated data in one of my data.frame columns called services

.

> dput(structure(df$services[1:5]))
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
    "Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy", 
    "Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
    "Global Expense Management, Company Privacy Policy")

I would like to convert this data to separate columns in my dataframe and if the row contains this service set TRUE on that service column. Otherwise, set the value to FALSE.

For example, if I wanted my dataframe to look like this:

GlobalExpenseManagement    |    CompanyPrivacyPolicy   |   etc...
TRUE                            TRUE
TRUE                            FALSE
FALSE                           TRUE

I guess I will have to split the values with comma-sep, group them to remove duplicates, and then add them as names(df)

in my framework. However, I don't know how to iterate over the dataset and set true / false if the row contains this service.

Does anyone have any good ideas to do this?

Edit: merging data back

I am now trying to merge a new matrix with an existing file frame to replace the services with my new columns. I tried this based on @plafort's excellent answer below:

names(df) <- headnames
rbind(mat, df)

However, I am getting this error:

Name error (df) <- headnames: attribute 'names' [178] must be the same length as vector [7]

I also tried this:

final <- data.frame(cbind(mat, df))

But the columns seem to be missing from df

. How do I concatenate columns from mat

to df

?

+3

r csv dataframe

user1477388 08 june 15 at 16:42

source to share

2 answers

Try:

splitup <- sapply(unlist(lst), strsplit, ', ')
headnames <- unique(unlist(splitup))
(mat <- t(unname(sapply(splitup, function(x) headnames %in% x))))

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[2,]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[3,] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

We'll start by splitting the data by comma and using it unlist

to directly access the elements. headnames

does, as you say, searches for unique category headings. The last line first maps the heading categories to each list item, then removes the auto-naming with unname

and passes the data back as we would like with t

.

To add names at the top, we assign unique names that were previously defined as column headers using the function colnames

. The ordering works correctly because it is the same vector headnames

that was used to observe the rows.

colnames(mat) <- headnames

Global Expense Management Company Privacy Policy
[1,]                      TRUE                   TRUE
[2,]                      TRUE                  FALSE
[3,]                     FALSE                   TRUE
[4,]                      TRUE                   TRUE
[5,]                      TRUE                   TRUE...

+3

Pierre lafortune 08 june 15 at 17:08

source to share

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-06-08T17:45:55+0000

I would consider cSplit_e

from my splitstackshape package. The result is obtained as binary "1" and "0" instead of TRUE

and FALSE

, but this should be easy to convert.

Sample data:

df <- data.frame(services = I(
  list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
       "Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy", 
       "Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
       "Global Expense Management, Company Privacy Policy")))

Convert the "services" column to vector

instead list

:

df$services <- unlist(df$services)

Now split it up:

library(splitstackshape)
cSplit_e(df, "services", ",", type = "character", fill = 0)
##                                                                                                                                                  services
## 1                                                                                                       Global Expense Management, Company Privacy Policy
## 2                                                                                                             Removal Services, Global Expense Management
## 3                                                              Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy
## 4 Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy
## 5                                                                                                       Global Expense Management, Company Privacy Policy
##   services_Ancillary Services services_Company Privacy Policy services_Exception &amp; Cost Admin
## 1                           0                               1                                   0
## 2                           0                               0                                   0
## 3                           0                               1                                   1
## 4                           1                               1                                   1
## 5                           0                               1                                   0
##   services_Global Cost Estimate services_Global Expense Management services_Perm Storage
## 1                             0                                  1                     0
## 2                             0                                  1                     0
## 3                             1                                  0                     0
## 4                             1                                  1                     1
## 5                             0                                  1                     0
##   services_Removal Services
## 1                         0
## 2                         1
## 3                         1
## 4                         1
## 5                         0

Convert comma separated columns to boolean columns

Edit: merging data back

More articles: