R: Split variable column into multiple (unbalanced) comma columns

Question

I have a dataset of 25 variables and over 2 million observations. One of my variables combines several different "categories", and I want to split it so that each category gets its own column (similar to what split does in Stata). For example:

# Name      Age     Number               Events                      First 
# Karen      24        8         Triathlon/IM,Marathon,10k,5k         0
# Kurt       39        2         Half-Marathon,10k                    0 
# Leah       18        0                                              1

And I want it to look like this:

# Name   Age  Number Events_1        Events_2     Events_3     Events_4      First
# Karen   24    8     Triathlon/IM    Marathon       10k         5k             0
# Kurt    39    2     Half-Marathon   10k            NA          NA             0 
# Leah    18    0     NA              NA             NA          NA             1

I looked around on Stack Overflow but didn't find anything that works (everything I tried gives me some kind of error). Any suggestions would be appreciated.

Note: It may not matter, but the largest number of categories for one person is 19, so I need to create Events_1 through Events_19.

Comment: Earlier suggestions pointed to the separate function, but it doesn't seem to work with my dataset. When I run it, the program finishes, but nothing changes: there is no output and no error message. When I tried other suggestions from other threads, I got error messages. However, I finally got it working with the cSplit function. Thanks for the help!!!

+3

split r

Kfruge 23 jul. 15 at 2:56

1 answer

Pierre lafortune · Accepted Answer · 2015-07-23T03:20:04+0000

With cSplit from Ananda's splitstackshape package:

library(splitstackshape)
cSplit(df, "Events", sep=",")
#    Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3: Leah   18      0     1            NA       NA       NA       NA

Or with tidyr:

library(tidyr)
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
#   Name Age Number               Events_1 Events_2 Events_3 Events_4 First
#1 Karen  24      8           Triathlon/IM Marathon      10k       5k     0
#2  Kurt  39      2          Half-Marathon      10k     <NA>     <NA>     0
#3 Leah   18      0                     NA     <NA>     <NA>     <NA>     1

With the data.table package:

library(data.table)
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=FALSE]
#    Name Age Number First               Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0           Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0          Half-Marathon      10k       NA       NA
#3: Leah   18      0     1                     NA       NA       NA       NA

Data

df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt", 
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L, 
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c("               NA", 
"         Half-Marathon,10k", "     Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name", 
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA, 
-3L))
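
The snippets above hard-code four output columns, while the question mentions up to 19 categories per person. As a minimal sketch (not part of the original answer), the column count can be derived from the data itself in base R. The toy data frame and the names pieces, n_max, padded and wide are illustrative assumptions, not anything from the asker's dataset.

```r
# Hypothetical toy data mirroring the layout in the question
df <- data.frame(
  Name   = c("Karen", "Kurt", "Leah"),
  Age    = c(24L, 39L, 18L),
  Number = c(8L, 2L, 0L),
  Events = c("Triathlon/IM,Marathon,10k,5k", "Half-Marathon,10k", NA),
  First  = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Split each row on literal commas; an NA input stays a single NA piece
pieces <- strsplit(as.character(df$Events), ",", fixed = TRUE)

# The widest row decides how many Events_* columns are needed
n_max <- max(lengths(pieces))

# Indexing past the end of a vector yields NA, which pads short rows
padded <- matrix(
  unlist(lapply(pieces, `[`, seq_len(n_max))),
  ncol = n_max, byrow = TRUE
)
colnames(padded) <- paste0("Events_", seq_len(n_max))

# Drop the original column and attach the new ones
wide <- cbind(
  df[setdiff(names(df), "Events")],
  as.data.frame(padded, stringsAsFactors = FALSE)
)
print(wide)
```

On the asker's data n_max would presumably come out as 19. The same value could also replace the hard-coded 1:4 in the tidyr and data.table calls. If the values carry stray spaces, as in the Data block above, applying trimws() to the pieces first may help.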
