R: Split variable column into multiple (unbalanced) comma columns
I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of several different "categories" that I want to split into where it shows 1 category per column (similar to what split would do in stata). For example:
# Name Age Number Events First
# Karen 24 8 Triathlon/IM,Marathon,10k,5k 0
# Kurt 39 2 Half-Marathon,10k 0
# Leah 18 0 1
And I want it to look like this:
# Name Age Number Events_1 Event_2 Events_3 Events_4 First
# Karen 24 8 Triathlon/IM Marathon 10k 5k 0
# Kurt 39 2 Half-Marathon 10k NA NA 0
# Leah 18 0 NA NA NA NA 1
I looked at stackoverflow but didn't find anything that works (all of this is giving me some kind of error). Any suggestions would be appreciated.
Note. Maybe it doesn't matter, but the largest number of categories for 1 person is 19, so I need to create Event_1: Event_19
Commentary: The previous ones suggested a separate feature, however this feature doesn't seem to work with my dataset. When I enter a function that the program runs, but when it's finished nothing changes, no exit and no error code. When I tried to use other suggestions made in other threads I got error messages. However, I finally got it working with the cSplit function. Thanks for the help!!!
source to share
From Ananda package splitstackshape
:
cSplit(df, "Events", sep=",")
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Or with tidyr
:
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
# Name Age Number Events_1 Events_2 Events_3 Events_4 First
#1 Karen 24 8 Triathlon/IM Marathon 10k 5k 0
#2 Kurt 39 2 Half-Marathon 10k <NA> <NA> 0
#3 Leah 18 0 NA <NA> <NA> <NA> 1
With package data.table
:
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=F]
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Data
df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt",
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))
source to share