Effective replacement for-loop when splitting strings in R
I have a large framework (20 columns,> 100k rows) and need to split a column of character strings into multiple new columns.
The first 3 observations in this column look something like this:
scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))
which looks like this:
scan
1 CT Cervical Sp,CT Head Plain
2 II < 1 Hour
3 L-S Spine,L-S Spine
I need to split this into 5 columns (each observation has a maximum of 5 substrings), and for observations with fewer substrings, I want the remaining columns to fill with NA. I am currently using this code:
scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))
for(i in 1:nrow(scans)){
scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}
which works and outputs my desired solution:
scan scan1 scan2 scan3 scan4 scan5
1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain NA NA NA
2 II < 1 Hour II < 1 Hour NA NA NA NA
3 L-S Spine,L-S Spine L-S Spine L-S Spine NA NA NA
... but it is very slow. Cycling over tens or hundreds of thousands of observations takes a long time.
Thanks a lot for any advice.
source to share
Another way is to use tstrsplit
in version version data.table
library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
# V1 V2
# 1: CT Cervical Sp CT Head Plain
# 2: II < 1 Hour NA
# 3: L-S Spine L-S Spine
If you are sure that you will have 5 sections at least once, you can easily create these columns by reference
setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]
Alternatively, the package tidyr
offers similar functionality
library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
# scan scan1 scan2
# 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
# 2 II < 1 Hour II < 1 Hour <NA>
# 3 L-S Spine,L-S Spine L-S Spine L-S Spine
Or another option using only base R
cbind(scans, read.table(text= as.character(scans$scan),sep=",", fill=TRUE, na.strings=''))
source to share
You can use:
library(splitstackshape)
cSplit(scans, colnames(scans), sep=',')
# scan_1 scan_2
#1: CT Cervical Sp CT Head Plain
#2: II < 1 Hour NA
#3: L-S Spine L-S Spine
Beware of what the returned object is data.table
. You can convert to data.frame
. There are only two columns here because there is only at most one comma in the data. If you apply it to data with some cells with 4 commas, you get the desired result.
source to share
Use the awesome package stringi
- I urge everyone to find a faster solution.
# this does all the work
result <- as.data.frame(stringi::stri_split_fixed(scans$scan, ",", simplify = TRUE))
This will fill in as many columns as you have comma separators.
To get accurate results from the question, rename the columns and convert empty strings to NA
:
# rename the columns if you wish
names(result) <- paste0("scan", 1:ncol(result))
# replace "" with NA
result[result==""] <- NA
cbind(scans, result)
## scan scan1 scan2
## 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
## 2 II < 1 Hour II < 1 Hour <NA>
## 3 L-S Spine,L-S Spine L-S Spine L-S Spine
source to share