Effective replacement for-loop when splitting strings in R

I have a large framework (20 columns,> 100k rows) and need to split a column of character strings into multiple new columns.

The first 3 observations in this column look something like this:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour", 
                 "L-S Spine,L-S Spine"))

      

which looks like this:

                          scan
1 CT Cervical Sp,CT Head Plain
2                  II < 1 Hour
3          L-S Spine,L-S Spine

      

I need to split this into 5 columns (each observation has a maximum of 5 substrings), and for observations with fewer substrings, I want the remaining columns to fill with NA. I am currently using this code:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))

for(i in 1:nrow(scans)){
  scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
  scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
  scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
  scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
  scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}

      

which works and outputs my desired solution:

                          scan          scan1         scan2 scan3 scan4 scan5
1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain    NA    NA    NA
2                  II < 1 Hour    II < 1 Hour            NA    NA    NA    NA
3          L-S Spine,L-S Spine      L-S Spine     L-S Spine    NA    NA    NA

      

... but it is very slow. Cycling over tens or hundreds of thousands of observations takes a long time.

Thanks a lot for any advice.

+3


source to share


3 answers


Another way is to use tstrsplit

in version version data.table

library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
#                V1            V2
# 1: CT Cervical Sp CT Head Plain
# 2:    II < 1 Hour            NA
# 3:      L-S Spine     L-S Spine 

      

If you are sure that you will have 5 sections at least once, you can easily create these columns by reference

setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]

      




Alternatively, the package tidyr

offers similar functionality

library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
#                           scan          scan1         scan2
# 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
# 2                  II < 1 Hour    II < 1 Hour          <NA>
# 3          L-S Spine,L-S Spine      L-S Spine     L-S Spine

      




Or another option using only base R

 cbind(scans, read.table(text= as.character(scans$scan),sep=",", fill=TRUE, na.strings=''))

      

+4


source


You can use:

library(splitstackshape)
cSplit(scans, colnames(scans), sep=',')

#           scan_1        scan_2
#1: CT Cervical Sp CT Head Plain
#2:    II < 1 Hour            NA
#3:      L-S Spine     L-S Spine

      



Beware of what the returned object is data.table

. You can convert to data.frame

. There are only two columns here because there is only at most one comma in the data. If you apply it to data with some cells with 4 commas, you get the desired result.

+3


source


Use the awesome package stringi

- I urge everyone to find a faster solution.

# this does all the work
result <- as.data.frame(stringi::stri_split_fixed(scans$scan, ",", simplify = TRUE))

      

This will fill in as many columns as you have comma separators.

To get accurate results from the question, rename the columns and convert empty strings to NA

:

# rename the columns if you wish
names(result) <- paste0("scan", 1:ncol(result))
# replace "" with NA
result[result==""] <- NA

cbind(scans, result)
##                           scan          scan1         scan2
## 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
## 2                  II < 1 Hour    II < 1 Hour          <NA>
## 3          L-S Spine,L-S Spine      L-S Spine     L-S Spine

      

+2


source







All Articles