How to split a string between two characters into subgroups in R
I have a list of codes in the second column of a table, and I want to extract the components of each code and store them in new columns alongside it. Each code consists of letters, each followed by some digits. The letters P, F, I, R, C appear in the same order in every code, but the number of digits after each letter varies.
For example, consider the codes below, each followed by the components I want to extract:
P1F2I235R15C145 P1 F2 I235 R15 C145
P24F1I12R124C96 P24 F1 I12 R124 C96
This way I can split each code into its subcodes and store those components in new columns of the same table. Thanks.
Here's a possible solution using stringi:
library(stringi)
x <- c("P1F2I235R15C145","P24F1I12R124C96")
# split before every letter; stringi uses ICU regex, so no perl argument is needed
res <- stri_split_regex(x, "(?=[A-Za-z])", simplify = TRUE, omit_empty = TRUE)
cbind.data.frame(x, res)
# x 1 2 3 4 5
# 1 P1F2I235R15C145 P1 F2 I235 R15 C145
# 2 P24F1I12R124C96 P24 F1 I12 R124 C96
Try the following:
#simulate your data frame
df<-data.frame(code=c("P1F2I235R15C145","P24F1I12R124C96"),stringsAsFactors=FALSE)
#split the columns
cbind(df,do.call(rbind,regmatches(df$code,gregexpr("[PFIRC][0-9]+",df$code))))
# code 1 2 3 4 5
#1 P1F2I235R15C145 P1 F2 I235 R15 C145
#2 P24F1I12R124C96 P24 F1 I12 R124 C96
What @AnandaMatho suggested in the comments was to drop the leading letter from each subcode and name the columns accordingly. Something like this:
res<-cbind(df,do.call(rbind,regmatches(df$code,gregexpr("(?<=[PFIRC])[0-9]+",df$code,perl=TRUE))))
names(res)<-c("Code","P","F","I","R","C")
# Code P F I R C
#1 P1F2I235R15C145 1 2 235 15 145
#2 P24F1I12R124C96 24 1 12 124 96
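If the extracted parts are needed as numbers rather than strings, one option is a single anchored pattern with one capture group per letter. The following is a minimal base R sketch, not part of the answer above; the names `pattern`, `m`, and `parts` are illustrative, and it assumes every code follows the P/F/I/R/C layout from the question.

```r
# Example data mirroring the codes in the question
df <- data.frame(code = c("P1F2I235R15C145", "P24F1I12R124C96"),
                 stringsAsFactors = FALSE)

# One anchored pattern with a capture group per letter (assumed layout)
pattern <- "^P([0-9]+)F([0-9]+)I([0-9]+)R([0-9]+)C([0-9]+)$"

# Stop early if any code does not follow the expected layout
stopifnot(all(grepl(pattern, df$code)))

# regexec() yields the full match plus the five groups for each code
m <- regmatches(df$code, regexec(pattern, df$code))
parts <- do.call(rbind, m)[, -1, drop = FALSE]  # drop the full-match column

# Convert the character matrix to integer and label the columns
storage.mode(parts) <- "integer"
colnames(parts) <- c("P", "F", "I", "R", "C")

res <- cbind(df, parts)
res
```

The `grepl()` check matters because `do.call(rbind, ...)` would otherwise recycle or misalign values without complaint when a code has a different number of parts.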
A data.table solution:
library(data.table)
dt<-data.table(code=c("P1F2I235R15C145","P24F1I12R124C96"))
dt[,c("P","F","I","R","C"):=
lapply(c("P","F","I","R","C"),
function(x)regmatches(code,regexpr(paste0(x,"[0-9]+"),code)))]
> dt
code P F I R C
1: P1F2I235R15C145 P1 F2 I235 R15 C145
2: P24F1I12R124C96 P24 F1 I12 R124 C96
And if you decide to drop the letters from the front, a little tweak:
dt[,c("P","F","I","R","C"):=
lapply(c("P","F","I","R","C"),
function(x)regmatches(code,regexpr(paste0("(?<=",x,")[0-9]+"),
code,perl=TRUE)))]
> dt
code P F I R C
1: P1F2I235R15C145 1 2 235 15 145
2: P24F1I12R124C96 24 1 12 124 96
Or, using the devel version of data.table (v1.9.5+):
dt[, c("P", "F", "I", "R", "C") :=
tstrsplit(code, "(?<=.)(?=[[:alpha:]][0-9]+)", perl=TRUE)]
# code P F I R C
# 1: P1F2I235R15C145 P1 F2 I235 R15 C145
# 2: P24F1I12R124C96 P24 F1 I12 R124 C96
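To see what the split pattern does on its own, here is a small base R sketch that calls `strsplit()` directly (the vector `x` and the name `parts` are illustrative). The `(?=[[:alpha:]][0-9]+)` lookahead places a zero-width split point before each letter-plus-digits group, and the `(?<=.)` lookbehind keeps the pattern from matching at the very start of the string.

```r
x <- c("P1F2I235R15C145", "P24F1I12R124C96")

# Split at the zero-width positions just before each letter+digits group
parts <- strsplit(x, "(?<=.)(?=[[:alpha:]][0-9]+)", perl = TRUE)

# Each list element holds the five subcodes of one code
parts[[1]]

# Stack the pieces into a character matrix, one row per code
do.call(rbind, parts)
```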