How to split a string between two characters into subgroups in R

I have a list of codes in the second column of a table, and I want to extract the components of each code and store them in new columns next to it. Each code consists of letters, each followed by some digits. The letters P, F, I, R, C appear in the same order in every code, but the number of digits after each letter varies.

For example, given the codes on the left, I want the columns on the right:

P1F2I235R15C145   P1   F2   I235  R15   C145
P24F1I12R124C96   P24  F1   I12   R124  C96

      

This way I can split each code into its subcodes and store those components in new columns of the same table. Thanks.

3 answers


Here's a possible solution using stringi:



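library(stringi)
x <- c("P1F2I235R15C145", "P24F1I12R124C96")
# split at the zero-width position before each letter (stringi uses ICU regex, so no perl argument is needed);
# omit_empty drops the empty piece before the first letter
res <- stri_split_regex(x, "(?=[A-Za-z])", simplify = TRUE, omit_empty = TRUE)
cbind.data.frame(x, res)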
#                 x   1  2    3    4    5
# 1 P1F2I235R15C145  P1 F2 I235  R15 C145
# 2 P24F1I12R124C96 P24 F1  I12 R124  C96

      



Try the following:

# simulate your data frame
df <- data.frame(code = c("P1F2I235R15C145", "P24F1I12R124C96"), stringsAsFactors = FALSE)
# extract every letter-plus-digits piece and bind the pieces as new columns
cbind(df, do.call(rbind, regmatches(df$code, gregexpr("[PFIRC][0-9]+", df$code))))
#             code   1  2    3    4    5
#1 P1F2I235R15C145  P1 F2 I235  R15 C145
#2 P24F1I12R124C96 P24 F1  I12 R124  C96

      



What @AnandaMatho suggested in the comments was to drop the leading letter from each subcode and name the columns after the letters instead. Something like this:

# the lookbehind (?<=[PFIRC]) matches only the digits that follow one of the letters
res <- cbind(df, do.call(rbind, regmatches(df$code, gregexpr("(?<=[PFIRC])[0-9]+", df$code, perl = TRUE))))
names(res) <- c("Code", "P", "F", "I", "R", "C")
res
#             Code  P F   I   R   C
#1 P1F2I235R15C145  1 2 235  15 145
#2 P24F1I12R124C96 24 1  12 124  96
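Since the letters always appear in the same order, another option is a single anchored pattern with one capture group per letter. This is only a sketch in base R, not part of the answer above: the names `pattern`, `rows`, `parts` and `res_num` are made up for illustration, and it assumes every code has exactly the P, F, I, R, C layout from the question. Codes that don't fit get a row of NA instead of shifting the other rows, and the extracted parts come out as integers rather than strings.

```r
codes <- c("P1F2I235R15C145", "P24F1I12R124C96")

# one capture group per letter, anchored to the whole string
pattern <- "^P([0-9]+)F([0-9]+)I([0-9]+)R([0-9]+)C([0-9]+)$"

# regexec() + regmatches() return, per code, the full match followed by the 5 groups
matches <- regmatches(codes, regexec(pattern, codes))

# keep the 5 groups as integers; use NA for codes that do not match the pattern
rows <- lapply(matches, function(m) {
  if (length(m) == 6L) as.integer(m[-1]) else rep(NA_integer_, 5L)
})

parts <- do.call(rbind, rows)
colnames(parts) <- c("P", "F", "I", "R", "C")

res_num <- data.frame(Code = codes, parts, stringsAsFactors = FALSE)
res_num
```

If you need the values as text (for example to preserve leading zeros), drop the `as.integer()` call and use `NA_character_` for the fallback row.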

      



A data.table solution:

library(data.table)
dt<-data.table(code=c("P1F2I235R15C145","P24F1I12R124C96"))
dt[,c("P","F","I","R","C"):=
     lapply(c("P","F","I","R","C"),
            function(x)regmatches(code,regexpr(paste0(x,"[0-9]+"),code)))]

> dt
              code   P  F    I    R    C
1: P1F2I235R15C145  P1 F2 I235  R15 C145
2: P24F1I12R124C96 P24 F1  I12 R124  C96

      

And if you decide to drop the letters from the front, a little tweak:

dt[,c("P","F","I","R","C"):=
     lapply(c("P","F","I","R","C"),
            function(x)regmatches(code,regexpr(paste0("(?<=",x,")[0-9]+"),
                                               code,perl=TRUE)))]
> dt
              code  P F   I   R   C
1: P1F2I235R15C145  1 2 235  15 145
2: P24F1I12R124C96 24 1  12 124  96
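One thing to watch for with the regmatches() + regexpr() combination: regmatches() returns only the elements that matched, so a code that is missing one of the letters yields a shorter vector for that column. A possible way around it is to fill in NA first, sketched here in base R. The helper name `extract_number` and the malformed second code are invented for this example; they are not from the question's data.

```r
# the 2nd code has no C part (made-up example of a malformed code)
codes <- c("P1F2I235R15C145", "P24F1I12R124", "P24F1I12R124C96")

# hypothetical helper: digits following `letter`, or NA where the letter is absent
extract_number <- function(letter, x) {
  out <- rep(NA_character_, length(x))
  m <- regexpr(paste0("(?<=", letter, ")[0-9]+"), x, perl = TRUE)
  hit <- as.vector(m) != -1L        # regexpr() gives -1 where there is no match
  out[hit] <- regmatches(x, m)      # regmatches() returns the matched elements only
  out
}

# character matrix with one column per letter and one row per code
parts <- sapply(c("P", "F", "I", "R", "C"), extract_number, x = codes)
parts
```

The same helper could be used inside the `lapply()` call of the data.table version, so that every new column has one entry per row.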

      


Or, using tstrsplit from the development version of data.table (v1.9.5+):

dt[, c("P", "F", "I", "R", "C") := 
      tstrsplit(code, "(?<=.)(?=[[:alpha:]][0-9]+)", perl=TRUE)]
#               code   P  F    I    R    C
# 1: P1F2I235R15C145  P1 F2 I235  R15 C145
# 2: P24F1I12R124C96 P24 F1  I12 R124  C96

      
