How to split a string between two characters into subgroups in R

Question

I have a list of codes in the second column of a table, and I want to extract the elements of each code and store them in new columns alongside it. Each code consists of letters, each followed by some digits. The letters P, F, I, R, C appear in the same order in every code, but the number of digits after each letter varies.

For example, consider the codes below, each followed by the pieces I want to extract:

P1F2I235R15C145   P1   F2   I235  R15   C145
P24F1I12R124C96   P24  F1   I12   R124  C96

This way I can split each code into its subcodes and store those components in new columns of the same table. Thanks.

split regex matrix r extract

ashkan 02 june 15 at 10:35

3 answers

David Arenburg · Answer 1 · 2015-06-02T10:58:08+0000

Here's a possible solution using stringi:

library(stringi)
x <- c("P1F2I235R15C145","P24F1I12R124C96")
# split at every position followed by a letter; stringi uses ICU regex, so no perl argument is needed
res <- stri_split_regex(x, "(?=[A-Za-z])", simplify = TRUE, omit_empty = TRUE)
cbind.data.frame(x, res)
#                 x   1  2    3    4    5
# 1 P1F2I235R15C145  P1 F2 I235  R15 C145
# 2 P24F1I12R124C96 P24 F1  I12 R124  C96

nicola · Answer 2 · 2015-06-02T10:50:13+0000

Try the following:

#simulate your data frame
df<-data.frame(code=c("P1F2I235R15C145","P24F1I12R124C96"),stringsAsFactors=FALSE)
#split the columns
cbind(df,do.call(rbind,regmatches(df$code,gregexpr("[PFIRC][0-9]+",df$code))))
#             code   1  2    3    4    5
#1 P1F2I235R15C145  P1 F2 I235  R15 C145
#2 P24F1I12R124C96 P24 F1  I12 R124  C96

@AnandaMatho suggested in a comment dropping the leading letter from each piece and naming the columns accordingly. Something like this:

res<-cbind(df,do.call(rbind,regmatches(df$code,gregexpr("(?<=[PFIRC])[0-9]+",df$code,perl=TRUE))))
names(res)<-c("Code","P","F","I","R","C")
#             Code  P F   I   R   C
#1 P1F2I235R15C145  1 2 235  15 145
#2 P24F1I12R124C96 24 1  12 124  96
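
If the pieces are wanted as numbers rather than strings, one possible variation (not part of the answer above) is base R's utils::strcapture, available from R 3.4.0. This is only a sketch: it assumes every code contains all five letters exactly once and in the order P, F, I, R, C, and the names codes, proto and parts are illustrative. A code that does not match the pattern would yield a row of NA values.

```r
# Illustrative input: the two example codes from the question
codes <- c("P1F2I235R15C145", "P24F1I12R124C96")

# Prototype giving the name and type of each captured group
proto <- data.frame(P = integer(), F = integer(), I = integer(),
                    R = integer(), C = integer())

# One capture group per letter; the digits are converted to integer
parts <- strcapture("^P([0-9]+)F([0-9]+)I([0-9]+)R([0-9]+)C([0-9]+)$",
                    codes, proto)

# Put the original code next to its components
res <- data.frame(code = codes, parts, stringsAsFactors = FALSE)
```

Here res$I is the integer vector 235, 12, so the new columns can be used in arithmetic or sorted numerically without a separate conversion step.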

MichaelChirico · Answer 3 · 2015-06-02T20:33:38+0000

A data.table solution:

library(data.table)
dt<-data.table(code=c("P1F2I235R15C145","P24F1I12R124C96"))
dt[,c("P","F","I","R","C"):=
     lapply(c("P","F","I","R","C"),
            function(x)regmatches(code,regexpr(paste0(x,"[0-9]+"),code)))]

> dt
              code   P  F    I    R    C
1: P1F2I235R15C145  P1 F2 I235  R15 C145
2: P24F1I12R124C96 P24 F1  I12 R124  C96

And if you decide to drop the letters from the front, a little tweak:

dt[,c("P","F","I","R","C"):=
     lapply(c("P","F","I","R","C"),
            function(x)regmatches(code,regexpr(paste0("(?<=",x,")[0-9]+"),
                                               code,perl=TRUE)))]
> dt
              code  P F   I   R   C
1: P1F2I235R15C145  1 2 235  15 145
2: P24F1I12R124C96 24 1  12 124  96

Or, using the development version of data.table (v1.9.5+):

dt[, c("P", "F", "I", "R", "C") := 
      tstrsplit(code, "(?<=.)(?=[[:alpha:]][0-9]+)", perl=TRUE)]
#               code   P  F    I    R    C
# 1: P1F2I235R15C145  P1 F2 I235  R15 C145
# 2: P24F1I12R124C96 P24 F1  I12 R124  C96
