R: extract a list of matching parts of a string through a regular expression

Let's say that I need to extract different parts from a string as a list. For example, I would like to split the string "aaa12xxx" into three parts.

One possibility is to make three calls to gsub:

parts = c()
parts[1] = gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\1', "aaa12xxx")
parts[2] = gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\2', "aaa12xxx")
parts[3] = gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\3', "aaa12xxx")

      

Of course, this seems like a waste of time (even if it's done inside a for loop). Isn't there a function that just returns a list of parts from a regex and a test string?



2 answers


Just split the input string with strsplit and pick out the parts you want.

> x <- "aaa12xxx"
> strsplit(x,"(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])", perl=TRUE)
[[1]]
[1] "aaa" "12"  "xxx"

      

Get the individual parts by indexing the result.



> m <- unlist(strsplit(x,"(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])", perl=TRUE))
> m[1]
[1] "aaa"
> m[2]
[1] "12"
> m[3]
[1] "xxx"

      

  • (?<=[[:alpha:]])(?=\\d)

    Matches every boundary that is preceded by a letter and followed by a digit.

  • |

    OR

  • (?<=\\d)(?=[[:alpha:]])

    Matches every boundary that is preceded by a digit and followed by a letter.

  • Splitting your input at the matched boundaries gives you the output you want.
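
As a minimal sketch, the boundary split can be wrapped in a small helper so it applies to a whole vector of strings. The name split_alnum is hypothetical and not part of the original answer; the second test string is only an illustration of an input with more than two boundaries.

```r
# Hypothetical helper: split each string at every letter/digit boundary.
split_alnum <- function(x) {
  # Zero-width match: a letter behind and a digit ahead, or the reverse.
  boundary <- "(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])"
  # Lookarounds need the PCRE engine, hence perl = TRUE.
  strsplit(x, boundary, perl = TRUE)
}

parts <- split_alnum(c("aaa12xxx", "ab1cd22"))
parts[[1]]  # "aaa" "12" "xxx"
parts[[2]]  # "ab" "1" "cd" "22"
```

strsplit returns a list with one character vector per input string, so no unlist call is needed when several strings are processed at once.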



(\\d+)|([a-zA-Z]+)

      

or

([[:alpha:]]+)|([0-9]+)

      



You can just grab the captures. Use str_match_all() from library(stringr). See demo.

https://regex101.com/r/fA6wE2/8
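
A rough sketch of how the matches could be collected in R, assuming the input "aaa12xxx" from the question. The base R calls (gregexpr, regexec, regmatches) are alternatives that the answer does not mention; the stringr part only runs if that package is installed.

```r
x <- "aaa12xxx"

# Base R: gregexpr() locates every match of the alternation,
# regmatches() extracts the matched substrings.
parts_all <- regmatches(x, gregexpr("[[:alpha:]]+|[0-9]+", x))[[1]]

# Base R with the question's three capture groups: regexec() records the
# groups; the first element is the full match, so it is dropped.
pattern <- "([[:alpha:]]+)([0-9]+)([[:alpha:]]+)"
parts_groups <- regmatches(x, regexec(pattern, x))[[1]][-1]

# stringr: str_match_all() returns one matrix per input string.
if (requireNamespace("stringr", quietly = TRUE)) {
  m <- stringr::str_match_all(x, "(\\d+)|([a-zA-Z]+)")[[1]]
  parts_stringr <- m[, 1]  # column 1 holds each full match
}

parts_all     # "aaa" "12" "xxx"
parts_groups  # "aaa" "12" "xxx"
```

The alternation pattern works for any number of letter and digit runs, while the three-group pattern only fits strings of the exact letters-digits-letters shape.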
