Substring identification based on complex rules

Suppose I have text strings that look something like this:

A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3

      

Here I want to identify the sequences of tokens (A is a token, I3 is a token, etc.) that lead up to a subsequence consisting only of IX tokens (i.e. I1, I2 or I3) that contains an I3. This subsequence can have a length of 1 (that is, be a single I3 token) or be of unlimited length, but it must always contain at least one I3 token and may only contain IX tokens. The sequence that leads up to the IX subsequence can include I1 and I2, but never I3.

In the string above, I need to identify:

A-B-C-I1-I2-D-E-F

which leads up to the subsequence I1-I3, which contains an I3, and

D-D-D-D

which leads up to the subsequence I1-I1-I2-I1-I1-I3-I3, which contains at least one I3.

Here are some additional examples:

A-B-I3-C-I3

      

From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it is followed by a subsequence of length 1 that contains I3.

And:

I3-A-I3

Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 will not be identified by itself, because we are only interested in sequences that are followed by a subsequence of IX tokens that contains I3.

How can I write a generic function / regex that does this task?



3 answers


Use strsplit

> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C" 

      



or

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
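
Since the question asks for a generic function, the split can be wrapped up along these lines. This is only a sketch: the name leading_sequences is made up for illustration, perl = TRUE is an assumption added here so that (?:...), \d and \b are handled by PCRE, and empty strings are dropped because strsplit returns a leading "" when the string starts with an I3 run.

```r
# Hypothetical helper (not part of the answer): return the pieces of `x`
# that are separated by runs of IX tokens containing at least one I3.
leading_sequences <- function(x) {
  # Pattern from the second strsplit call above: optional IX tokens, an I3,
  # then any further IX tokens, including the hyphens around the run.
  pattern <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
  # A match at the very start of the string yields "", so drop empties.
  pieces[nzchar(pieces)]
}

leading_sequences("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
leading_sequences("I3-A-I3")
# [1] "A"
```

One caveat of splitting: a trailing piece that is not followed by an I3 run is returned as well, so "A-I3-B" gives "A" and "B". If such input can occur, the last piece would need a separate check.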

      



You can identify the sequences containing I3 with the following regular expression:

(?:I\\d-?)*I3(?:-?I\\d)*

      



So you can split the text with this regex to get the desired result.

See demo https://regex101.com/r/bJ3iA3/4
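
A short sketch of that split in R, for illustration. Because this pattern does not consume the hyphens on either side of the IX run, the pieces come back with stray separators, so the example trims them afterwards. The use of perl = TRUE and the trimming step are assumptions added here, not part of the answer.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"

# Split on the runs of IX tokens that contain an I3 (perl = TRUE is an
# assumption made here so that (?:...) and \d are interpreted by PCRE).
pieces <- strsplit(x, "(?:I\\d-?)*I3(?:-?I\\d)*", perl = TRUE)[[1]]
# pieces is "A-B-C-I1-I2-D-E-F-" "-D-D-D-D-"

# Strip the leftover separators and drop any empty pieces.
trimmed <- gsub("^-+|-+$", "", pieces)
trimmed <- trimmed[nzchar(trimmed)]
# trimmed is "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```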



Try the following expression: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])*

See the capture groups in this demo: https://regex101.com/r/yA6aV9/1
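
One possible way to pull the first capture group out of each match in base R is sketched below. It relies on the capture.start and capture.length attributes that gregexpr attaches when perl = TRUE is set. The variable names and the final hyphen trimming are additions made for illustration, since the lazy group also picks up the separators next to the IX run.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
pattern <- "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*"

# With perl = TRUE, gregexpr() records where each capture group matched.
m <- gregexpr(pattern, x, perl = TRUE)[[1]]
start <- attr(m, "capture.start")[, 1]
len <- attr(m, "capture.length")[, 1]

# Cut out group 1 for every match, then strip the hyphens it picks up
# and drop empty captures (e.g. when the string starts with I3).
captured <- substring(x, start, start + len - 1)
result <- gsub("^-+|-+$", "", captured)
result <- result[nzchar(result)]
```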
