Substring identification based on complex rules
Suppose I have text strings that look something like this:
A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3
Here I want to identify the sequences of tokens (A is a token, I3 is a token, etc.) that lead up to a subsequence consisting only of IX tokens (i.e. I1, I2 or I3) that contains an I3. This subsequence can have a length of 1 (that is, be a single I3 token) or be of unlimited length, but it must always contain at least one I3 token and can only contain IX tokens. The sequence that leads up to the IX subsequence may include I1 and I2, but never I3.
In the string above, I need to identify:
A-B-C-I1-I2-D-E-F
which leads up to the subsequence I1-I3, which contains I3, and
D-D-D-D
which leads up to the subsequence I1-I1-I2-I1-I1-I3-I3, which contains at least one I3.
Here are some additional examples:
A-B-I3-C-I3
From this string we have to identify A-B, because it is followed by a subsequence of length 1 containing I3, and also C, because it is followed by a subsequence of length 1 containing I3.
and
I3-A-I3
Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 will not itself be identified, because we are only interested in sequences that are followed by a subsequence of IX tokens that contains I3.
How can I write a generic function / regex that does this task?
Use strsplit
> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C"
or
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
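A minimal sketch of how the split above could be wrapped into a reusable function. The name leading_sequences is hypothetical, and two things are assumptions rather than part of the answer: perl = TRUE is passed so the pattern is read as a PCRE, and a final piece that is not followed by an I3 run is dropped, since the question only asks for sequences that lead up to such a run.

```r
# Hypothetical helper (not from the original answer): split on every run of
# IX tokens that contains at least one I3 and return what precedes each run.
leading_sequences <- function(x) {
  # Delimiter: optional IX tokens, one mandatory I3, any further IX tokens,
  # plus the hyphens that surround the run.
  delim <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  parts <- strsplit(x, delim, perl = TRUE)[[1]]

  # Assumption: if the string does not end with an I3 run, the last piece
  # leads up to nothing and is discarded.
  if (!grepl(paste0("(?:", delim, ")$"), x, perl = TRUE)) {
    parts <- head(parts, -1)
  }

  # strsplit() returns a leading "" when the string starts with a delimiter
  # (as in "I3-A-I3"), so empty pieces are removed.
  parts[nzchar(parts)]
}

leading_sequences("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
leading_sequences("A-B-I3-C-I3")
# [1] "A-B" "C"
leading_sequences("I3-A-I3")
# [1] "A"
```

The empty-string filter matters for the third example from the question: without it, the result for "I3-A-I3" would start with an empty element.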
You can identify sequences containing I3 with the following regular expression:
(?:I\\d-?)*I3(?:-?I\\d)*
So, you can split the text with this regex to get the desired result.
See demo https://regex101.com/r/bJ3iA3/4
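As a rough sketch in R, splitting with this pattern could look like the following. Note that the pattern matches only the IX run itself and not the hyphens around it, so the pieces come back with stray hyphens at their edges. The trimming step and the variable names are my own additions, not part of the answer, and perl = TRUE is an assumption so the pattern is treated as a PCRE.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"

# Pattern from the answer: a run of IX tokens containing at least one I3.
run <- "(?:I\\d-?)*I3(?:-?I\\d)*"

# Splitting leaves the neighbouring hyphens attached to each piece,
# e.g. "A-B-C-I1-I2-D-E-F-" and "-D-D-D-D-".
pieces <- strsplit(x, run, perl = TRUE)[[1]]

# Added step (assumption): trim leading/trailing hyphens, drop empty pieces.
pieces <- gsub("^-+|-+$", "", pieces)
result <- pieces[nzchar(pieces)]

result
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```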
Try the following expression:
(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*
See the match groups:
https://regex101.com/r/yA6aV9/1
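One way this expression might be used from R is to pull out the first capture group of every match. This is only a sketch: the function name leading_groups is hypothetical, it relies on the capture.start and capture.length attributes that gregexpr() returns when perl = TRUE, and the hyphen trimming is an added step because the lazy group also captures the hyphens next to the IX run.

```r
# Hypothetical helper: return capture group 1 of every match of the pattern.
leading_groups <- function(x) {
  pattern <- "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*"
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]

  # With perl = TRUE, the positions of capture groups are stored as
  # attributes: one row per match, one column per group.
  start <- attr(m, "capture.start")[, 1]
  len <- attr(m, "capture.length")[, 1]
  groups <- substring(x, start, start + len - 1)

  # Added step (assumption): strip the hyphens captured at the edges and
  # drop empty groups, e.g. the one in front of a leading I3.
  groups <- gsub("^-+|-+$", "", groups)
  unname(groups[nzchar(groups)])
}

leading_groups("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
leading_groups("I3-A-I3")
# [1] "A"
```

Because each match has to end in an I3 run, text after the last run is never captured, so no extra check for a trailing piece is needed here.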