grepl: matching repeated underscore-plus (_+) sequences
I am working on a data frame (df) that looks like this and can exceed 10,000 rows in some cases.
  Object  Coding                            Fn Remaining
1 T00055  T 00055_005_<002_+                 2        30
2 T00055  T 00055_008_<002_+                 2        30
3 E00336  E 00336_041_<001_+001_+            3         0
4 E00336  E 00336_041_<001_+001_+001_+       4        10
5 E00336  E 00336_041_<001_+001_+002_+       4        56
6 E00336  E 00336_041_<001_+001_+002_<       4        52
7 T 00054 T 00054_013_<003_<015_+003_<001_<  4        52
I need to select all rows in which the column Coding contains _+ at least twice, and store them in a data frame test.
I'm trying:
test <- filter(df, grepl("_[+].{2,}", Coding))
which fails to exclude the last row. Any idea why? Many thanks.
Here are the results:
  Object  Coding                            Fn Remaining
1 E00336  E 00336_041_<001_+001_+            3         0
2 E00336  E 00336_041_<001_+001_+001_+       4        10
3 E00336  E 00336_041_<001_+001_+002_+       4        56
4 E00336  E 00336_041_<001_+001_+002_<       4        52
5 T 00054 T 00054_013_<003_<015_+003_<001_<  4        52
Using rex can make this type of task a little easier.
df <- structure(list(Object = c("T00055", "T00055", "E00336", "E00336",
"E00336", "E00336", "T 00054"), Coding = c("T 00055_005_<002_+",
"T 00055_008_<002_+", "E 00336_041_<001_+001_+", "E 00336_041_<001_+001_+001_+",
"E 00336_041_<001_+001_+002_+", "E 00336_041_<001_+001_+002_<",
"T 00054_013_<003_<015_+003_<001_<"), Fn = c(2L, 2L, 3L, 4L,
4L, 4L, 4L), Remaining = c(30L, 30L, 0L, 10L, 56L, 52L, 52L)), .Names = c("Object",
"Coding", "Fn", "Remaining"), row.names = c(NA, -7L), class = "data.frame")
library(rex)
subset(df, grepl(rex(at_least(group("_+", anything), 2)), Coding))
#>   Object Coding                       Fn Remaining
#> 3 E00336 E 00336_041_<001_+001_+       3         0
#> 4 E00336 E 00336_041_<001_+001_+001_+  4        10
#> 5 E00336 E 00336_041_<001_+001_+002_+  4        56
#> 6 E00336 E 00336_041_<001_+001_+002_<  4        52
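If adding a package is not an option, the same "at least twice" condition can probably be checked in base R by counting literal occurrences of _+ instead of writing a regular expression. This is a different technique from the rex call above, shown only as a minimal sketch: count_literal is a hypothetical helper name, and the coding vector simply repeats the Coding values from the question.

```r
# Sample values copied from the Coding column in the question.
coding <- c("T 00055_005_<002_+",
            "T 00055_008_<002_+",
            "E 00336_041_<001_+001_+",
            "E 00336_041_<001_+001_+001_+",
            "E 00336_041_<001_+001_+002_+",
            "E 00336_041_<001_+001_+002_<",
            "T 00054_013_<003_<015_+003_<001_<")

# Hypothetical helper (not part of any package): counts how often a
# literal string occurs in each element of x.
# fixed = TRUE makes gregexpr() treat "_+" literally, so "+" needs no escaping.
count_literal <- function(x, literal) {
  lengths(regmatches(x, gregexpr(literal, x, fixed = TRUE)))
}

n_plus <- count_literal(coding, "_+")

# Keep only the elements in which "_+" occurs two or more times.
keep <- n_plus >= 2
coding[keep]
```

With the data frame from the question, the same logical vector could be used as df[count_literal(df$Coding, "_+") >= 2, ]. Counting also makes it easy to change the threshold later without touching a pattern.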
You can use this command:
subset(df, grepl("(_\\+.*){2,}", Coding))
or, with dplyr:
filter(df, grepl("(_\\+.*){2,}", Coding))
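Your current regex, "_[+].{2,}", matches _+ followed by at least two characters of any kind. The quantifier {2,} applies only to the . immediately before it, which is why the last row still matches: its single _+ is followed by more than two characters. You need to create a group using parentheses so that the quantifier applies to the whole _+ sequence rather than to the dot alone.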
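The difference in quantifier scope can be checked directly on the sample values. The sketch below assumes only base R; the vector coding repeats the Coding column from the question, and the third pattern is an extra spelling added here for comparison, not something taken from the answers.

```r
# Sample values copied from the Coding column in the question.
coding <- c("T 00055_005_<002_+",
            "T 00055_008_<002_+",
            "E 00336_041_<001_+001_+",
            "E 00336_041_<001_+001_+001_+",
            "E 00336_041_<001_+001_+002_+",
            "E 00336_041_<001_+001_+002_<",
            "T 00054_013_<003_<015_+003_<001_<")

# Original attempt: {2,} binds only to the "." before it, so the pattern
# means "one _+ followed by two or more arbitrary characters".
original <- grepl("_[+].{2,}", coding)

# Grouped version: {2,} repeats the whole "(_+ then anything)" unit,
# so _+ itself has to occur at least twice.
grouped <- grepl("(_\\+.*){2,}", coding)

# Assumed-equivalent explicit spelling: a _+, anything, then another _+.
explicit <- grepl("_\\+.*_\\+", coding)

# Side-by-side comparison of the three patterns.
data.frame(coding, original, grouped, explicit)
```

Only the last element differs between the columns: original is TRUE there because one _+ is followed by further text, while grouped and explicit are FALSE because _+ occurs just once.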