Select everything but a specific regex from a list in R

In each element of a character vector, I want to remove everything that does NOT match a given pattern. I am using R version 3.1.3 (2015-03-09) - "Smooth Sidewalk"

The list of examples I have:

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

      

and the pattern I want to match is CCNA_01234, where the digits differ from element to element, but there are always 5 of them.

Desired result:

"CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

So far I have removed the part that precedes the match:

y_begin_rm <- sub("D ", "", y)

      

but I have problems recognizing the match with the expression [^ match].

y_CCNA_numbers <- sub("[^CCNA_[0-9][0-9][0-9][0-9][0-9]]*$", "", y_begin_rm)

      

which outputs the result:

[1] "CCNA_01234 This is example 1 bis" "CCNA_02345 This is example 2"
[3] "CCNA_12345 This is example 3" "CCNA_23468 This is example 4"

      

The problem seems to be that the digits in the pattern also appear elsewhere in the string, not only in the exact combination I want, so the number after "This is example" causes a lot of problems. When I omit the digits, or use a digit that only occurs right after CCNA_, it works just fine:

y_CCNA <- sub("[^CCNA_]*$", "", y_begin_rm)

      

gives

[1] "CCNA_" "CCNA_" "CCNA_" "CCNA_"

      

or

y_CCNA_0 <- sub("[^CCNA_0]*$", "", y_begin_rm[1])

      

leads to

[1] "CCNA_0"

      

Is there a way to specify the exact pattern I'm looking for (CCNA_ [0-9] [0-9] [0-9] [0-9] [0-9])? Also, is there a possible way to do this in one step (remove before and after match in one regex)?

Thanks in advance!



3 answers


Here are some ways:

1) strapplyc. This uses a particularly simple pattern. It relies on strapplyc in the gsubfn package:

library(gsubfn)
strapplyc(y, "CCNA_\\d{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

Here's the regex itself, without R's string escaping:

CCNA_\d{5}

1a) If every occurrence of CCNA_ is followed by 5 digits, we can simplify the previous solution a little as follows:



strapplyc(y, "CCNA_.{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

2) sub. The pattern is a little more complicated here, but with sub we can do it without any add-on packages:

sub(".*(CCNA_\\d{5}).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
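
One thing to keep in mind with the sub approach: when an element contains no match at all, sub returns that element unchanged instead of dropping it. The sketch below shows this and one possible guard using grepl. The vector y2 and the names pat, plain and guarded are made up for illustration and are not part of the question.

```r
# Hypothetical input: the second element contains no CCNA_ ID.
y2 <- c("D CCNA_01234 This is example 1 bis", "no ID in this one")
pat <- "CCNA_\\d{5}"

# sub() leaves an element untouched when the pattern does not match.
plain <- sub(paste0(".*(", pat, ").*"), "\\1", y2)
plain
## [1] "CCNA_01234"        "no ID in this one"

# Guarded version: NA where there is no match.
guarded <- ifelse(grepl(pat, y2), plain, NA_character_)
guarded
## [1] "CCNA_01234" NA
```

Whether NA is the right placeholder depends on what you do with the result afterwards; it is only one reasonable choice.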

      

3) strsplit. If the desired part is always the second "word" (which is the case in the question), then this will work and again does not require packages:

sapply(strsplit(y, " "), "[", 2)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

4) substr. If the desired part always occupies characters 3 through 12, as in the question, then we could use substr or substring, again without any packages:

substr(y, 3, 12)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
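
As for why the attempts in the question behave the way they do: [^...] is a negated character class, that is, a set of single characters to avoid, not a negated sequence. So "[^CCNA_]*$" removes the longest run at the end of the string that contains none of C, N, A or _. As far as I can tell, the longer attempt is parsed as the class [^CCNA_[0-9] followed by four digits and optional literal ] characters, which never matches at the end of these strings, so nothing is replaced. A small sketch (x and the variable names are only for illustration):

```r
x <- "CCNA_01234 This is example 1 bis"

# A bracket expression is a set of characters: order and repeats are irrelevant.
a <- sub("[^CCNA_]*$", "", x)
b <- sub("[^_ANC]*$", "", x)
a
## [1] "CCNA_"
b
## [1] "CCNA_"

# The run stops at any listed character, wherever it occurs in the string.
stopped <- sub("[^CCNA_]*$", "", "CCNA_01234 Another example")
stopped
## [1] "CCNA_01234 A"
```

This is why extracting the wanted pattern is more robust here than trying to negate it.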

      



With base R, you can do this in one step, straight from your original vector y:

sub(".*(CCNA_\\d+).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

Another option is to use stringi

library(stringi)
stri_extract_first_regex(y, "CCNA_\\d+")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

If you have more than one CCNA pattern in each element, use stri_extract_all_regex instead.

If you want to match exactly 5 digits after CCNA_, you can also do:

stri_extract_first_regex(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
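
The difference between \d+ and \d{5} only shows up when an ID has more than five digits. Here is a minimal base R sketch of that difference, using regmatches and regexpr rather than stringi so that it runs without packages. The input z is invented, and the last line assumes you would want to reject longer IDs altogether, which the question does not say.

```r
z <- "D CCNA_0123456 has seven digits"   # hypothetical input

# \d+ takes the whole run of digits.
all_digits <- regmatches(z, regexpr("CCNA_\\d+", z))
all_digits
## [1] "CCNA_0123456"

# \d{5} takes only the first five digits of the run.
five_digits <- regmatches(z, regexpr("CCNA_\\d{5}", z))
five_digits
## [1] "CCNA_01234"

# Assumption: IDs longer than five digits should not match at all.
# The negative lookahead (?!\d) requires perl = TRUE.
strict <- regmatches(z, regexpr("CCNA_\\d{5}(?!\\d)", z, perl = TRUE))
strict
## character(0)
```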

      

And, of course, similarly with stringr:

library(stringr)
str_extract(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      



Here's an approach using a package I maintain, qdapRegex (I prefer this or stringi / stringr for consistency and ease of use). I also show a base R approach. In any case, I would treat this as an extraction problem rather than a "substitute everything but" problem.

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", 
    "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

library(qdapRegex)
unlist(rm_default(y, pattern = "CCNA_\\d{5}", extract = TRUE))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

      

In base R:

unlist(regmatches(y, gregexpr("CCNA_\\d{5}", y)))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
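
Note that gregexpr returns every match per element, and unlist flattens them, so the result lines up with y only when each element has exactly one match. The sketch below, with a made-up vector y3, shows how the counts can differ and one possible way to keep a single value per element. Taking the first match and using NA for no match are assumptions on my part.

```r
# Hypothetical input: one match, no match, two matches.
y3 <- c("D CCNA_01234 This is example 1 bis",
        "no ID in this one",
        "D CCNA_12345 and CCNA_23468")

m <- regmatches(y3, gregexpr("CCNA_\\d{5}", y3))

# Number of matches per element.
counts <- sapply(m, length)
counts
## [1] 1 0 2

# Flattening loses the alignment with y3.
flat <- unlist(m)
flat
## [1] "CCNA_01234" "CCNA_12345" "CCNA_23468"

# One value per element: the first match, or NA when there is none.
first <- sapply(m, function(x) if (length(x)) x[1] else NA_character_)
first
## [1] "CCNA_01234" NA           "CCNA_12345"
```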

      
