Selecting a word immediately after a keyword
I'm trying to extract the word that immediately follows a keyword using R. I don't have much experience with regex, so nothing I've found so far has helped me much. If the function could return multiple instances, that would be perfect.
For example, if my keyword was "the" and my string was "The yellow log is in the stream", it would return "yellow" and "stream".
I found this solution for C#, and it looks like what I want, but I'm having trouble implementing it in R.
You may try
library(stringr)
# perl() is deprecated in stringr (since 1.0.0); a plain string is already treated as a regex
str_extract_all(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
EDIT: Modified based on @Roland's suggestion in the comments.
data
str1 <- 'The yellow log is in the stream'
Assign key whatever string you want and use:
key <- 'the'
p <- "The yellow log is in the stream"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
Or, as @Roland points out, it is safer to put a word boundary before your keyword. Without it, the tail of a longer word such as "absinthe" also counts as the keyword:
key <- 'the'
p <- "The yellow log is in the stream drinking absinthe and beer"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream" "and"
regmatches(p, gregexpr(sprintf('(?i)(?<=\\b%s )\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
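If you need this more than once, the word-boundary version can be wrapped in a small helper. This is only a sketch: the name words_after is made up for illustration, and it assumes the keyword is a plain word (letters, digits or underscores). A keyword containing regex metacharacters would need escaping first.

```r
# Hypothetical helper (not from any package): return every word that
# directly follows `key` in each element of `x`, ignoring case.
words_after <- function(x, key) {
  # \\b stops "the" from matching the tail of "absinthe";
  # (?<=...) is a lookbehind, so the keyword itself is not returned.
  pattern <- sprintf("(?i)(?<=\\b%s\\s)\\w+", key)
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

p <- "The yellow log is in the stream drinking absinthe and beer"
words_after(p, "the")[[1]]
# [1] "yellow" "stream"
```

Because gregexpr and regmatches are vectorised over the input, the helper returns a list with one character vector per string; a string with no match yields an empty vector.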
The qdapRegex package that I maintain has a regular expression, after_, in the dictionary regex_supplement, which is perfect for this. You can use rm_ to create your own function after_the:
library(qdapRegex)
x <- "The yellow log is in the stream"
after_the <- rm_(pattern = S("@after_", "[Tt]he"), extract = TRUE)
after_the(x)
## [[1]]
## [1] "yellow" "stream"
The function S is a wrapper around sprintf that lets you easily pass elements (e.g., words in this case) into the underlying regex, creating:
S("@after_", "the", "The")
## [1] "(?<=\\b(the|The)\\s)(\\w+)"
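For readers without qdapRegex, the same pattern string can be assembled in base R. This is a sketch of the idea only, not how S is implemented; the template is copied from the output above, and build_after is a name made up here.

```r
# Hypothetical base-R stand-in for S("@after_", ...): join the keyword
# variants with "|" and drop them into the lookbehind template.
build_after <- function(...) {
  sprintf("(?<=\\b(%s)\\s)(\\w+)", paste(c(...), collapse = "|"))
}

pattern <- build_after("the", "The")
x <- "The yellow log is in the stream"
# perl = TRUE is needed because base R's default regex engine has no lookbehind
regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
# [1] "yellow" "stream"
```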
EDIT: For several keywords and several strings at once:
library(qdapRegex)
x <- c("The yellow log is in the stream", "I like the one box for a pack")
after_ <- rm_(extract = TRUE)
words <- c("the", "a", "one")
setNames(lapply(words, function(y){
after_(x, pattern = S("@after_", y, TC(y)))
}), words)
## $the
## $the[[1]]
## [1] "yellow" "stream"
##
## $the[[2]]
## [1] "one"
##
##
## $a
## $a[[1]]
## [1] NA
##
## $a[[2]]
## [1] "pack"
##
##
## $one
## $one[[1]]
## [1] NA
##
## $one[[2]]
## [1] "box"