Extract string between words using boolean operators in rm_between function
I am trying to extract the text between two words. Consider this example:
x <- "There are 2.3 million species in the world"
It can also take a different form:
x <- "There are 2.3 billion species in the world"
I need the text between There and either million or billion, including those markers. Whether the sentence contains million or billion is only known at runtime; it is not decided in advance. So the output I need from this sentence is
[1] There are 2.3 million
OR [2] There are 2.3 billion
I am using the rm_between function from the qdapRegex package for this. With the following command, I can only extract one of them at a time.
library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)
OR I have to use
rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)
How can I write a single command that checks for either million or billion in the same sentence? Something like this:
rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)
Hope this is clear. Any help would be greatly appreciated.
The left and right arguments of rm_between accept a character/numeric vector, so you can pass vectors of equal length to both left and right.
library(qdapRegex)
unlist(rm_between(x, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million" "There are 2.3 billion"
unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million"
unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 billion"
or
sub('\\s*species.*', '', x)
data
x <- c("There are 2.3 million species in the world",
"There are 2.3 billion species in the world")
x1 <- "There are 2.3 million species in the world"
x2 <- "There are 2.3 billion species in the world"
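For comparison, here is a minimal base R sketch of the same extraction that does not rely on qdapRegex. The attempt 'billion' || 'million' in the question fails because || expects logical values, not strings; the "either word" choice has to be written inside the regular expression as an alternation. The names pattern and result are only illustrative.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Alternation lives inside the pattern: (million|billion) matches either word.
# .*? is lazy, so the match ends at the first million/billion after "There".
pattern <- "There.*?(million|billion)"

# regexpr() locates the first match in each element; regmatches() extracts it.
# Elements without a match are dropped from the result.
result <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(result)
```

Because both markers are part of the pattern, they are included in the extracted text, which mirrors include.markers = TRUE.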
With rm_between you can supply vectors of equal length for multiple markers, as the documentation states.
EDIT
See @TylerRinker's answer for the updated arguments of rm_between.
Alternatively, rm_default lets you supply a user-defined regular expression:
rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)
Example:
library(qdapRegex)
x <- c(
'There are 2.3 million species in the world',
'There are 2.3 billion species in the world'
)
rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)
## [[1]]
## [1] "There are 2.3 million"
## [[2]]
## [1] "There are 2.3 billion"
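Two pieces of the pattern above do the work: [bm]illion is a character class that matches billion or million, and .*? is a lazy quantifier that stops at the first such word. The following base R sketch illustrates the difference between lazy and greedy matching; the sentence y is a made-up example, not taken from the question.

```r
# Hypothetical sentence containing both words, to show where each match ends.
y <- "There are 2.3 million species and 1 billion insects"

# Lazy .*? stops at the first word matched by [bm]illion.
lazy <- regmatches(y, regexpr("There.*?[bm]illion", y, perl = TRUE))

# Greedy .* runs on to the last word matched by [bm]illion.
greedy <- regmatches(y, regexpr("There.*[bm]illion", y, perl = TRUE))

print(lazy)
print(greedy)
```

With only one marker word per sentence, as in the question, both forms return the same text; the lazy form is the safer default when a sentence may contain more than one.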
@hwnd (my qdapRegex co-author) inspired a discussion that led to a new argument, fixed, for rm_between. The following description is in the dev version:
rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regex special characters were fixed by default (escaped). This prevented the use of powerful regexes for the left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regex boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extract string between words using boolean operators in rm_between function
To install the dev version:
if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")
Using qdapRegex version >= 0.4.1 you can do the following.
x <- c(
"There are 2.3 million species in the world",
"There are 2.3 billion species in the world"
)
rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
include.markers = TRUE, extract = TRUE)
## [[1]]
## [1] "There are 2.3 million"
##
## [[2]]
## [1] "There are 2.3 billion"
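To make the effect of fixed concrete, here is a rough base R sketch of the idea. It is not the qdapRegex implementation; the helper between() is hypothetical, and it uses PCRE's \Q...\E quoting as a stand-in for escaping the markers.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Hypothetical helper, not part of qdapRegex.
# fixed = TRUE  : markers are taken literally (quoted with \Q...\E).
# fixed = FALSE : markers are used as regular expressions.
between <- function(x, left, right, fixed = TRUE) {
  if (fixed) {
    left  <- paste0("\\Q", left,  "\\E")
    right <- paste0("\\Q", right, "\\E")
  }
  m <- regexpr(paste0(left, ".*?", right), x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.vector(m) != -1L          # TRUE where a match was found
  out[hit] <- regmatches(x, m)        # matches come back in input order
  out
}

print(between(x, "There", "[mb]illion", fixed = FALSE))  # regex boundary
print(between(x, "There", "[mb]illion"))                 # literal "[mb]illion"
```

With fixed = FALSE the right marker matches either word; with the default it looks for the literal text [mb]illion, finds nothing, and returns NA for both sentences.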