Extract string between words using boolean operators in rm_between function

I am trying to extract the text between two words. Consider this example:

x <-  "There are 2.3 million species in the world"

      

The sentence can also take a different form:

x <-  "There are 2.3 billion species in the world"

      

I need the text between There and either million or billion, including both markers. Whether the sentence contains million or billion is only known at runtime, not in advance. So the output I need from these sentences is:

[1] There are 2.3 million

OR
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package for this. With this command, I can only extract one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE) 

      

OR I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

      

How can I write a single command that checks for either million or billion in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

      

Hope this is clear. Any help would be greatly appreciated.

+3




4 answers


The left and right arguments of rm_between accept a character or numeric vector. So you can pass vectors of equal length to both left and right.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

      

or



  sub('\\s*species.*', '', x)

      

data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"

      

+3




You can use str_extract_all (for all matches) or str_extract (for the first match only):

library(stringr)
str_extract_all(s, "\\bThere\\b.*?\\b(?:million|billion)\\b")

      



or

str_extract_all(s, perl("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))
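
If stringr is not available, the same alternation pattern can be sketched in base R. This is a minimal illustration, assuming the two example sentences from the question are stored in x; regexpr() locates the first match in each element and regmatches() extracts it.

```r
# Base R sketch: extract "There ... million|billion" without extra packages.
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from the word "There" up to the first "million" or "billion".
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# perl = TRUE enables the lazy quantifier and the non-capturing group.
result <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(result)
# [1] "There are 2.3 million" "There are 2.3 billion"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than x when some sentences contain neither word.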

      

+3




With rm_between you can supply vectors of multiple markers, of equal length, as the documentation states.

EDIT

See @TylerRinker's answer for the updated arguments of rm_between.

Another option, which lets you use a user-defined regex, is rm_default:

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

      

Example:

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"

      

+2




@hwnd (my qdapRegex co-author) sparked a discussion that led to a new argument, fixed, for rm_between. The following description is in the dev version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regex special characters were fixed (escaped) by default. This prevented the use of powerful regex for the left/right boundaries. fixed = TRUE is still the default behavior, but users can now set fixed = FALSE to work with regex boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using boolean operators in rm_between function

To install the dev version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

      

Using qdapRegex version >= 4.1 you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
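
The difference between a fixed (literal) boundary and a regex boundary can be illustrated with base R's grepl, which has its own fixed argument. This is only an analogy for the behavior described above, not how rm_between is implemented.

```r
# Analogy in base R: literal vs. regex interpretation of "[mb]illion".
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# fixed = TRUE searches for the literal characters "[mb]illion",
# which do not occur in either sentence.
literal_hits <- grepl("[mb]illion", x, fixed = TRUE)

# fixed = FALSE treats "[mb]" as a character class, so both
# "million" and "billion" match.
regex_hits <- grepl("[mb]illion", x, fixed = FALSE)

print(literal_hits)  # [1] FALSE FALSE
print(regex_hits)    # [1] TRUE TRUE
```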

      

+2








