Extract string between words using boolean operators in rm_between function

Question

I am trying to extract the text between two words. Consider this example:

x <-  "There are 2.3 million species in the world"

It can also take a different form:

x <-  "There are 2.3 billion species in the world"

I need the text between 'There' and either 'million' or 'billion', including the markers themselves. Whether million or billion is present is only known at runtime; it is not decided in advance. So the output I need from this sentence is

[1] There are 2.3 million

OR
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package for this. With this command, I can only extract one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)

or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a command that checks for either million or billion in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

Hope this is clear. Any help would be greatly appreciated.

+3

string r qdapregex

Ronak Shah Jul 25 15 at 4:52

4 answers

You can use str_extract_all (for a global match) or str_extract (for a single match):

library(stringr)
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

or

# perl() is deprecated in stringr >= 1.0.0; regex() is its replacement
str_extract_all(x, regex("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))
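
If adding stringr as a dependency is not desirable, the same idea can be sketched in base R with regexpr() and regmatches(). This is only an illustrative equivalent and not part of the original answer; the vector x is assumed to hold the example sentences from the question.

```r
# Example input (same sentences as in the question)
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from the word "There" up to the first "million" or "billion"
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element;
# regmatches() then extracts the matched substrings
m <- regexpr(pattern, x, perl = TRUE)
result <- regmatches(x, m)
print(result)
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input when some strings contain neither marker.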

+3

Avinash Raj Jul 25 at 4:55 am

With rm_between, you can supply vectors of equal length for multiple markers, as the documentation states.

EDIT

See @TylerRinker's answer for the updated arguments to rm_between.

Alternatively, rm_default lets you supply a user-defined regex:

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example:

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"

+2

hwnd Jul 25 15 at 5:35 am

@hwnd (my co-author on qdapRegex) inspired a discussion that led to a new argument, fixed, for rm_between. The following description is in the dev version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regex special characters were fixed by default (escaped). This prevented the use of powerful regex for left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regex boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using boolean operators in rm_between function

To install the dev version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

Using qdapRegex version >= 4.1, you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left='There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
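
To see why literal (escaped) boundaries cannot express an alternative such as million-or-billion, the distinction can be illustrated with base R's grepl(), which has its own fixed argument. This is only an analogy, assuming that fixed = TRUE in rm_between treats the markers literally as the description above says; the sketch does not call qdapRegex.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Treated literally, "[mb]illion" is searched for as the exact text "[mb]illion",
# which does not occur in either sentence
literal_hits <- grepl("[mb]illion", x, fixed = TRUE)

# Treated as a regex, "[mb]" is a character class matching "m" or "b"
regex_hits <- grepl("[mb]illion", x)

print(literal_hits)
print(regex_hits)
```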

+2

Tyler Rinker Jul 25 15 at 16:45

akrun · Accepted Answer · 2015-07-25T04:56:24+0000

The left and right arguments of rm_between accept a vector of character or numeric symbols. This way you can use vectors of equal length for both arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

or

  sub('\\s*species.*', '', x)
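
The sub() call above relies on the word species following the number. A variant that keys on the markers themselves might look like the sketch below. It assumes every string contains There followed later by million or billion; sub() returns strings without a match unchanged.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Capture from "There" through the first "million" or "billion",
# then replace the whole string with just that captured part
trimmed <- sub("^.*?(There.*?(?:million|billion)).*$", "\\1", x, perl = TRUE)
print(trimmed)
```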

data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"
