Stringr: extract words containing a specific word

Question

Consider this simple example

library(dplyr)
library(stringr)

dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
                             'WUFF;WEFF;WIFF2;BIGWIFF'))

> dataframe
# A tibble: 2 x 1
                      text
                     <chr>
1 WAFF;WOFF;WIFF200;WIFF12
2  WUFF;WEFF;WIFF2;BIGWIFF

Here I want to extract the words containing WIFF, that is, I want to get a data frame like this:

> output
# A tibble: 2 x 1
            text
           <chr>
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

I tried to use

dataframe %>% 
  mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))

but that only returns NA. Any ideas?

Thanks!

regex r stringr

ℕʘʘḆḽḘ 18 jul. 17 at 13:06

2 answers

The classic non-regex approach via base R would be:

# split each string on ';', keep the pieces containing 'WIFF', and glue them back together
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i) 
                              paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))

#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
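
Note that fixed = TRUE makes the match case sensitive, while the attempt in the question asked for ignore_case. A minimal sketch of a case-insensitive variant, assuming base R only; the helper name keep_wiff and the lower-case wiff200 in the sample data are made up for illustration:

```r
# Sample data; 'wiff200' is lower case on purpose to exercise ignore.case
dataframe <- data.frame(text = c('WAFF;WOFF;wiff200;WIFF12',
                                 'WUFF;WEFF;WIFF2;BIGWIFF'),
                        stringsAsFactors = FALSE)

# keep_wiff is a hypothetical helper, not part of any package
keep_wiff <- function(x) {
  parts <- strsplit(x, ';', fixed = TRUE)
  vapply(parts, function(p) {
    # ignore.case = TRUE cannot be combined with fixed = TRUE, so 'wiff' is a regex here
    paste(grep('wiff', p, value = TRUE, ignore.case = TRUE), collapse = ';')
  }, character(1))
}

result <- keep_wiff(dataframe$text)
result
#[1] "wiff200;WIFF12" "WIFF2;BIGWIFF"
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector.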

Sotos 18 jul. 17 at 13:14

Wiktor Stribiżew · Accepted Answer · 2017-07-18T13:11:38+0000

You seem to want to remove all words that do not contain WIFF, together with the trailing ; if there is one. Use

> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:

(?i) - case insensitive inline modifier
\\b - word boundary
(?!\\w*WIFF) - negative lookahead that fails any match where the word contains WIFF anywhere inside it
\\w+ - 1 or more word characters
;? - optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
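
To see the lookahead at work, here is a small sketch that tests the pattern against single words. The sample words are made up for illustration; TRUE means the pattern matches, so str_replace_all would delete that word:

```r
library(stringr)

pattern <- "(?i)\\b(?!\\w*WIFF)\\w+;?"

# Illustrative words only: one without WIFF, three with it in different positions/cases
words <- c("WAFF;", "WIFF200;", "BIGWIFF", "wiff2")

# The lookahead rejects every word that contains WIFF, regardless of case
matched <- str_detect(words, pattern)
matched
#[1]  TRUE FALSE FALSE FALSE
```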

If for some reason you want to use str_extract, note that your regex cannot work because \bWIFF\b matches the whole word WIFF and nothing else, and you don't have such words in your DF. You can use "(?i)\\b\\w*WIFF\\w*\\b" to match any word with WIFF inside (case insensitive). Use str_extract_all to get multiple occurrences, and don't forget to join the matches into one "string":

> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12" 

[[2]]
[1] "WIFF2"   "BIGWIFF"

> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

You can "condense" the code by nesting the str_extract_all call inside the sapply call; I have separated them for better readability.
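
For reference, a sketch of what the condensed form might look like, assuming the same sample data as above (stringsAsFactors = FALSE is added here so the column is a plain character vector on older R versions too):

```r
library(stringr)

df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'),
                 stringsAsFactors = FALSE)

# Extract every word containing WIFF, then collapse each list element into one string
df$text <- sapply(str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
                  paste, collapse = ";")
df$text
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```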
