Stringr: extract words containing a specific word

Consider this simple example:

library(dplyr)    # tibble() and %>%
library(stringr)  # str_extract() and friends

dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
                             'WUFF;WEFF;WIFF2;BIGWIFF'))

> dataframe
# A tibble: 2 x 1
                      text
                     <chr>
1 WAFF;WOFF;WIFF200;WIFF12
2  WUFF;WEFF;WIFF2;BIGWIFF

      

Here I want to extract the words containing WIFF, that is, I want to get a data frame like this:

> output
# A tibble: 2 x 1
            text
           <chr>
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

      

I tried to use

dataframe %>% 
  mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))

      

but that only returns NA. Any ideas?

Thanks!

2 answers


You seem to want to remove all words that do not contain WIFF, together with the trailing ; if any. Use

> library(stringr)
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

      

The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:

  • (?i) - case insensitive inline modifier
  • \\b - word boundary
  • (?!\\w*WIFF) - negative lookahead that fails the match if the word contains WIFF anywhere inside it
  • \\w+ - 1 or more word characters
  • ;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
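To see what the pattern described above actually consumes, here is a minimal sketch on a single string. The variable names x, removed and kept are only illustrative and do not come from the original answer; it assumes the stringr package is installed.

```r
library(stringr)

x <- "WAFF;WOFF;WIFF200;WIFF12"
pattern <- "(?i)\\b(?!\\w*WIFF)\\w+;?"

# Words (plus their trailing ;) that the pattern matches, i.e. what gets deleted
removed <- str_extract_all(x, pattern)[[1]]
removed
# [1] "WAFF;" "WOFF;"

# Deleting those matches leaves only the words containing WIFF
kept <- str_replace_all(x, pattern, "")
kept
# [1] "WIFF200;WIFF12"
```

Because the lookahead is checked at the start of each word, a word such as BIGWIFF is protected as a whole, even though WIFF only appears at its end.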


If for some reason you want to use str_extract, note that your regex cannot work: \bWIFF\b matches only the whole word WIFF and nothing else, and you don't have such words in your DF. (In an R string literal the backslash also has to be doubled, as in "\\b"; a single "\b" is a backspace character.) You can use "(?i)\\b\\w*WIFF\\w*\\b" to match any word with WIFF inside (case insensitive), use str_extract_all to get multiple occurrences, and don't forget to join the matches into one string:

> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12" 

[[2]]
[1] "WIFF2"   "BIGWIFF"

> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

      

You can "condense" the code by putting the str_extract_all call inside the sapply call; I have separated them for better visibility.
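A condensed version of the extract-and-join steps might look like the sketch below. It is one possible way to merge them, not necessarily what the answerer had in mind, and it assumes stringr is installed.

```r
library(stringr)

df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'),
                 stringsAsFactors = FALSE)

# Extract every word containing WIFF, then paste each row's matches back together
df$text <- sapply(str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
                  paste, collapse = ";")
df
#             text
# 1 WIFF200;WIFF12
# 2  WIFF2;BIGWIFF
```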


The classic non-regex approach via base R would be:

sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
  paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))

#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF" 
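If the split-and-filter idea is needed in more than one place, it could be wrapped in a small function. The sketch below is only an illustration: keep_matching and its arguments are hypothetical names, not part of base R or of the original answer.

```r
# Hypothetical helper: keep only the sep-delimited words that contain `needle`
keep_matching <- function(x, needle, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(grep(needle, p, value = TRUE, fixed = TRUE),
                           collapse = sep),
         character(1))
}

res <- keep_matching(c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'),
                     "WIFF")
res
# [1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```

Note that with fixed = TRUE the match is case sensitive, unlike the (?i) regexes in the other answer; dropping fixed = TRUE from the grep call and adding ignore.case = TRUE would be one way to relax that.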

      
