Replace a regex match with part of that match

I've searched a lot of regex answers here but can't seem to find a solution to this problem.

My dataset is a snippet of text containing Wikipedia links:

library(tidytext)
library(stringr)
text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."

      

I am trying to strip the links from my text. This:

str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])")
# [1] "Duits"     "architect"

      

selects the words I want from between the brackets.

This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# [1] "Berthold Speer was een Duits Duits."

      

works as expected, but is not exactly what I need. This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# Error: `replacement` must be a character vector

      

gives an error, where I expected "Berthold Speer was een Duits architect".

Currently my code looks something like this:

text.clean <- data_frame(text = text.raw) %>%
  mutate(text = str_replace_all(text, "\\[\\[.*?\\]\\]", str_extract_all(text, "[a-zA-Z\\s]+(?=\\])")))

      

I hope someone knows a solution or can point me to a duplicate question if one exists. My desired result: "Berthold Speer was een Duits architect".



1 answer


You can do this with a single gsub call:

text <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
gsub("\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}", "\\1", text)

      

See the online R demo.
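Since the question applies the replacement to a column of text, here is a minimal sketch of the same gsub call run over a data frame column. The data frame `wiki`, its second row, and the name `link_pattern` are hypothetical and only there for illustration; the pattern itself is copied from the call above.

```r
# Hypothetical data frame standing in for the asker's dataset.
wiki <- data.frame(
  text = c(
    "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]].",
    "Geen links in deze zin."
  ),
  stringsAsFactors = FALSE
)

# Pattern copied from the gsub call above.
link_pattern <- "\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}"

# gsub is vectorised over its input, so one call cleans every row.
# Rows without any [[...]] markup are returned unchanged.
wiki$clean <- gsub(link_pattern, "\\1", wiki$text)
print(wiki$clean)
```

Because gsub accepts a character vector, the same call should also work inside `mutate()` in place of the `str_replace_all()` / `str_extract_all()` combination from the question.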



The pattern matches:

  • \\[{2} - two [ characters
  • (?:[^]|]*\\|)? - an optional sequence matching
    • [^]|]* - zero or more characters other than ] and |
    • \\| - a pipe symbol
  • ([^]]*) - Group 1: zero or more characters other than ]
  • ]{2} - two ] characters.
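To see how the optional part behaves, here is a small sketch that applies the pattern to each of the two link forms on its own. The variable names `links` and `labels` are made up for this example; the pattern and the link strings come from the text above.

```r
# Pattern from the answer: optional "target|" prefix, then the captured label.
pattern <- "\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}"

# The two link forms from the sample sentence.
links <- c(
  "[[Duitsland (hoofdbetekenis)|Duits]]",  # piped link: target|label
  "[[architect]]"                          # plain link: label only
)

# For the piped link, the optional group consumes "Duitsland (hoofdbetekenis)|"
# and Group 1 captures "Duits". For the plain link there is no pipe, so the
# optional group matches nothing and Group 1 captures "architect".
labels <- gsub(pattern, "\\1", links)
print(labels)
```

In both cases the whole `[[...]]` match is replaced by Group 1, which is why a single pattern handles piped and plain links alike.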