Replace a matched string with part of that match
I've searched a lot of regex answers here but can't seem to find a solution to this problem.
My dataset is text containing Wikipedia-style links:
library(dplyr)  # provides data_frame() and %>% used further down
library(tidytext)
library(stringr)
text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
I am trying to strip the link markup from my text. This:
str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])")
# [1] "Duits" "architect"
selects the words I want from between the brackets.
This:
str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# [1] "Berthold Speer was een Duits Duits."
runs, but is not quite what I need, because only the first extracted word is used for every replacement. This:
str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# Error: `replacement` must be a character vector
gives an error where I expected "Berthold Speer was een Duits architect".
Currently my code looks something like this:
text.clean <- data_frame(text = text.raw) %>%
  mutate(text = str_replace_all(text, "\\[\\[.*?\\]\\]", str_extract_all(text, "[a-zA-Z\\s]+(?=\\])")))
I hope someone knows a solution or can point me to a duplicate question if one exists. My desired result: "Berthold Speer was een Duits architect".
1 answer
You can use a single gsub operation:
text <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
gsub("\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}", "\\1", text)
See the online R demo.
The pattern matches:
- \\[{2} - two [ characters
- (?:[^]|]*\\|)? - an optional sequence matching:
  - [^]|]* - zero or more characters other than ] and |
  - \\| - a pipe symbol
- ([^]]*) - Group 1: zero or more characters other than ]
- ]{2} - two ] characters.
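As a minimal sketch of how the pattern above could be applied to several strings at once, the snippet below wraps the gsub call in a small helper. The name strip_wiki_links and the second sample string are made up for illustration, and perl = TRUE is an assumption added here to make the regex engine explicit; the answer itself calls gsub without it.

```r
# Pattern from the answer: an optional "target|" prefix is skipped and
# only the visible label (capture group 1) is kept.
link_pattern <- "\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}"

# Hypothetical helper (not part of any package): replaces every
# [[target|label]] or [[label]] with just the label.
strip_wiki_links <- function(x) {
  gsub(link_pattern, "\\1", x, perl = TRUE)
}

texts <- c(
  "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]].",
  "No links here."  # illustrative input without any links
)

cleaned <- strip_wiki_links(texts)
print(cleaned)
```

Because gsub is vectorised over its input, the same helper should also work on a data frame column, for example inside mutate(text = strip_wiki_links(text)), which avoids passing the list returned by str_extract_all as a replacement.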