Stop grep in R from treating "." like a letter
I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything up to and including the last ".":
> "xYz" "ge" "qrstu"
The function grep, however, seems to treat "." like a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, for example on regexpal. How can I get grep to behave as expected?
grep is designed to find a pattern. It returns the indices of the vector elements that match the pattern. If value = TRUE is specified, it returns the matching elements themselves. From the description, it sounds like you want to remove a substring rather than return a subset of the original vector.
If you need to remove a substring, you can use sub:
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
The first argument is the pattern, '.*\\.'. It matches zero or more characters (.*) followed by a period (\\.). The \\ is needed to escape the . so that it is treated as a literal period instead of a metacharacter that matches any character. Because .* is greedy, the match extends to the last . in the string. We replace the matched text with '' (the replacement argument) and thereby remove the substring.
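As a small sketch of the points above, the following compares the escaped greedy pattern with a lazy variant and with an unescaped period. The variable names greedy, lazy, and unescaped are only illustrative and are not from the original post, and perl = TRUE is an assumption made here so the lazy quantifier is handled by the PCRE engine.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy .* extends to the LAST literal period, so only the final piece remains.
greedy <- sub(".*\\.", "", text)
print(greedy)     # "xYz" "ge" "qrstu"

# A lazy .*? stops at the FIRST period instead.
lazy <- sub(".*?\\.", "", text, perl = TRUE)
print(lazy)       # "def.xYz" "ge" "qrstu"

# Without the escape, the final . matches any character,
# so every non-empty string is consumed whole.
unescaped <- sub(".*.", "", text)
print(unescaped)  # "" "" ""
```

Elements without a period, such as "ge", are left unchanged by the escaped patterns because there is nothing to match.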
grep does not perform any substitutions. It searches for matches and returns the indices (or the values, if you specify value = TRUE) of the elements that match. The results you get just say that those elements meet your criteria at some point in the string. If you added something to your text vector that doesn't match the criteria anywhere (e.g. "9", "# $% 23", ...), grep would not return it.
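To make that concrete, here is a minimal sketch that appends the two non-matching strings mentioned above to the question's vector. The names extended, idx, and vals are hypothetical and chosen only for this illustration.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# Append two elements that do not end in a letter.
extended <- c(text, "9", "# $% 23")

idx  <- grep(pattern, extended)                # positions of matching elements
vals <- grep(pattern, extended, value = TRUE)  # the matching elements, unmodified
print(idx)   # 1 2 3
print(vals)  # "ABc.def.xYz" "ge" "lmo.qrstu"
```

The matching elements come back whole, including the parts before the period, while "9" and "# $% 23" are filtered out.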
If you just want to return the matched part, you should look at the function regmatches. However, for your purposes, it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
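For completeness, the regmatches route mentioned above might look roughly like this with the pattern from the question. This is only a sketch; the names m and matched are illustrative.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# regexpr() locates the first match in each element;
# regmatches() then extracts the matched text.
m <- regexpr(pattern, text)
matched <- regmatches(text, m)
print(matched)  # "xYz" "ge" "qrstu"
```

Note that regmatches drops elements that have no match when used with regexpr, so the result can be shorter than the input, whereas sub and gsub always return a vector of the same length.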
I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is decent reading as well, but note that R regular expressions differ slightly from other flavors: https://en.wikipedia.org/wiki/Regular_expression
Your pattern works; the problem is that grep does something different from what you think it does.
Let's first use your pattern with str_extract_all from the stringr package.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Note that the results come out as you expected!
The problem you are having is that grep gives you the complete element that matches your regex, not just the matching part of the element. In the example below, grep returns the whole first element because it contains "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"