Stop grep in R from treating "." like a letter

I have a character vector that contains text similar to the following:

text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

      

I would like to remove everything up to and including the last . so that I get:

> "xYz" "ge" "qrstu"

      

The function grep, however, seems to treat . like a letter:

pattern <- "([A-Z]|[a-z])+$"

grep(pattern, text, value = T)

> "ABc.def.xYz" "ge"          "lmo.qrstu" 

      

The pattern works elsewhere, for example in regexpal.

How can I get grep to behave as expected?

+3




4 answers


grep is designed to find the elements that match a pattern. It returns the indices of the elements of the vector that match the pattern. If value=TRUE is specified, it returns the matching elements themselves. From the description, it sounds like you want to remove a substring rather than return a subset of the original vector.

If you need to remove a substring, you can use sub:



 sub('.*\\.', '', text)
 #[1] "xYz"   "ge"    "qrstu"

      

As the first argument, we give the pattern, i.e. '.*\\.'. It matches zero or more characters ( .* ) followed by a literal period ( \\. ). The \\ is needed to escape the . so that it is treated as a literal character instead of a metacharacter that matches any character. Because .* is greedy, the match extends to the last . in the string. We replace the matched part with '' (the replacement argument) and thereby remove the substring.
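
To make the escaping and the greedy matching described above concrete, here is a minimal sketch on the same text vector. The variable names, the bracket form [.] and the lazy variant with perl = TRUE are illustrative additions, not part of the original answer.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy .* runs to the LAST literal period, so everything up to it is removed.
after_last <- sub(".*\\.", "", text)
# "xYz" "ge" "qrstu"

# A bracket expression is another way to make the period literal.
after_last_bracket <- sub(".*[.]", "", text)
# "xYz" "ge" "qrstu"

# A lazy quantifier (.*?) stops at the FIRST period instead.
after_first <- sub(".*?\\.", "", text, perl = TRUE)
# "def.xYz" "ge" "qrstu"

# Unescaped, "." matches any character, so the pattern consumes the whole string.
unescaped <- sub(".*.", "", text)
# "" "" ""
```

Elements without a period, such as "ge", are returned unchanged by the escaped patterns because sub leaves non-matching strings alone.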

+5




grep does not perform any substitutions. It searches for matches and returns the indices (or the values, if you specify value = TRUE) of the elements that match. The results you get just say that each element meets your criteria somewhere in the string. If you added something that doesn't match the criteria to your text vector (e.g. "9", "# $% 23", ...), grep would not return it.
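
A small sketch of that behaviour, assuming the text and pattern from the question. The name text2 and the use of grepl are illustrative additions.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# Append two elements that do not end in a letter.
text2 <- c(text, "9", "# $% 23")

idx  <- grep(pattern, text2)                # indices of the matching elements
vals <- grep(pattern, text2, value = TRUE)  # the matching elements, unchanged
hits <- grepl(pattern, text2)               # one TRUE/FALSE per element
```

Here idx is 1 2 3, vals holds the three original strings in full, and hits is TRUE for the first three elements and FALSE for the two added ones.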

If you just want to return the matched part, you should look at the function regmatches. However, for your purposes, it seems like sub or gsub should do what you want.



gsub(".*\\.", "", text)
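
For the regmatches route mentioned above, a minimal sketch could look like the following. It reuses the pattern from the question. The variable names are illustrative.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# regexpr() locates the first match in each element;
# regmatches() then extracts just that matched substring.
m <- regexpr(pattern, text)
matched <- regmatches(text, m)
# "xYz" "ge" "qrstu"

# Elements with no match are dropped from the result, so the output
# can be shorter than the input.
text_extra <- c(text, "9")
matched_extra <- regmatches(text_extra, regexpr(pattern, text_extra))
```

Because non-matching elements are dropped, sub may be the safer choice when the result has to stay aligned with the input vector.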

      

I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is decent reading as well, but note that R regular expressions are slightly different from others: https://en.wikipedia.org/wiki/Regular_expression

+5




You can try the str_extract function from the stringr package.

library(stringr)
str_extract(text, "[^.]*$")

      

This will match all non-dot characters at the end of the string.
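
If stringr is not available, a base R sketch that gives the same result for these inputs is to split on a literal period and keep the last piece. This is a swapped-in technique (strsplit with fixed = TRUE), not what the answer above uses, and edge cases such as strings ending in a period may behave differently.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# fixed = TRUE makes "." a literal separator rather than a regex metacharacter.
parts <- strsplit(text, ".", fixed = TRUE)

# Keep the last piece of each split element.
last_piece <- vapply(parts, function(p) p[length(p)], character(1))
# "xYz" "ge" "qrstu"
```

Using fixed = TRUE sidesteps the escaping question entirely, since no regular expression is involved.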

+3




Your pattern works; the problem is that grep does something different from what you think it does.

Let's first use your pattern with str_extract_all from the stringr package.

library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"

[[2]]
[1] "ge"

[[3]]
[1] "qrstu"

      

Note that the results are what you expected!

The problem you are having is that grep gives you the complete element that matches your regex, not just the matching part of the element. In the example below, grep returns the whole first element because it contains "a":

grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"

      

+2








