R regex find the last occurrence of a delimiter

Question

I'm trying to get the endings of email addresses (i.e. .net, .com, .edu, etc.), but the part after @ can have multiple periods.

library(stringi)

strings1 <- c(
    'test@aol.com',
    'test@hotmail.com',
    'test@xyz.rr.edu',
    'test@abc.xx.zz.net'
)

list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))

    > list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"

[[2]]
[1] "hotmail.com"

[[3]]
[1] "xyz.rr.edu"

[[4]]
[1] "abc.xx.zz.net"

Any suggestions to get something like this:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net

EDIT: Another try:

> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."


string regex r

screechOwl 09 oct. 14 at 23:38


4 answers

A read.table + file_ext approach (not a regex, but pretty easy):

dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)
dat

##     V1            V2  V3
## 1 test       aol.com com
## 2 test   hotmail.com com
## 3 test    xyz.rr.edu edu
## 4 test abc.xx.zz.net net

Here's a purely regex approach:

do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))

##     [,1]   [,2]        [,3] 
## [1,] "test" "aol"       "com"
## [2,] "test" "hotmail"   "com"
## [3,] "test" "xyz.rr"    "edu"
## [4,] "test" "abc.xx.zz" "net"


Tyler Rinker 09 Oct. 14 at 23:55


Here is a negative-lookahead regex that should match the last .word in the string:

\.(?!.*\.)\w+
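
To use this pattern from R, the backslashes have to be doubled inside the string literal (a single `\.` is what triggers the "unrecognized escape" error shown in the question), and base R needs `perl = TRUE` for lookaheads. A minimal sketch, assuming the `strings1` vector from the question; the names `m` and `tld` are illustrative only:

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Locate the last ".word": a dot that no later dot follows, plus the word after it.
m <- regexpr("\\.(?!.*\\.)\\w+", strings1, perl = TRUE)

# Extract the matches (".com", ".edu", ...) and drop the leading dot.
tld <- sub("^\\.", "", regmatches(strings1, m))
tld
```

Note that `regmatches` silently drops elements with no match, so this assumes every address contains at least one dot.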


Jay 09 oct. 14 at 11:50 pm


Solution using a basic regex, assuming df1$X2 is a character vector:

df1 <- cbind(df1, X3 = regmatches(df1$X2, regexpr('\\.[A-Za-z]*$', df1$X2)))
df1$X3 <- gsub("\\.", "", df1$X3)


Sean Murphy 10 Oct. 14 at 12:24 am


G. Grothendieck · Accepted Answer · 2014-10-09T23:43:11+0000

Here are some approaches. The first seems particularly straightforward and the second especially short.

1) sub This can be done by applying sub once for each column:

data.frame(X1 = sub("@.*", "", strings1), 
           X2 = sub(".*@", "", strings1), 
           X3 = sub(".*[.]", "", strings1), 
           stringsAsFactors = FALSE)

giving:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net
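
The X3 column in (1) relies on `.*` being greedy: it consumes as much as possible, so `.*[.]` runs through the last dot rather than the first. A small sketch on a single address (the variable names are illustrative), with a lazy pattern shown for contrast:

```r
x <- "test@abc.xx.zz.net"

# Greedy: ".*" extends to the LAST dot, leaving only the TLD.
greedy <- sub(".*[.]", "", x)

# Lazy: ".*?" stops at the FIRST dot, leaving everything after it.
lazy <- sub(".*?[.]", "", x, perl = TRUE)

greedy  # "net"
lazy    # "xx.zz.net"
```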

2) strapplyc Here's an alternative using the gsubfn package, which is especially short. It returns a character matrix. strapplyc returns the matches to the parenthesized portions of the pattern. The first set of parentheses matches everything before @, the second set matches everything after @, and the last set matches everything after the last dot.

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))

     [,1]   [,2]            [,3] 
[1,] "test" "aol.com"       "com"
[2,] "test" "hotmail.com"   "com"
[3,] "test" "xyz.rr.edu"    "edu"
[4,] "test" "abc.xx.zz.net" "net"
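
If installing gsubfn is not an option, the same capture-group idea can be sketched in base R with `regexec` and `regmatches`. This is an alternative to, not a description of, how strapplyc works internally; the names `m` and `res` are illustrative:

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesized group;
# regmatches() then returns one character vector per input string.
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the vectors into a matrix and drop column 1 (the full match),
# keeping the three captured groups.
res <- do.call(rbind, m)[, -1]
res
```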

2a) read.pattern read.pattern, also in the gsubfn package, can do this using the same pat defined in (2):

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)

giving a data.frame similar to (1) except that the column names are V1, V2 and V3.

3) strsplit The overlapping extractions make it awkward to use strsplit, but we can do it with two applications of strsplit. The first strsplit splits on @, and the second uses everything up to the last dot as the split pattern. This second strsplit always produces an empty string as the first field, which we remove with [, -1]. This gives a character matrix:

 ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
 cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )

giving the same answer as (2).
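
To see where the empty first field in (3) comes from, it can help to run the second split on a single address. A minimal sketch (the variable name `parts` is illustrative):

```r
# The pattern matches from the start of the string through the last dot,
# so the piece "before" the match is an empty string.
parts <- strsplit("test@xyz.rr.edu", ".*[.]")[[1]]
parts        # ""    "edu"

# Dropping the first element leaves just the TLD.
parts[-1]    # "edu"
```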

4) strsplit / sub This is a combination of (1) and (3):

cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))

giving the same answer as (2).

4a) This is another way to use strsplit and sub. Here we append @ followed by the TLD and then split on @.

do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))

giving the same answer as (2).

Update: Added additional solutions.
