R regex find the last occurrence of a delimiter

I'm trying to get the endings of email addresses (i.e. .net, .com, .edu, etc.), but the part after @ can have multiple periods.


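library(stringi)

# Sample addresses (these match the expected output shown below)
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))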

> list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"

[[2]]
[1] "hotmail.com"

[[3]]
[1] "xyz.rr.edu"

[[4]]
[1] "abc.xx.zz.net"


Any suggestions to get something like this:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net


EDIT: Another try:

> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."



Here are some approaches. The first seems particularly straightforward and the second especially short.

1) sub This can be done with one application of sub in base R for each column:

data.frame(X1 = sub("@.*", "", strings1), 
           X2 = sub(".*@", "", strings1), 
           X3 = sub(".*[.]", "", strings1), 
           stringsAsFactors = FALSE)



    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net


2) strapplyc Here's an alternative using the gsubfn package, which is especially short. This returns a character matrix. strapplyc returns the matches to the parts of the pattern in parentheses. The first set of parentheses matches everything before @, the second set matches everything after @, and the last set matches everything after the last dot.

library(gsubfn)  # provides strapplyc and read.pattern
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))

     [,1]   [,2]            [,3] 
[1,] "test" "aol.com"       "com"
[2,] "test" "hotmail.com"   "com"
[3,] "test" "xyz.rr.edu"    "edu"
[4,] "test" "abc.xx.zz.net" "net"
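For comparison, the same three capture groups can be pulled out in base R with regexec and regmatches. This is only a sketch, not part of the gsubfn approach above; the strings1 values are an assumption reconstructed from the expected output in the question, and perl = TRUE is used so the greedy matching is unambiguous.

```r
# Assumed input, reconstructed from the expected output in the question
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

pat <- "(.*)@(.*[.](.*))"

# regexec returns the full match plus one entry per capture group;
# regmatches turns those positions into the matched substrings
hits <- regmatches(strings1, regexec(pat, strings1, perl = TRUE))

# Bind into a matrix and drop column 1 (the full match), keeping the 3 groups
m <- do.call(rbind, hits)[, -1]
m
```

Each row of m holds the local part, the full domain and the TLD, in that order.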


2a) read.pattern read.pattern, also in the gsubfn package, can do this using the same pat defined in (2):

pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)


giving a data.frame similar to (1), except that the column names are V1, V2 and V3.


3) strsplit The overlapping extractions make this awkward with strsplit, but it can be done with two applications of strsplit. The first strsplit splits at @, and the second uses everything up to the last dot as the separator. That second strsplit always produces an empty string as the first field, which we drop using [, -1]. This gives a character matrix:

 ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
 cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )


giving the same answer as (2).
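To see where the empty first field in (3) comes from, here is a minimal sketch of the intermediate result. The strings1 values are an assumption, reconstructed from the expected output in the question.

```r
# Assumed input, reconstructed from the expected output in the question
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# The separator ".*[.]" matches from the start of each string up to the
# last dot, so strsplit emits an empty first field before the TLD
parts <- strsplit(strings1, ".*[.]")
parts[[1]]

# Bind into a matrix and drop the empty first column, leaving only the TLDs
tld <- do.call(rbind, parts)[, -1]
tld
```

Because only one column remains after [, -1], tld is a plain character vector, which cbind then attaches as the third column.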

4) strsplit / sub This is a combination of (1) and (3):

cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))


giving the same answer as (2).

4a) This is another way to use strsplit and sub. Here we append @ followed by the TLD to each string and then split on @.

do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))


giving the same answer as (2).
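The intermediate string in (4a) may make the trick clearer. A small sketch, again assuming strings1 holds the four addresses implied by the expected output in the question:

```r
# Assumed input, reconstructed from the expected output in the question
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Group 1 is the whole string and group 2 is the text after the last dot,
# so the replacement appends "@" plus the TLD to each address
tagged <- sub("(.*[.](.*))", "\\1@\\2", strings1)
tagged

# Every string now contains two "@" separators, giving three fields each
res <- do.call(rbind, strsplit(tagged, "@"))
res
```

After the sub step each string carries its own TLD as an extra @-delimited field, so a single split is enough.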

Update: Added additional solutions.



A read.table + file_ext approach (not a regex, but pretty easy):

dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)

##     V1            V2  V3
## 1 test       aol.com com
## 2 test   hotmail.com com
## 3 test    xyz.rr.edu edu
## 4 test abc.xx.zz.net net


Here's a pure regex approach:

do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))

##     [,1]   [,2]        [,3] 
## [1,] "test" "aol"       "com"
## [2,] "test" "hotmail"   "com"
## [3,] "test" "xyz.rr"    "edu"
## [4,] "test" "abc.xx.zz" "net"




So this is the negative lookahead regex, which should give you the last .word of the line.
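A minimal sketch of how such a negative lookahead could be used from R. It assumes the pattern meant here is the one tried in the question's EDIT, and that strings1 holds the four addresses implied by the expected output; neither is confirmed by this answer.

```r
# Assumed input, reconstructed from the expected output in the question
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Backslashes must be doubled inside an R string literal; a lone "\." is
# the "unrecognized escape" error from the question. (?!...) is Perl
# syntax, so perl = TRUE is required.
pat <- "\\.(?!.*\\.)\\w+"

# A dot that is not followed by any later dot, plus the word after it
m <- regmatches(strings1, regexpr(pat, strings1, perl = TRUE))
m

# Strip the leading dot to keep just the TLD
tld <- sub("^\\.", "", m)
tld
```

Note that regmatches silently drops strings without a match, so the result is only aligned with strings1 when every address contains a dot.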





Solution using basic regex, assuming df1$X2 is a character vector:

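# Extract the trailing ".tld": a dot followed by letters at the end of the string
# (the "|" inside the original character class would have matched a literal "|")
df1 <- cbind(df1, X3 = regmatches(df1$X2, regexpr('\\.[A-Za-z]*$', df1$X2)))
# Drop the leading dot
df1$X3 <- gsub("\\.", "", df1$X3)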




