R regex find the last occurrence of a delimiter
I'm trying to get the endings of email addresses (i.e. .net, .com, .edu, etc.), but the part after @ can have multiple periods.
library(stringi)
strings1 <- c(
'test@aol.com',
'test@hotmail.com',
'test@xyz.rr.edu',
'test@abc.xx.zz.net'
)
list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))
> list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"
[[2]]
[1] "hotmail.com"
[[3]]
[1] "xyz.rr.edu"
[[4]]
[1] "abc.xx.zz.net"
Any suggestions to get something like this:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
EDIT: Another try:
> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."
Here are some approaches. The first seems particularly straightforward and the second especially short.
1) sub This can be done with one application of sub per column, using only base R:
data.frame(X1 = sub("@.*", "", strings1),
X2 = sub(".*@", "", strings1),
X3 = sub(".*[.]", "", strings1),
stringsAsFactors = FALSE)
giving:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
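The pattern ".*[.]" reaches the last dot because .* is greedy. The following is a minimal sketch of that behaviour, contrasted with a lazy quantifier. The address user@localhost is a made-up edge case, not taken from the question.

```r
x <- "test@abc.xx.zz.net"

# Greedy: .* takes as much as it can, so [.] ends up on the LAST dot.
greedy <- sub(".*[.]", "", x)

# Lazy (PCRE syntax): .*? takes as little as it can, so [.] ends up on the FIRST dot.
lazy <- sub(".*?[.]", "", x, perl = TRUE)

# Hypothetical edge case: with no dot, sub() finds no match and returns the input unchanged.
no_dot <- sub(".*[.]", "", "user@localhost")

print(c(greedy = greedy, lazy = lazy, no_dot = no_dot))
```

If addresses without a dot can occur in the data, X3 would then contain the whole address, so it may be worth checking for that case separately.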
2) strapplyc Here's an alternative using the gsubfn package, which is especially short. This returns a character matrix. strapplyc returns the matches to the parts of the pattern in parentheses. The first set of parentheses matches everything before @, the second set matches everything after @, and the last (nested) set matches everything after the last dot.
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))
[,1] [,2] [,3]
[1,] "test" "aol.com" "com"
[2,] "test" "hotmail.com" "com"
[3,] "test" "xyz.rr.edu" "edu"
[4,] "test" "abc.xx.zz.net" "net"
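For readers without gsubfn, the same capture groups can probably be pulled out in base R with regexec and regmatches. This is only a sketch of an equivalent, not how strapplyc works internally, and it assumes R 3.3.0 or later, where regexec accepts perl = TRUE.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus one entry per parenthesized group,
# numbered by opening parenthesis; regmatches() turns the positions into strings.
m <- regmatches(strings1, regexec(pat, strings1, perl = TRUE))

# Bind the per-string results into a matrix and drop column 1 (the full match).
groups <- do.call(rbind, m)[, -1]
print(groups)
```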
2a) read.pattern read.pattern, also in the gsubfn package, can do this using the same pat defined in (2):
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)
giving a data.frame similar to (1) except that the column names are V1, V2 and V3.
3) strsplit The overlapping extractions make strsplit awkward to use here, but it can be done with two applications of strsplit. The first strsplit splits on @, and the second uses everything up to the last dot as the split pattern. This second strsplit always produces an empty string as its first piece, which we remove with [, -1]. This gives a character matrix:
ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )
giving the same answer as (2).
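To see where the empty first piece comes from, here is a small sketch on a single address from the question. Because the split pattern matches at the very start of the string, strsplit emits "" for the (empty) text before the match.

```r
x <- "test@xyz.rr.edu"

# The pattern ".*[.]" matches "test@xyz.rr." starting at position 1,
# so the piece before the match is the empty string.
pieces <- strsplit(x, ".*[.]")[[1]]
print(pieces)

# Dropping the first piece leaves only the TLD.
tld <- pieces[-1]
print(tld)
```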
4) strsplit / sub This is a combination of (1) and (3):
cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))
giving the same answer as (2).
4a) This is another way to use strsplit and sub. Here we append @ followed by the TLD and then split on @.
do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))
giving the same answer as (2).
Update: Added additional solutions.
A read.table + file_ext approach (not a regex, but pretty easy):
dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)
dat
## V1 V2 V3
## 1 test aol.com com
## 2 test hotmail.com com
## 3 test xyz.rr.edu edu
## 4 test abc.xx.zz.net net
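One caveat worth checking before relying on file_ext: it is meant for file names and only recognises an extension made of alphanumeric characters, returning "" otherwise. A short sketch, where the last two addresses are hypothetical examples rather than data from the question:

```r
addresses <- c("test@abc.xx.zz.net",      # from the question
               "user@localhost",          # hypothetical: no dot at all
               "user@example.xn--p1ai")   # hypothetical: punycode TLD with hyphens

# file_ext() returns "" when the text after the last dot is not purely alphanumeric,
# or when there is no dot.
ext <- tools::file_ext(addresses)
print(ext)

# For comparison, the sub() idiom keeps whatever follows the last dot.
via_sub <- sub(".*[.]", "", addresses)
print(via_sub)
```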
Here's a pure regex approach (note that the second column here excludes the TLD):
do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))
## [,1] [,2] [,3]
## [1,] "test" "aol" "com"
## [2,] "test" "hotmail" "com"
## [3,] "test" "xyz.rr" "edu"
## [4,] "test" "abc.xx.zz" "net"
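Two details explain why this works where the attempts in the question did not. First, a backslash inside an R string literal must itself be escaped, so the regex \. is written "\\." (a bare '\.' is rejected by the parser). Second, stri_split_fixed treats its pattern as literal text, so a lookahead has no effect there. The sketch below uses only base R to locate the last dot with the same lookahead idea; the variable names are illustrative.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# "\\." in R source is the two-character regex \. (a literal dot).
# The lookahead (?=[^.]+$) requires that only non-dot characters follow,
# which is true only for the final dot. Lookaheads need perl = TRUE.
last_dot <- "\\.(?=[^.]+$)"

# Position of the last dot in each string.
pos <- as.integer(regexpr(last_dot, strings1, perl = TRUE))

# Everything after that position is the TLD.
tld <- substring(strings1, pos + 1L)
print(tld)
```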