Extract the character before the first dot in a string
I would like to extract the character preceding the first dot in a column of strings. I can do it with the code below, but the code seems too complex, and I had to resort to a for-loop. Is there an easier way? I am especially interested in a regex solution.
Note that finding the last number in each line will not work with my real data, although this approach will work with this example.
Thanks for any advice.
my.data <- read.table(text = '
my.string state
......... A
1........ B
112...... C
11111.... D
1111113.. E
111111111 F
111111111 G
', header = TRUE, stringsAsFactors = FALSE)
desired.result <- c(NA,1,2,1,3,NA,NA)
Determine the position of the first dot:
my.data$first.dot <- apply(my.data, 1, function(x) {
  as.numeric(gregexpr("\\.", x['my.string'])[[1]])[1]
})
Split each string into its characters:
split.strings <- t(apply(my.data, 1, function(x) { (strsplit(x['my.string'], '')[[1]]) } ))
my.data$revised.first.dot <- ifelse(my.data$first.dot < 2, NA, my.data$first.dot-1)
Extract the character preceding the first dot:
for(i in 1:nrow(my.data)) {
  my.data$character.before.dot[i] <- split.strings[i, my.data$revised.first.dot[i]]
}
my.data
#   my.string state first.dot revised.first.dot character.before.dot
# 1 .........     A         1                NA                 <NA>
# 2 1........     B         2                 1                    1
# 3 112......     C         4                 3                    2
# 4 11111....     D         6                 5                    1
# 5 1111113..     E         8                 7                    3
# 6 111111111     F        -1                NA                 <NA>
# 7 111111111     G        -1                NA                 <NA>
Use the regex below, and don't forget to include the parameter perl=TRUE.
^[^.]*?\K[^.](?=\.)
In R, the backslashes must be escaped, so the regex would look like
^[^.]*?\\K[^.](?=\\.)
> library(stringr)
> as.numeric(str_extract(my.data$my.string, perl("^[^.]*?\\K[^.](?=\\.)")))
[1] NA 1 2 1 3 NA NA
Pattern explanation:
- ^ asserts that we are at the start of the string.
- [^.]*? lazily matches any characters that are not a dot, i.e. the unwanted characters leading up to the first dot.
- \K discards the previously matched characters.
- [^.] is the character we actually want to match; it must not be a dot.
- (?=\.) requires that this character is followed by a dot.
Thus, the pattern matches the character immediately before the first dot.
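The perl() wrapper used above comes from older stringr releases and was later deprecated. As a hedged alternative, here is a minimal base R sketch that relies only on perl = TRUE. The vector x is a hypothetical stand-in for my.data$my.string so that the snippet runs on its own.

```r
# Hypothetical copy of my.data$my.string, so the snippet is self-contained
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

# \K requires the PCRE engine, hence perl = TRUE
m <- regexpr("^[^.]*?\\K[^.](?=\\.)", x, perl = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs instead
before.dot <- rep(NA_character_, length(x))
before.dot[m > 0] <- regmatches(x, m)

as.numeric(before.dot)
```

Filling a pre-allocated NA vector keeps one result per input row, which is what a data frame column needs.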
The simplest regular expression would be ^([^.])+(?=\.):
^ # Start of string
( # Start of group 1
[^.] # Match any character except .
)+ # Repeat as many times as needed, overwriting the previous match
(?=\.) # Assert the next character is a .
Test it live at regex101.com.
The content of group 1 will be your desired character. I am not really an R person, but according to RegexBuddy the following should work:
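# Match against the string column, not the whole data frame
matches <- regexpr("^([^.])+(?=\\.)", my.data$my.string, perl=TRUE)
# Rebuild a match object that points at capture group 1
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
# Returns group 1 for the matching rows only
regmatches(my.data$my.string, result)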
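Note that regmatches() returns only the elements that matched, so its result can be shorter than the input. The following is a minimal sketch of one way to keep one value per row with the same pattern. It assumes x is a hypothetical copy of my.data$my.string, and the names start, hit and group1 are illustrative only.

```r
# Hypothetical copy of my.data$my.string, so the snippet is self-contained
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

m <- regexpr("^([^.])+(?=\\.)", x, perl = TRUE)

# Start of the last repetition of group 1 (-1 where nothing matched)
start <- attr(m, "capture.start")[, 1]

# Group 1 is always a single character, so its start is also its end
hit <- m > 0
group1 <- rep(NA_character_, length(x))
group1[hit] <- substring(x[hit], start[hit], start[hit])

group1
```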
In this example, every character is either a digit or a dot, so the following works:
library(stringr)
as.numeric(str_extract(my.data$my.string, perl('\\d(?=\\.)')))
#[1] NA 1 2 1 3 NA NA
Or using stringi
library(stringi)
as.numeric(stri_extract(my.data$my.string, regex='\\d(?=\\.)'))
#[1] NA 1 2 1 3 NA NA
For the general case:
as.numeric(str_extract(my.data$my.string, perl('[^.](?=\\.)')))
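One caveat may be worth checking on real data. The unanchored pattern returns the first non-dot character that is followed by a dot, which is not necessarily the character before the first dot when a string starts with a dot. The sketch below compares it with the anchored pattern from the earlier answer. The input y is a hypothetical string that does not appear in the question's data, first_match is an illustrative helper, and base R regexpr() is used in place of stringr.

```r
# Hypothetical input: the first dot is at position 1, so nothing precedes it
y <- "..1.2"

# Illustrative helper: first match of a PCRE pattern, or NA if there is none
first_match <- function(pattern, s) {
  m <- regexpr(pattern, s, perl = TRUE)
  if (m > 0) regmatches(s, m) else NA_character_
}

unanchored <- first_match("[^.](?=\\.)", y)            # finds the "1" at position 3
anchored   <- first_match("^[^.]*?\\K[^.](?=\\.)", y)  # no match, so NA
```

If strings with leading dots should yield NA, as in the question's first row, the anchored form looks like the safer choice.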
Using rex can make this type of task a little easier.
my.data <- read.table(text = '
my.string state
......... A
1........ B
112...... C
11111.... D
1111113.. E
111111111 F
111111111 G
', header = TRUE, stringsAsFactors = FALSE)
library(rex)
re_matches(my.data$my.string,
rex(capture(except(".")), "."))$'1'
#> [1] NA "1" "2" "1" "3" NA NA