A regex to extract the first datetimestamp when multiple are present
Given a string with multiple date_time stamps, I would like to extract the first stamp along with the text preceding it.
- Candidate strings can have one or more time stamps.
- Subsequent date_time stamps will be separated by the character sep="-".
- There may or may not be text between subsequent date_time stamps, but there will definitely be a sep.
date_time format:
- Each individual stamp may or may not contain a time (i.e. it may be a date only).
- If the stamp has a time, the format will be either _HHMM or _HHMMSS.
- The date will always be in the format YYYYMMDD.
library(stringr)
string <- "TEXT_etc_20140530-20140825_1635-"
expected <- "TEXT_etc_20140530"
## using this pattern for the date_time stamp
## 8 digits, optional underscore with 4 to 6 digits, appearing exactly once, followed by "-"
##   (\\d{8}(_\\d{4,6})?){1}-   # I am not concerned with the potential of a 5-digit time stamp
## Attempts
pat1 <- "(TEXT)(.*?)(\\d{8}(_\\d{4,6})?){1}-"; str_extract(string, pat=pat1)
pat2 <- "(\\d{8}(_\\d{4,6})?){1}-"; str_extract(string, pat=pat2) ## date is correct
pat3 <- "(.*?)(\\d{8}(_\\d{4,6})?){1}-"; str_extract(string, pat=pat3)
pat4 <- "(.*?)(\\d{8}){1}-" ; str_extract(string, pat=pat4)
## Other potential string patterns
string <- "TEXT_etc_20140530-diff_txet_20140825_1635-"
string <- "TEXT_etc_20140530_123456-diff_txet_20140825_1635-"
Can you help me spot the error in my regex?
Note to non-R users: R requires the escape character \ to be escaped itself, hence the \\ in the code above.
+3
4 answers
Replace the first run of 8 digits and everything after it with just those 8 digits:
# test data
string <- c("TEXT_etc_20140530-20140825_1635-",
"TEXT_etc_20140530-diff_txet_20140825_1635-",
"TEXT_etc_20140530_123456-diff_txet_20140825_1635-")
sub("(\\d{8}).*", "\\1", string)
## [1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"
If the time, when present, should be kept as well, use instead:
sub("(\\d{8}(_\\d{4,6})?)\\b.*", "\\1", string)  # "_" rather than "." so only an underscore-separated time is kept
## [1] "TEXT_etc_20140530" "TEXT_etc_20140530"
## [3] "TEXT_etc_20140530_123456"
Update: added a second solution and made a fix.
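As for the error the question asks about: the attempts appear to go wrong mainly because each pattern ends in a literal "-", so the separator is consumed as part of the match. Below is a minimal sketch of the same lazy match with the separator moved into a lookahead. It uses base R with perl = TRUE instead of stringr, and extract_first is a hypothetical helper written for this illustration, not a function from any package.

```r
# Test strings from the question.
string <- c("TEXT_etc_20140530-20140825_1635-",
            "TEXT_etc_20140530-diff_txet_20140825_1635-",
            "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Literal "-" at the end: the separator becomes part of the match.
with_sep <- extract_first(string, ".*?\\d{8}(_\\d{4,6})?-")

# Lookahead (?=-): the separator must follow, but is not consumed.
without_sep <- extract_first(string, ".*?\\d{8}(_\\d{4,6})?(?=-)")

print(with_sep)
print(without_sep)
```

The lookahead keeps the requirement that a stamp is followed by sep while leaving sep out of the result, which is what the expected value in the question asks for.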
+5
What about
strings <- c("TEXT_etc_20140530-20140825_1635-",
"TEXT_etc_20140530-diff_txet_20140825_1635-",
"TEXT_etc_20140530_123456-diff_txet_20140825_1635-")
pat <- "^\\w*\\d{8}(_\\d{4,6})?"
str_extract(strings, pat=pat)
which returns
[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530_123456"
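One thing that may be worth checking before relying on this pattern: ^\\w* only spans letters, digits and underscores, so it depends on the leading text containing nothing else. The sketch below assumes a hypothetical input with a space in the leading text (the question shows no such string) and compares the pattern with a lazy ^.*? variant, using base R's regexpr with perl = TRUE.

```r
# Hypothetical input: same layout as the question's strings, but with a
# space in the leading text (an assumption, not taken from the question).
x <- "TEXT etc_20140530-20140825_1635-"

pat_word <- "^\\w*\\d{8}(_\\d{4,6})?"   # pattern from the answer above
pat_lazy <- "^.*?\\d{8}(_\\d{4,6})?"    # lazy variant, any leading characters

# regexpr() returns -1 when there is no match.
word_matches <- regexpr(pat_word, x, perl = TRUE)[1] != -1
lazy_result  <- regmatches(x, regexpr(pat_lazy, x, perl = TRUE))

print(word_matches)  # FALSE: \w cannot cross the space
print(lazy_result)
```

If the leading text is known to consist of word characters only, as in the question's examples, the original pattern is sufficient.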
+3
You can also try:
library(stringi)
stri_extract_first_regex(string, "[^0-9]+\\d{8}")
#[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"
or
str_extract(string, "[^0-9]+\\d{8}")
#[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"
To also extract the time, when present:
stri_extract_first_regex(string, "[^0-9]+\\d{8}(?:_[0-9]{4,6})?")
#[1] "TEXT_etc_20140530" "TEXT_etc_20140530"
#[3] "TEXT_etc_20140530_123456"
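Note that [^0-9]+ cannot span a digit, so these patterns assume the text before the first stamp is digit-free. The sketch below uses a hypothetical input with a digit in the leading text (an assumption; the question shows none) to illustrate how the prefix would be cut, next to a lazy ^.*? variant, using base R with perl = TRUE.

```r
# Hypothetical input: a digit ("2") appears in the text before the stamp.
x <- "TEXT2_etc_20140530-20140825_1635-"

# Pattern from the answer above: the match can only start after the "2".
m_nondigit   <- regexpr("[^0-9]+\\d{8}", x, perl = TRUE)
got_nondigit <- regmatches(x, m_nondigit)

# Lazy variant anchored at the start: keeps the whole prefix.
m_lazy   <- regexpr("^.*?\\d{8}(?:_[0-9]{4,6})?", x, perl = TRUE)
got_lazy <- regmatches(x, m_lazy)

print(got_nondigit)
print(got_lazy)
```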
#data
string <- c("TEXT_etc_20140530-20140825_1635-",
"TEXT_etc_20140530-diff_txet_20140825_1635-",
"TEXT_etc_20140530_123456-diff_txet_20140825_1635-")
+1