A regex to extract the first date_time stamp when multiple are present

Given a string with multiple date_time stamps, I would like to extract the first stamp along with the text preceding it

  • Candidate strings can have one or more date_time stamps
  • Subsequent date_time stamps will be separated by the character sep="-"
  • There may or may not be text between subsequent date_time stamps, but there will always be a sep

date_time format:

  • Each individual stamp may or may not contain a time (i.e. it may be a date only)
  • If the stamp has a time, the format will be either _HHMM or _HHMMSS
  • The date will always be in the format YYYYMMDD


library(stringr)  

string   <- "TEXT_etc_20140530-20140825_1635-"
expected <- "TEXT_etc_20140530"

## using this pattern for the date_time stamp
##  8 digits, optional underscore with 4to6 digits, appearing exactly once, followed by "-"
##   (\\d{8}(_\\d{4,6})?){1}-    # I am not concerned with the potential of a 5-digit time stamp

## Attempts
pat1 <- "(TEXT)(.*?)(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pattern=pat1)
pat2 <-            "(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pattern=pat2)  ## date is correct
pat3 <-       "(.*?)(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pattern=pat3)
pat4 <-       "(.*?)(\\d{8}){1}-"            ;  str_extract(string, pattern=pat4)

## Other potential string patterns
string   <- "TEXT_etc_20140530-diff_txet_20140825_1635-"
string   <- "TEXT_etc_20140530_123456-diff_txet_20140825_1635-"

      

Can you help me spot the error in my regex?

Note to non-R users: R requires the escape character \ to be escaped, hence \\ in the code above.

+3




4 answers


Replace the first run of 8 digits and everything after it with just those 8 digits:

# test data
string  <- c("TEXT_etc_20140530-20140825_1635-",
   "TEXT_etc_20140530-diff_txet_20140825_1635-",
   "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

sub("(\\d{8}).*", "\\1", string)
## [1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

      

If the optional time part needs to be kept as well, use this instead:



# "_" (rather than ".") so that only an underscore-separated time is accepted
sub("(\\d{8}(_\\d{4,6})?)\\b.*", "\\1", string)
## [1] "TEXT_etc_20140530"        "TEXT_etc_20140530"      
## [3] "TEXT_etc_20140530_123456"
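
A note on the attempts in the question: each of those patterns ends in a literal "-", so the separator is returned as part of the match, whereas the expected value stops just before it. One way around that, sketched here on the assumption that a PCRE lookahead is acceptable, is to test for the separator without consuming it. The name pat_first is only illustrative, and base R's regexpr()/regmatches() are used instead of stringr so that the snippet has no dependencies.

```r
# Sample inputs copied from the question.
string <- c("TEXT_etc_20140530-20140825_1635-",
            "TEXT_etc_20140530-diff_txet_20140825_1635-",
            "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

# Lazy prefix, 8-digit date, optional _HHMM or _HHMMSS,
# then a lookahead that requires "-" without making it part of the match.
# (pat_first is a hypothetical name, not from the original post.)
pat_first <- "^.*?\\d{8}(?:_\\d{4,6})?(?=-)"

# regexpr() locates the first match in each element; regmatches() extracts it.
first_stamp <- regmatches(string, regexpr(pat_first, string, perl = TRUE))
print(first_stamp)
```

Unlike the sub() calls above, this keeps the requirement that a stamp is followed by sep: a string that contains no "-" at all produces no match instead of being passed through.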

      

Update: added a second solution and made a fix.

+5




What about

strings <- c("TEXT_etc_20140530-20140825_1635-",
    "TEXT_etc_20140530-diff_txet_20140825_1635-",
    "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

library(stringr)

pat <- "^\\w*\\d{8}(_\\d{4,6})?"
str_extract(strings, pattern=pat)

      



which returns

[1] "TEXT_etc_20140530"      "TEXT_etc_20140530"     "TEXT_etc_20140530_123456"

      

+3




This is one way:

pat <- '^(?U)(.*\\d{8}).*$'
gsub(pat, '\\1', string, perl=TRUE)
# [1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

      

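(?U) inverts the greediness of the quantifiers that follow it, so .* matches as little as possible and the capture stops at the first run of 8 digits.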
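
For comparison, here is a small sketch showing that the same result can be had without the flag by making only the first quantifier lazy with .*?. The variable names with_flag and with_lazy are illustrative, and the test data is the one used in the other answers.

```r
# Sample inputs copied from the question.
string <- c("TEXT_etc_20140530-20140825_1635-",
            "TEXT_etc_20140530-diff_txet_20140825_1635-",
            "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

# (?U) flips greediness for every quantifier after it.
with_flag <- sub("^(?U)(.*\\d{8}).*$", "\\1", string, perl = TRUE)

# Explicit lazy quantifier on the prefix only; the trailing .* stays greedy.
with_lazy <- sub("^(.*?\\d{8}).*$", "\\1", string, perl = TRUE)

# Both calls keep the text up to and including the first 8-digit date.
print(with_flag)
print(identical(with_flag, with_lazy))
```

The explicit .*? form may be easier to read when only one part of the pattern needs to be non-greedy.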

+2




You can also try:

 library(stringi)
 stri_extract_first_regex(string, "[^0-9]+\\d{8}")
 #[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

      

or

 str_extract(string, "[^0-9]+\\d{8}")
 #[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

      

To keep the optional time as well:

 stri_extract_first_regex(string, "[^0-9]+\\d{8}(?:_[0-9]{4,6})?")
 #[1] "TEXT_etc_20140530"        "TEXT_etc_20140530"       
 #[3] "TEXT_etc_20140530_123456"


 # data used in the examples above
 string  <- c("TEXT_etc_20140530-20140825_1635-",
"TEXT_etc_20140530-diff_txet_20140825_1635-",
"TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

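
One caveat with the [^0-9]+ prefix, shown as a rough sketch: it assumes the leading text contains no digits. The input below is hypothetical (it is not one of the strings from the question) and the variable names are illustrative; base R with perl = TRUE is used so the snippet runs without extra packages.

```r
# Hypothetical input: the leading text itself contains a digit ("v2").
x <- "v2_run_20140530-20140825_1635-"

# The character class cannot span the "2", so the match starts after it
# and the beginning of the prefix is lost.
class_based <- regmatches(x, regexpr("[^0-9]+\\d{8}", x, perl = TRUE))

# Anchoring at the start with a lazy prefix keeps the whole leading text.
anchored <- regmatches(x, regexpr("^.*?\\d{8}", x, perl = TRUE))

print(class_based)
print(anchored)
```

If the text before the first stamp is known to be digit-free, as in the question's examples, both forms behave the same.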
      

+1








