A regex to extract the first datetimestamp when multiple are present

Question

A regex to extract the first datetimestamp when multiple are present

Given a string with multiple date_time stamps, I would like to extract the first stamp along with the text preceding it

Candidate strings can have one or more time stamps
subsequent date_time stamps will be separated by character sep="-"
There may or may not be text between subsequent date_time stamps, but will definitely be sep

date_time format:

each individual stamp may or may not contain time (i.e. only date)
If the stamp has a time, the format will be either _HHMM

or_HHMMSS
the date will always be in the format YYYYMMDD

library(stringr)  

string   <- "TEXT_etc_20140530-20140825_1635-"
expected <- "TEXT_etc_20140530"

## using this pattern for the date_time stamp
##  8 digits, optional underscore with 4to6 digits, appearing exactly once, followed by "-"
. (\\d{8}(_\\d{4,6})?){1}-    # I am not concerned with potential of a 5-digit time stamp

## Attempts
pat1 <- "(TEXT)(.*?)(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pat=pat1)
pat2 <-            "(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pat=pat2)  ## date is correct
pat3 <-       "(.*?)(\\d{8}(_\\d{4,6})?){1}-";  str_extract(string, pat=pat3)
pat4 <-       "(.*?)(\\d{8}){1}-"            ;  str_extract(string, pat=pat4)

## Other potential string patterns
string   <- "TEXT_etc_20140530-diff_txet_20140825_1635-"
string   <- "TEXT_etc_20140530_123456-diff_txet_20140825_1635-"

Can you help me spot the error in my regex?

Note to users not R

: R

requires the escape character to \

be escaped, hence \\

in the code above

+3

regex r

Ricardo saporta 27 Aug 14 at 19:33

source to share

4 answers

What about

strings <- c("TEXT_etc_20140530-20140825_1635-",
    "TEXT_etc_20140530-diff_txet_20140825_1635-",
    "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

pat <- "^\\w*\\d{8}(_\\d{4,6})?"
str_extract(strings, pat=pat)

which returns

[1] "TEXT_etc_20140530"      "TEXT_etc_20140530"     "TEXT_etc_20140530_123456"

+3

MrFlick 27 Aug 14 at 19:44

source to share

This is one way:

pat <- '^(?U)(.*\\d{8}).*$'
gsub(pat, '\\1', string, perl=TRUE)
# [1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

(?U)

tells the parser to find the shortest match.

+2

Matthew plourde 27 Aug 14 at 19:44

source to share

You can also try:

 library(stringi)
 stri_extract_first_regex(string, "[^0-9]+\\d{8}")
 #[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

or

 str_extract(string, "[^0-9]+\\d{8}")
 #[1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

To extract time:

 stri_extract_first_regex(string, "[^0-9]+\\d{8}(?:_[0-9]{4,6})?")
 #[1] "TEXT_etc_20140530"        "TEXT_etc_20140530"       
 #[3] "TEXT_etc_20140530_123456"


 #data 
 string  <- c("TEXT_etc_20140530-20140825_1635-",
"TEXT_etc_20140530-diff_txet_20140825_1635-",
"TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

+1

akrun 27 Aug 14 at 19:58

source to share

G. Grothendieck · Accepted Answer · 2014-08-27T19:43:52+0000

Replace 8 digits followed by anything with these 8 digits:

# test data
string  <- c("TEXT_etc_20140530-20140825_1635-",
   "TEXT_etc_20140530-diff_txet_20140825_1635-",
   "TEXT_etc_20140530_123456-diff_txet_20140825_1635-")

sub("(\\d{8}).*", "\\1", string)
## [1] "TEXT_etc_20140530" "TEXT_etc_20140530" "TEXT_etc_20140530"

If extra time needs to be saved, use instead:

sub("(\\d{8}(.\\d{4,6})?)\\b.*", "\\1", string)
## [1] "TEXT_etc_20140530"        "TEXT_etc_20140530"      
## [3] "TEXT_etc_20140530_123456"

Update Added a second solution and made a fix.

A regex to extract the first datetimestamp when multiple are present

More articles: