Cleaning dates (specifically years) with regex

I have a database with an unvalidated year field. Most of the entries are 4-digit years, but about 10% are "anything". This led me down a regex rabbit hole that has helped only a little. Anything better than what I have now is progress, even if I don't extract 100%.

#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96  ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing only a 2-digit year, nothing else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))

      

The desired output is to capture the first year to be read, so 1992-1993 would be 1992 and the 70s would be 1970.

How can I improve the accuracy of my parsing? Thanks!

EDIT: Following garyh's answer, this gets me much closer:

sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79"        "07-2608"   "07-262008" "96"        "70"        "93"        "70"        "15"        "60"       "70"        NA          "2013"      "1992"

      

but note that while dates with a dash in them work in garyh's regex101.com demo, they don't work in R, which keeps the month and day values and the first dash.

Also, I realize that I did not include a date with slashes instead of dashes. Adding another term to the regex should handle this but, again, R doesn't produce the same (correct) result that regex101.com does.

sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"

      

These negative lookbehinds and lookaheads are very powerful, but they stretch my weak brain.

+3



4 answers


Not sure what regex flavor R uses, but this seems to match all the years in each string:

/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g

      



This matches any four digits, or any two digits unless they are followed by a dash (-) or a digit, or preceded by another digit.

See demo here.
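In R the same pattern needs doubled backslashes and perl = TRUE for the lookarounds. A minimal sketch, not part of the demo, that checks each clause of the description on small inputs (has_year is a helper name made up for this illustration):

```r
# the pattern above written as an R string; lookarounds need perl = TRUE
year_pattern <- "(?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}"

# hypothetical helper: does the string contain anything the pattern accepts?
has_year <- function(x) grepl(year_pattern, x, perl = TRUE)

has_year("late 70's")  # TRUE:  "70" is followed by an apostrophe
has_year("1992")       # TRUE:  four digits
has_year("07-")        # FALSE: "07" is followed by a dash
has_year("123")        # FALSE: "12" is followed by a digit, "23" is preceded by one
```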

+2



You will need to apply some elbow grease and do something like:

library(lubridate)

yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96  ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")

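x <- yearEntries
# drop the qualifiers "late" and "early"
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
# drop apostrophes and the letter "s" ("70's" -> "70")
x <- gsub("[']*[s]*", "", x)
# keep only the first item of a comma-separated list
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
# entries shorter than 8 characters, or exactly 9 long, are not full dates: drop a trailing -digits or /digits part
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
# reduce 4-digit years to their last two digits
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
# parse full dates as month-day-year and keep the 2-digit year; other entries become NA
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")

# use the parsed year where there is one, otherwise the cleaned string
yearEntries <-ifelse(!is.na(y), y, x)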

yearEntries
##  [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA   "13" "92"

      



I don't know which year you want from a range entry, but that should get you started.

+1



I found a very simple way to get a good result (although I would not claim that it is bulletproof). It captures the last year in each entry rather than the first, which is fine too.

yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96  ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year 
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
#  [1] "99"      "08"      "08"      "96"      "70"      "95"      "15"      NA        "13"      "93"      "ongoing"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
#  [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993

      

... and Bob is your uncle!
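The hard-coded 2015 and the na.omit step can both be avoided. A sketch under the same assumption as above (the last two digits in a string are the year), using the current year as the pivot and keeping NA in place so the result stays aligned with the input. The names to_four_digit_year and pivot are made up for this illustration:

```r
to_four_digit_year <- function(x, pivot = as.integer(format(Sys.Date(), "%Y"))) {
  # last two consecutive digits in each string, NA where there are none
  two <- ifelse(grepl("\\d{2}", x), sub(".*(\\d{2}).*$", "\\1", x), NA_character_)
  yr <- as.integer(two) + 2000L
  # anything later than the pivot year is assumed to be last century
  ifelse(!is.na(yr) & yr > pivot, yr - 100L, yr)
}

to_four_digit_year(c("79, 80, 99", "07/26/08", "Early 70's", NA, "ongoing"), pivot = 2015L)
# [1] 1999 2008 1970   NA   NA
```

Passing pivot explicitly reproduces the 2015 cutoff used above. Leaving it out uses the year in which the code is run.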

0



@garyh's regex works great if you use the combination regmatches/gregexpr to extract the matches instead of sub:

regmatches(yearEntries, 
           gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"

[[2]]
[1] "08"

[[3]]
[1] "2008"

[[4]]
[1] "96"

[[5]]
[1] "70"

[[6]]
[1] "95"

[[7]]
[1] "70"

[[8]]
[1] "15"

[[9]]
[1] "60"

[[10]]
[1] "70"

[[11]]
character(0)

[[12]]
[1] "2013"

[[13]]
[1] "1992" "1993"

      

To keep only the first match:

sapply(regmatches(yearEntries,gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE)),`[`,1)
 [1] "79"   "08"   "2008" "96"   "70"   "95"   "70"   "15"   "60"   "70"   NA     "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
 [[1]]
 [1] "13"
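If you would rather stay with sub, the stray month and day in the question come from sub replacing only the matched part of the string: the match begins wherever \\D* can begin, and everything before that point is left untouched. A sketch that consumes the prefix with an anchored, lazy ^.*? so that the whole string is replaced by the first year (first_year_sub is a name made up here):

```r
# same alternation as above, wrapped so the match always spans the whole string
first_year_sub <- "^.*?((?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}).*$"

sub(first_year_sub, "\\1", c("07/09/13", "07-26-2008", "late 70's", "1992-1993"), perl = TRUE)
# [1] "13"   "2008" "70"   "1992"
```

One difference from regmatches: sub returns entries without any match unchanged instead of dropping them, so those still need to be handled separately.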

      

0

