How to extract a set of dates from a string in Rails
I have the following strings
"sep 04 apr 06"
"29th may 1982"
"may 2006 may 2008"
"since oct 11"
Output
"September 2004 and April 2006"
"29 May 1982"
"May 2006 and May 2008"
"October 2011"
Is there any way to get the dates from these strings? I used the gem 'date_from_string', but it was unable to get the dates correctly from the first string.
The approach I have taken is as follows:
1. Split the string into an array of words.
2. If the array contains fewer than two words, return an array containing all the date strings found; otherwise go to step 3.
3. If the array contains at least three words and the first three words represent a date, save it, remove the first three words from the array, and repeat step 2; otherwise go to step 4.
4. If the first two words represent a date, save it, remove the first two words from the array, and repeat step 2; otherwise go to step 5.
5. Remove the first word from the array and go to step 2.
I look for dates using the class method Date::strptime. strptime takes a format string. For example, '%d %b %Y' looks for the day of the month, followed by a space, followed by a three-character (case-insensitive) month abbreviation (Jan, Feb, ..., Dec), followed by a space and a four-digit year. (I initially considered using Date::parse, but that does not discriminate dates adequately.)
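As a small illustration of how strptime behaves with that format string, here is a minimal sketch. The helper name try_format is hypothetical and is not part of the code that follows; it only shows that a mismatch raises an error that can be rescued.

```ruby
require 'date'

# Hypothetical helper: returns the parsed Date, or nil if str does not
# fit the format string fmt.
def try_format(str, fmt)
  Date.strptime(str, fmt)
rescue ArgumentError
  nil
end

try_format('29 may 1982', '%d %b %Y')  # a Date for 29 May 1982
try_format('hello world', '%d %b %Y')  # nil, because no day number is found
```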
code
First, I create all the strptime format strings of interest for the month, day, and year:
MON = %w{ %b %B } # '%b' for 'Jan', '%B' for 'January'
YR = %w{ %y %Y } # '%y' for '11', '%Y' for 2011
DAY = %w{ %d } # '4', '04' or '28'
PERM3 = MON.product(YR, DAY).
          flat_map { |arr| arr.permutation(3).to_a }.
          map { |arr| arr.join(' ') }
#=> ["%b %y %d", "%b %d %y", "%y %b %d", "%y %d %b", "%d %b %y", "%d %y %b",
# "%b %Y %d", "%b %d %Y", "%Y %b %d", "%Y %d %b", "%d %b %Y", "%d %Y %b",
# "%B %y %d", "%B %d %y", "%y %B %d", "%y %d %B", "%d %B %y", "%d %y %B",
# "%B %Y %d", "%B %d %Y", "%Y %B %d", "%Y %d %B", "%d %B %Y", "%d %Y %B"]
Then I do the same for the two-element permutations, month and year, and month and day:
PERM2 = MON.product(YR).
          concat(MON.product(DAY)).
          flat_map { |arr| arr.permutation(2).to_a }.
          map { |arr| arr.join(' ') }
#=> ["%b %y", "%y %b", "%b %Y", "%Y %b", "%B %y", "%y %B",
# "%B %Y", "%Y %B", "%b %d", "%d %b", "%B %d", "%d %B"]
Then I do the following:
require 'date'

def pull_dates(str)
  arr = str.split
  dates = []
  while arr.size > 1
    if arr.size > 2
      a = depunc(arr[0,3])
      if date?(a, PERM3)
        dates << a.join(' ')
        arr.shift(3)
        next
      end
    end
    a = depunc(arr[0,2])
    if date?(a, PERM2)
      dates << a.join(' ')
      arr.shift(2)
      next
    end
    arr.shift
  end
  dates
end
depunc removes punctuation at the beginning and end of the string arr.join(' ').
def depunc(arr)
  arr.join(' ').gsub(/^\W|\W$/,'').split
end
date? determines whether a three- or two-element array arr represents a date. First I get a "cleaned" string from arr, and then I go through the applicable strptime format strings (the argument perm), looking for one that shows the cleaned string can be converted to a date.
def date?(arr, perm)
  clean = to_str_and_clean(arr)
  perm.any? do |s|
    begin
      Date.strptime(clean, s)
      true
    rescue ArgumentError
      # strptime raises when clean does not fit the format string s
      false
    end
  end
end
to_str_and_clean returns the cleaned string, with punctuation removed and with suffixes such as 'st', 'nd', 'rd' and 'th' stripped from the numerical representation of the day.
def to_str_and_clean(arr)
  arr.map { |s| s[0][/\d/] ? s.to_i.to_s : s }.join(' ').tr('.?!,:;', '')
end
Example
Try it.
str =
"Bubba sighted a flying saucer on sep 04 2013 and again in apr 06. \
Greta was born on 29th may 1982. Hey, may 2006 may 2008 are two years apart.\
We have been at loose ends since oct 11 of this year."
pull_dates(str)
#=> ["sep 04 2013", "apr 06", "29th may 1982", "may 2006 may", "oct 11"]
Well, as you can see, it's not perfect. Some tweaking is required, but that might get you started.
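As one possible tweak, the check could insist that a format string consume the whole candidate string, which would rule out a result like "may 2006 may". The following is only a sketch under that assumption. It uses Date._strptime, which returns a hash of the parsed fields (or nil on failure) and reports any unparsed remainder under the :leftover key. The name strict_date? is hypothetical, and unlike Date.strptime this check does not verify that the fields form a valid calendar date.

```ruby
require 'date'

# Hypothetical stricter check: a format counts as a match only if it
# parses the string and leaves nothing unparsed.
def strict_date?(clean, formats)
  formats.any? do |fmt|
    parsed = Date._strptime(clean, fmt)
    parsed && !parsed.key?(:leftover)
  end
end

strict_date?('may 2006 may', ['%b %Y %d', '%b %y %d', '%b %d %y'])  # false
strict_date?('sep 4 2013', ['%b %d %Y'])                            # true
strict_date?('may 2006', ['%b %Y'])                                 # true
```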
When you say, "Unfortunately, I cannot predict what format the date should be in.", you are implying that you really need natural language analysis, which is something that the core Date or DateTime classes cannot and should not do.
So you will need to pre-parse the strings so that you can present them to the stricter parser in a format it understands, as in DateTime.parse('sep 04'). For your examples, it might be as simple as:
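require 'date'
datestring = 'sep 04 apr 06'
# scan returns every match as an array of strings; match would return only the first one
matches = datestring.scan(/[a-z]{3}\s\d{2,4}/)
if matches.size > 1
  # caveat: Date.parse reads 'sep 04' as 4 September of the current year
  matches.map { |m| Date.parse(m) }.join(' and ')
else
  Date.parse(datestring)
end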
However, if you want true natural language parsing, check out Chronic, which has all sorts of fancy parsers, such as Chronic.parse('summer').
Edit: Upon closer inspection, it seems that Chronic can only identify one date per string too, so your example 'sep 04 apr 06' still needs preprocessing.
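The preprocessing could look something like the sketch below. It is not a general solution: it assumes every date is a month name followed by a two- or four-digit year, so the day in a string like "29th may 1982" would be dropped. The helper name month_year_dates is hypothetical.

```ruby
require 'date'

# Hypothetical helper: finds "<month> <2- or 4-digit year>" chunks and
# returns a Date (first of the month) for each one that parses.
def month_year_dates(str)
  chunks = str.scan(/\b([a-z]{3,9})\s+(\d{4}|\d{2})\b/i)
  dates = chunks.map do |month, year|
    fmt = year.length == 4 ? '%b %Y' : '%b %y'
    begin
      Date.strptime("#{month} #{year}", fmt)
    rescue ArgumentError
      nil  # the word before the number was not a month name
    end
  end
  dates.compact
end

def describe(str)
  month_year_dates(str).map { |d| d.strftime('%B %Y') }.join(' and ')
end

describe('sep 04 apr 06')      #=> "September 2004 and April 2006"
describe('may 2006 may 2008')  #=> "May 2006 and May 2008"
describe('since oct 11')       #=> "October 2011"
```

Each chunk found this way could equally be handed to Chronic or Date.parse instead of strptime.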