How to determine a set of dates from a string in Rails

I have the following strings:

"sep 04 apr 06"
"29th may 1982"
"may 2006 may 2008"
"since oct 11"

      

Expected output:

"September 2004 and April 2006"
"29 May 1982"
"May 2006 and May 2008"
"October 2011"

      

Is there any way to extract the dates from these strings? I tried the gem 'date_from_string', but it could not get the dates correctly from the first string.



4 answers


The approach I have taken is as follows:

  1. Split the string into an array of words.
  2. If the array contains fewer than two words, return an array containing all the date strings found; otherwise go to step 3.
  3. If the array contains at least three words and the first three words represent a date, save it, remove the first three words from the array, and repeat step 2; otherwise go to step 4.
  4. If the first two words represent a date, save it, remove the first two words from the array, and repeat step 2; otherwise go to step 5.
  5. Remove the first word from the array and go to step 2.

I look for dates using the class method Date::strptime, which takes a format string. For example, '%d %b %Y' looks for the day of the month, followed by a space, followed by a three-character (case-insensitive) month abbreviation (Jan, Feb, ..., Dec), followed by a space and a four-digit year. (I initially considered using Date::parse, but it did not identify dates reliably enough.)
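As a quick illustration of how strptime behaves, here is a minimal sketch (the failing input 'since oct' is just an illustrative fragment). A string that fits the format is converted to a Date; one that does not fit raises ArgumentError, which is what makes strptime usable as a "is this a date?" test.

```ruby
require 'date'

# A string that fits the format is converted to a Date.
d = Date.strptime('29 may 1982', '%d %b %Y')
puts d.to_s   # 1982-05-29

# A string that does not fit the format raises ArgumentError
# (Date::Error, a subclass of ArgumentError, in recent Rubies).
begin
  Date.strptime('since oct', '%d %b %Y')
rescue ArgumentError => e
  puts "not a date: #{e.message}"
end
```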

Code

First, I create all the strptime format strings of interest for the month, day, and year:

MON = %w{ %b %B } # '%b' for 'Jan', '%B' for 'January'
YR  = %w{ %y %Y } # '%y' for '11', '%Y' for 2011
DAY = %w{ %d }    # '4', '04' or '28' 

PERM3 = MON.product(YR, DAY).
            flat_map { |arr| arr.permutation(3).to_a }.
            map { |arr| arr.join(' ') }
  #=> ["%b %y %d", "%b %d %y", "%y %b %d", "%y %d %b", "%d %b %y", "%d %y %b",
  #    "%b %Y %d", "%b %d %Y", "%Y %b %d", "%Y %d %b", "%d %b %Y", "%d %Y %b",
  #    "%B %y %d", "%B %d %y", "%y %B %d", "%y %d %B", "%d %B %y", "%d %y %B",
  #    "%B %Y %d", "%B %d %Y", "%Y %B %d", "%Y %d %B", "%d %B %Y", "%d %Y %B"] 

      

Then I do the same for the two-element permutations of month and year, and of month and day:

PERM2 = MON.product(YR).
            concat(MON.product(DAY)).
            flat_map { |arr| arr.permutation(2).to_a }.
            map { |arr| arr.join(' ') }               
  #=> ["%b %y", "%y %b", "%b %Y", "%Y %b", "%B %y", "%y %B",
  #    "%B %Y", "%Y %B", "%b %d", "%d %b", "%B %d", "%d %B"] 

      

Then I do the following:

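require 'date'

def pull_dates(str)
  arr = str.split
  dates = []
  while arr.size > 1
    # Try the first three words as a day/month/year date.
    if arr.size > 2
      a = depunc(arr[0,3])
      if date?(a, PERM3)
        dates << a.join(' ')
        arr.shift(3)
        next
      end
    end
    # Otherwise try the first two words (month/year or month/day).
    a = depunc(arr[0,2])
    if date?(a, PERM2)
      dates << a.join(' ')
      arr.shift(2)
      next
    end
    # No date starts here; drop the first word and continue.
    arr.shift
  end
  dates
end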

      



depunc removes any punctuation at the beginning and end of the string arr.join(' ').

def depunc(arr)
  arr.join(' ').gsub(/^\W|\W$/,'').split  
end
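To make the effect of depunc concrete, here is a small sketch using word groups taken from the example further down. The definition is repeated only so the snippet runs on its own. Note that the regex strips a single non-word character at each end of the joined string; punctuation in the middle is left alone.

```ruby
# Same definition as above, repeated so the snippet is self-contained.
def depunc(arr)
  arr.join(' ').gsub(/^\W|\W$/, '').split
end

p depunc(%w[29th may 1982.])   # ["29th", "may", "1982"]
p depunc(%w[apr 06. Greta])    # ["apr", "06.", "Greta"]
```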

      

date? determines whether the three- or two-element array arr represents a date. First I get a "cleaned" string from arr, and then I go through the applicable strptime format strings (argument perm), looking for one that shows the cleaned string can be converted to a date.

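def date?(arr, perm)
  clean = to_str_and_clean(arr)
  # True if at least one format string parses the cleaned string.
  perm.any? do |fmt|
    begin
      Date.strptime(clean, fmt)
      true
    rescue ArgumentError   # raised when the string does not fit the format
      false
    end
  end
end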

      

to_str_and_clean returns the cleaned string, with punctuation removed along with suffixes such as 'st', 'nd', 'rd' and 'th' that follow the numeric representation of the day.

def to_str_and_clean(arr)
  # "29th".to_i.to_s => "29", so ordinal suffixes are dropped; then strip punctuation
  arr.map { |s| s[0][/\d/] ? s.to_i.to_s : s }.join(' ').tr('.?!,:;', '')
end
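A short sketch of what to_str_and_clean produces for a few word groups from the example below. The definition is repeated only so the snippet runs on its own. Words that start with a digit are reduced to their integer value, so "29th" becomes "29" and "06." becomes "6".

```ruby
# Same logic as above, repeated so the snippet is self-contained.
def to_str_and_clean(arr)
  arr.map { |s| s[0][/\d/] ? s.to_i.to_s : s }.join(' ').tr('.?!,:;', '')
end

puts to_str_and_clean(%w[29th may 1982])  # 29 may 1982
puts to_str_and_clean(%w[apr 06.])        # apr 6
puts to_str_and_clean(%w[Hey, may])       # Hey may
```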

      

Example

Try it.

str =
"Bubba sighted a flying saucer on sep 04 2013 and again in apr 06. \
Greta was born on 29th may 1982. Hey, may 2006 may 2008 are two years apart.\
We have been at loose ends since oct 11 of this year."

pull_dates(str)
  #=> ["sep 04 2013", "apr 06", "29th may 1982", "may 2006 may", "oct 11"] 

      

Well, as you can see, it's not perfect: "may 2006 may 2008" comes back as the single match "may 2006 may", so "may 2008" is lost. Some tweaking is required, but this might get you started.
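To turn extracted strings into the output format the question asks for (for example "September 2004"), one possible sketch follows. The helper pretty_date is hypothetical and not part of the code above. It assumes that a two-digit number after a three-letter month abbreviation is a year, as in the question's "since oct 11" giving "October 2011"; that assumption would be wrong for text where the number is a day of the month.

```ruby
require 'date'

# Hypothetical helper (an assumption, not part of the answer above).
# Assumes a two-digit number following a month abbreviation is a year.
# Returns nil if the string has none of the three expected shapes.
def pretty_date(str)
  s = str.sub(/(\d)(?:st|nd|rd|th)\b/i, '\1')   # "29th" -> "29"
  case s
  when /\A\d{1,2} [a-z]{3} \d{4}\z/i
    Date.strptime(s, '%d %b %Y').strftime('%-d %B %Y')
  when /\A[a-z]{3} \d{4}\z/i
    Date.strptime(s, '%b %Y').strftime('%B %Y')
  when /\A[a-z]{3} \d{2}\z/i
    Date.strptime(s, '%b %y').strftime('%B %Y')
  end
end

puts ['sep 04', 'apr 06'].map { |d| pretty_date(d) }.join(' and ')
# September 2004 and April 2006
puts pretty_date('29th may 1982')   # 29 May 1982
puts pretty_date('oct 11')          # October 2011
```

The regular expressions decide the shape of the string first because a two-digit and a four-digit year cannot be told apart reliably by trying format strings alone.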



When you say, "Unfortunately, I cannot predict what format the date should be in," you are implying that you really need natural-language parsing, which is something the core Date and DateTime classes cannot and should not do.

So you will need to pre-process the strings so that you can present them to the stricter parser in a format it understands, such as DateTime.parse('sep 04'). For your examples, it might be as simple as:

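require 'date'

datestring = 'sep 04 apr 06'
# scan returns every match as an array; match would return only the first one
matches = datestring.scan(/[a-z]{3}\s\d{2,4}/i)
# Caveat: Date.parse reads 'sep 04' as September 4 of the current year
if matches.size > 1
  matches.map { |m| Date.parse(m) }.join(' and ')
else
  Date.parse(datestring)
end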

      



However, if you want true natural-language parsing, check out Chronic, which has all sorts of fancy parsers, like Chronic.parse('summer').

Edit: Upon closer inspection, it seems that Chronic can also identify only one date per string, so your example 'sep 04 apr 06' still needs preprocessing.



You can use the DateTime class:

DateTime.parse('sep 04 apr 06')

      

which outputs a DateTime object:

#<DateTime: 2006-04-04T00:00:00+00:00 ((2453830j,0s,0n),+0s,2299161j)>

      
