Quickly convert timestamps to calculate duration

We have a log analyzer that processes logs on the order of 100 GB (my test file is ~20 million lines, 1.8 GB). It takes longer than I would like (up to half a day), so I ran it under cProfile, and >75% of the time is taken by strptime:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.253    0.253  560.629  560.629 <string>:1(<module>)
20000423  202.508    0.000  352.246    0.000 _strptime.py:299(_strptime)


The timestamps are parsed to calculate the duration between log entries; currently:

ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
lduration = (ltime - otime).total_seconds()


where otime is the timestamp from the previous line.

The log files are pipe-delimited, one entry per line:

0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63
0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63
0000 | 774 | 475      | 2017-03-29 01:19:50 | M      |        63
0001 | 774 | 475      | 2017-03-29 09:42:57 | M      |        63
0000 | 775 | 475      | 2017-03-29 10:24:34 | M      |        63
0001 | 775 | 475      | 2017-03-29 10:33:46 | M      |        63    
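
For context, the relevant loop looks roughly like this (a runnable sketch; time_col = 3 and the durations helper are illustrative, not the actual analyzer code):

from datetime import datetime

time_col = 3  # assumed: the timestamp is the fourth pipe-delimited field

def durations(lines):
    # Yield the seconds elapsed between consecutive log entries
    otime = None
    for line in lines:
        split_line = line.split('|')
        ltime = datetime.strptime(split_line[time_col].strip(),
                                  "%Y-%m-%d %H:%M:%S")
        if otime is not None:
            yield (ltime - otime).total_seconds()
        otime = ltime

sample = [
    "0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63",
    "0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63",
]
print(list(durations(sample)))  # [3736.0]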


It takes almost 10 minutes to run against the test file.

Replacing strptime() with this (from this question):

import datetime

def to_datetime(d):
    # Slice the fixed-width "YYYY-MM-DD HH:MM:SS" fields directly
    return datetime.datetime(int(d[:4]),
                             int(d[5:7]),
                             int(d[8:10]),
                             int(d[11:13]),
                             int(d[14:16]),
                             int(d[17:19]))


brings it to just over 3 minutes.

cProfile now reports:

       1    0.265    0.265  194.538  194.538 <string>:1(<module>)
20000423   62.688    0.000   62.688    0.000 analyzer.py:88(to_datetime)
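
If you want to sanity-check the strptime() vs. slicing gap on your own machine, here is a quick timeit harness (mine, not part of the analyzer; the iteration count is arbitrary):

import timeit

setup = """
import datetime
s = "2017-03-29 00:06:47"

def to_datetime(d):
    return datetime.datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                             int(d[11:13]), int(d[14:16]), int(d[17:19]))
"""

n = 1000000
print(timeit.timeit('datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S")',
                    setup=setup, number=n))
print(timeit.timeit('to_datetime(s)', setup=setup, number=n))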


This conversion still accounts for about a third of the analyzer's total runtime. Inlining the function body shaves roughly 20% off the conversion cost, but we are still spending about 25% of the runtime turning those timestamp strings into datetime objects (with total_seconds() consuming ~5% more on top of that).

I could fall back to logging a custom timestamp in seconds instead of a datetime for the full crawl, unless anyone else has another bright idea?
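
To make that fallback concrete, the datetime objects could be skipped entirely by converting each fixed-format timestamp straight to Unix time, so durations become plain integer subtraction (a sketch, assuming the timestamps are UTC; to_epoch_seconds is my name for it):

import calendar

def to_epoch_seconds(d):
    # Slice the fixed-width fields and convert to Unix time (assumes UTC);
    # timegm only reads the first six values, the zeros just pad the tuple
    return calendar.timegm((int(d[:4]), int(d[5:7]), int(d[8:10]),
                            int(d[11:13]), int(d[14:16]), int(d[17:19]),
                            0, 0, 0))

otime = to_epoch_seconds("2017-03-29 00:06:47")
ltime = to_epoch_seconds("2017-03-29 01:09:03")
print(ltime - otime)  # 3736, no total_seconds() needed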



1 answer


So I kept looking and I found a module that does a fantastic job:

Introducing ciso8601:

from ciso8601 import parse_datetime
...
ltime = parse_datetime(split_line[time_col].strip())
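
For anyone following along: ciso8601 is on PyPI (pip install ciso8601), and as the snippet above shows, it accepts the space-separated timestamps from these logs directly:

from ciso8601 import parse_datetime

print(parse_datetime("2017-03-29 00:06:47"))
# -> datetime.datetime(2017, 3, 29, 0, 6, 47)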

Which, via cProfile:

       1    0.254    0.254  123.795  123.795 <string>:1(<module>)
20000423    4.188    0.000    4.188    0.000 {ciso8601.parse_datetime}


which is 84x faster than the naive approach through datetime.strptime() (352.2 s vs 4.2 s of cumulative parse time above)... which is not surprising, considering it's a C module written to do exactly this.
