BigQuery SQL: determining time intervals of minimum length when a condition is met

A simplified version of my problem: I have a table with the following fields: id, timestamp, and a numeric variable (speed). I need to define time periods (start and end timestamps) where the average speed is less than a threshold (say 2), and where the time interval (end timestamp - start timestamp) is at least a minimum duration (say 5 hours or more). Basically, I need to compute the average over the initial 5-hour window, and if that average is less than the threshold, store the initial timestamp and step the end timestamp forward one row, recalculating the average. If the new average is still less than the threshold, step forward again, expanding the time window. If the new average exceeds the threshold, report the previous end_timestamp as the end_timestamp for that window, set a new start_timestamp, and calculate a new 5-hour forward average. Ultimately, the final product is a table with a set of start_timestamps, end_timestamps (and a calculated duration) between which the average speed was less than 2 and the time between start and end is at least 5 hours.
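
To make the intended logic concrete, here is a rough JavaScript-style sketch (illustrative only, not what I am running; rows holds one id's readings sorted by timestamp, with ts in seconds):

// Rough sketch of the expanding-window scan described above.
function findSlowPeriods(rows, minLenSec, threshold) {
  var periods = [];
  if (rows.length === 0) return periods;
  var start = 0;   // index where the current candidate window begins
  var sum = 0;     // running sum of speed inside the window
  var count = 0;   // number of rows inside the window
  for (var i = 0; i < rows.length; i++) {
    sum += rows[i].speed;
    count++;
    if (sum / count >= threshold) {
      // The expanded window just failed: report the part up to the
      // previous row if it was below the threshold and long enough.
      if (count > 1) {
        var prevAvg = (sum - rows[i].speed) / (count - 1);
        if (prevAvg < threshold && rows[i - 1].ts - rows[start].ts >= minLenSec) {
          periods.push({ start: rows[start].ts, end: rows[i - 1].ts, avg: prevAvg });
        }
      }
      // Restart the candidate window at the current row.
      start = i;
      sum = rows[i].speed;
      count = 1;
    }
  }
  // Flush a qualifying window that runs to the end of the data.
  if (sum / count < threshold && rows[rows.length - 1].ts - rows[start].ts >= minLenSec) {
    periods.push({ start: rows[start].ts, end: rows[rows.length - 1].ts, avg: sum / count });
  }
  return periods;
}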

I am using Google BigQuery. Here is the general framework I have so far, but it doesn't behave as I would like. First, it only checks and reports the speed threshold for the initial 5-hour window, even as the window grows. Second, the window does not appear to grow correctly: it is rarely much longer than 5 hours, even though, looking at my data, in some cases it should be twice as long. I hope someone has tried to develop a similar analysis and can shed some light on where mine is going wrong.

SELECT
  *,
  LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_start_timestamp,
  LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_end_timestamp
FROM (
  SELECT
    *,
    IF(last_timestamp IS NULL
      OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp,  # 1000000*60*60*5 = 5 hours in microseconds
    IF(next_timestamp IS NULL
      OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp     # 1000000*60*60*5 = 5 hours in microseconds
  FROM (
    SELECT
      *,
      LAG(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) AS last_timestamp,
      LEAD(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) AS next_timestamp
    FROM (
      SELECT
        *,
        AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5*60*60*1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period
      FROM (
        SELECT
          id,
          timestamp,
          speed
        FROM
          [dataset.table1]))
    WHERE
      avg_speed_last_period < 2
    ORDER BY
      id,
      timestamp)
  HAVING
    start_timestamp
    OR end_timestamp)

EDIT: Here is a link to sample_data. Given this data and the requirement for an average speed of less than 2 for at least 5 hours, the first rows of the output table will hopefully be:

 ID    start_event                end_event                  average_speed   duration_hrs
 203   2015-01-08 17:40:06 UTC    2015-01-09 07:09:35 UTC    0.7802          13.491
 203   2015-01-10 03:43:56 UTC    2015-01-10 08:48:57 UTC    1.452           5.083
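For reference, duration_hrs is just end_event minus start_event expressed in hours; checking the first expected row, e.g. in JavaScript:

// Sanity check on the first expected row: 13h 29m 29s ≈ 13.491 hours
var ms = Date.parse('2015-01-09T07:09:35Z') - Date.parse('2015-01-08T17:40:06Z');
console.log(ms / 3600000);  // 13.4913...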

1 answer


From your CSV, I assume the schema below: id (INTEGER), datetime (STRING), speed (FLOAT), loaded with your sample data.

With this in mind, below is working code for BigQuery Standard SQL. It produces exactly the output you expect:

 id    start_event                end_event                  average_speed   duration_hrs
 203   2015-01-08 17:40:00 UTC    2015-01-09 07:09:00 UTC    0.78            13.48
 203   2015-01-10 03:43:00 UTC    2015-01-10 08:48:00 UTC    1.45            5.08

#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
  items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
  min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
  var result = [];
  var initial = 0;                   // index of the first point of the candidate window
  var candidate = items[initial].ts;
  var len = 0;                       // number of points in the current window
  var sum = 0;                       // running sum of speeds in the current window

  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;

    // While the window is still shorter than min_length, only a single
    // point above max_speed disqualifies it; restart just past that point.
    if (items[i].ts - candidate < min_length) {
      if (items[i].speed > max_speed) {
        initial = i + 1;
        if (initial >= items.length) break;  // ran off the end of the data
        candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }

    // The window is at least min_length long. If the current point breaks it
    // (average too high, or an individual speed above max_speed), check whether
    // the window up to the previous point qualifies, then restart.
    if (sum / len > threshold || items[i].speed > max_speed) {
      var avg_speed = (sum - items[i].speed) / (len - 1);
      if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
        var o = {};
        o.start_event = items[initial].datetime;
        o.end_event = items[i - 1].datetime;
        o.average_speed = parseFloat(avg_speed.toFixed(3));
        o.duration_hrs = parseFloat(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
        result.push(o);
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }

  return result;
""";

WITH data AS (
  # adjust this subquery to match the datatype of your datetime field (see note below)
  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`
), compact_data AS (
  # collapse each id's points into one ordered array so the UDF can scan them
  SELECT id, ARRAY_AGG(STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime) ORDER BY UNIX_SECONDS(datetime)) AS points
  FROM data
  GROUP BY id
)
SELECT
  id, start_event, end_event, average_speed, duration_hrs
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment  # 5*60*60 = 5 hours in seconds
ORDER BY id, start_event
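
Since the heavy lifting happens inside the JavaScript UDF, it can help to smoke-test the same body locally (for example in Node.js) before running it in BigQuery. A hypothetical harness; the function wrapper, sample values, and expected result are mine:

// Hypothetical local harness for the UDF logic; not part of the query.
// Wrap the JS body of IdentifyTimeRanges above in a plain function:
function identifyTimeRanges(items, min_length, threshold, max_speed) {
  // ... paste the body of the UDF here ...
}

// Synthetic points: one reading every 20 minutes; slow (1.0) for the
// first 20 points, then fast (4.0). ts is in seconds, as in the query.
var points = [];
for (var k = 0; k < 30; k++) {
  points.push({ ts: k * 1200, speed: k < 20 ? 1.0 : 4.0, datetime: k * 1200 });
}

console.log(identifyTimeRanges(points, 5 * 60 * 60, 2, 3.1));
// With the body pasted in, this reports a single range covering the slow
// stretch: ts 0 .. 22800 (6.333 hours) with average_speed 1.0.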

Note: this code uses a JavaScript User-Defined Function; what its limits, quotas, and cost hit are for you depends on the size of your data.

Also keep in mind: if the datatype of your datetime field is not STRING, you only need to adjust the `data` subquery a little; the rest should stay as it is!

For example, if datetime is of TIMESTAMP data type, you just need to replace

  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`


with

  SELECT id, datetime, speed
  FROM `yourTable`


Hope you like it :o)
