Time Interval Grouping

On occasion I’ve needed to group time-based data into interval periods, e.g. 5-minute intervals. Oracle functions such as TRUNC and ROUND will help with grouping to the hour or minute, but what about other intervals? Unfortunately that’s when things get a little more complex. I need to do this just frequently enough to curse myself for not properly documenting the strategies I used the previous time I had to do it. So, this post is an attempt to rectify that and looks at a few strategies for grouping time measurements into arbitrary intervals.

Set-up

The interval grouping examples that follow will be based on the following table:

CREATE TABLE log_table
  (log_ts     TIMESTAMP     NOT NULL
  ,log_msg    VARCHAR2(100) NOT NULL
  )
/

-- insert 10000 rows into the table spread across the day
INSERT INTO log_table
  (log_ts, log_msg)
SELECT CAST((TRUNC(SYSDATE) + (dbms_random.value (0,1000)/1000))  AS TIMESTAMP)
,      'X' 
FROM   dual
CONNECT BY ROWNUM <= 10000
/

COMMIT
/

and just so we see the necessary date precision in SQL*Plus, we’ll change the NLS_DATE_FORMAT:

ALTER SESSION SET nls_date_format = 'dd/mm/yyyy hh24:mi'
/

All of the following strategies will be based on grouping the data in the table into 15-minute intervals and providing a count of the table entries for each interval.

Strategy 1

If the grouping is to be a relatively small number of groups on an hourly basis then using CASE can be effective:

SELECT TRUNC(log_ts,'HH') + CASE WHEN EXTRACT(MINUTE FROM log_ts) < 15 THEN 0
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 30 THEN (15/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 45 THEN (30/60/24)
                                 ELSE (45/60/24)
                             END AS interval_start
,      TRUNC(log_ts,'HH') + CASE WHEN EXTRACT(MINUTE FROM log_ts) < 15 THEN (15/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 30 THEN (30/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 45 THEN (45/60/24)
                                 ELSE (60/60/24)
                             END AS interval_end
,      COUNT(*) AS no_msgs
FROM   log_table
GROUP  BY
       TRUNC(log_ts,'HH') + CASE WHEN EXTRACT(MINUTE FROM log_ts) < 15 THEN 0
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 30 THEN (15/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 45 THEN (30/60/24)
                                 ELSE (45/60/24)
                             END
,      TRUNC(log_ts,'HH') + CASE WHEN EXTRACT(MINUTE FROM log_ts) < 15 THEN (15/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 30 THEN (30/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 45 THEN (45/60/24)
                                 ELSE (60/60/24)
                             END
ORDER  BY
       TRUNC(log_ts,'HH') + CASE WHEN EXTRACT(MINUTE FROM log_ts) < 15 THEN 0
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 30 THEN (15/60/24)
                                 WHEN EXTRACT(MINUTE FROM log_ts) < 45 THEN (30/60/24)
                                 ELSE (45/60/24)
                             END

This strategy makes it quite clear to anyone reading the code what’s being achieved. However, it is really only suitable for scenarios with a limited number of groupings, so it wouldn’t be practical if the intervals were to be, say, 2 minutes.

As an aside, the EXTRACT function will only extract sub-day intervals from TIMESTAMP data types and not the DATE data type, despite DATE having a time component. The Oracle documentation notes that this is due to how Oracle DATE data types map to ANSI standard DATE data types. This is easily overcome by casting the DATE to a TIMESTAMP, i.e. CAST(log_ts AS TIMESTAMP). Personally I find this behaviour rather annoying though…
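To illustrate the aside above, a quick sketch of the behaviour (I believe the error raised is ORA-30076, though do check against your own version):

```sql
-- EXTRACT of a sub-day field from a DATE fails
SELECT EXTRACT(MINUTE FROM SYSDATE) FROM dual;
-- ORA-30076: invalid extract field for extract source

-- casting the DATE to a TIMESTAMP first works as expected
SELECT EXTRACT(MINUTE FROM CAST(SYSDATE AS TIMESTAMP)) AS mins
FROM   dual;
```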

Strategy 2

The TO_CHAR function supports the format model SSSSS, which extracts the number of seconds since midnight from the DATE provided. Using this function it’s relatively easy to do some simple arithmetic to round the number of seconds down to the nearest interval boundary: divide by the number of seconds in the interval, discard any fraction and multiply by the number of seconds in the interval:

WITH intvl AS (SELECT 15*60 AS no_secs FROM dual)
SELECT TRUNC(log_ts) + ((TRUNC(TO_NUMBER(TO_CHAR(log_ts,'SSSSS'))/i.no_secs)*i.no_secs)*(1/24/60/60)) AS interval_start
,      TRUNC(log_ts) + (((TRUNC(TO_NUMBER(TO_CHAR(log_ts,'SSSSS'))/i.no_secs)+1)*i.no_secs)*(1/24/60/60)) AS interval_end
,      COUNT(*) AS no_logs
FROM   log_table
,      intvl i
GROUP  BY
       TRUNC(log_ts) + ((TRUNC(TO_NUMBER(TO_CHAR(log_ts,'SSSSS'))/i.no_secs)*i.no_secs)*(1/24/60/60))
,      TRUNC(log_ts) + (((TRUNC(TO_NUMBER(TO_CHAR(log_ts,'SSSSS'))/i.no_secs)+1)*i.no_secs)*(1/24/60/60))
ORDER  BY
       TRUNC(log_ts) + ((TRUNC(TO_NUMBER(TO_CHAR(log_ts,'SSSSS'))/i.no_secs)*i.no_secs)*(1/24/60/60))

The above query uses a WITH block to factor out the 15 minute interval in seconds so that it doesn’t need to be hard-coded throughout the query. This makes it easy to alter the query to group on other intervals. The primary downside is that it’s not quite as easy to understand as the CASE strategy, but it is probably more flexible.

Strategy 3

The next interval grouping strategy uses a WITH block to generate the set of intervals that the data is to be grouped into and then performs a join with the data in the table:

WITH int_dtls AS 
   (SELECT 15 * 60            AS no_secs
    ,      TRUNC(SYSDATE)     AS start_time
    ,      TRUNC(SYSDATE) + 1 AS end_time
    FROM   dual)
,intvls AS
   (SELECT  start_time + (no_secs/24/60/60*(ROWNUM-1)) AS interval_start
    ,       start_time + (no_secs/24/60/60*(ROWNUM))   AS interval_end
    FROM    int_dtls
    CONNECT BY
            start_time + (no_secs/24/60/60*(ROWNUM-1)) < end_time)
SELECT i.interval_start
,      i.interval_end
,      COUNT(*) AS no_logs
FROM   log_table l
,      intvls i
WHERE  l.log_ts >= i.interval_start
AND    l.log_ts <  i.interval_end
GROUP  BY
       i.interval_start
,      i.interval_end
ORDER  BY
       i.interval_start
,      i.interval_end

A drawback of this strategy is that the date range needs to be specified in the WITH block. However, a key advantage is that it’s possible to outer join the interval ranges to the table so as to generate an entry for all interval ranges, even if no entry exists in the table for a particular period, i.e.:

-- remove the entries between 1pm and 2pm
DELETE
FROM   log_table
WHERE  EXTRACT(HOUR FROM log_ts) = 13
/

COMMIT
/

WITH int_dtls AS 
   (SELECT 15 * 60            AS no_secs
    ,      TRUNC(SYSDATE)     AS start_time
    ,      TRUNC(SYSDATE) + 1 AS end_time
    FROM   dual)
,intvls AS
   (SELECT  start_time + (no_secs/24/60/60*(ROWNUM-1)) AS interval_start
    ,       start_time + (no_secs/24/60/60*(ROWNUM))   AS interval_end
    FROM    int_dtls
    CONNECT BY
            start_time + (no_secs/24/60/60*(ROWNUM-1)) < end_time)
SELECT i.interval_start
,      i.interval_end
,      COUNT(l.log_ts) AS no_logs
FROM   log_table l
,      intvls i
WHERE  l.log_ts(+) >= i.interval_start
AND    l.log_ts(+) <  i.interval_end
GROUP  BY
       i.interval_start
,      i.interval_end
ORDER  BY
       i.interval_start
,      i.interval_end

Note that in the above query we cannot count the number of log entries using COUNT(*) and had to change this to COUNT(l.log_ts). COUNT(*) would return 1 for intervals with no matching log entries, since the outer join still produces a row for those intervals; COUNT(l.log_ts) correctly returns 0 because the outer-joined l.log_ts is NULL.

I’m sure that others have strategies for grouping data into intervals not noted above, so if none of the above are suitable then a Google search should reveal other approaches. All of the strategies here are based on groupings of X minutes. If the grouping interval were to be X days then a bit of rework would be needed.
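As a rough sketch of that rework, Strategy 2’s arithmetic translates to day-level buckets by measuring whole days from a fixed epoch rather than seconds from midnight. The following is untested and assumes 3-day buckets counted from the start of the current month (FLOOR is used rather than TRUNC so that, should any data precede the epoch, the bucketing still rounds downwards):

```sql
WITH intvl AS (SELECT 3 AS no_days, TRUNC(SYSDATE,'MM') AS epoch FROM dual)
SELECT i.epoch + (FLOOR((TRUNC(l.log_ts) - i.epoch)/i.no_days)*i.no_days)     AS interval_start
,      i.epoch + ((FLOOR((TRUNC(l.log_ts) - i.epoch)/i.no_days)+1)*i.no_days) AS interval_end
,      COUNT(*) AS no_logs
FROM   log_table l
,      intvl i
GROUP  BY
       i.epoch + (FLOOR((TRUNC(l.log_ts) - i.epoch)/i.no_days)*i.no_days)
,      i.epoch + ((FLOOR((TRUNC(l.log_ts) - i.epoch)/i.no_days)+1)*i.no_days)
ORDER  BY 1
```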

Hopefully the next time I find myself needing to do this I will remember this post and not find myself reinventing the wheel again…


Download the scripts for the code used in this post.


Referencing Implicit Cursor Attributes (SQL%…)

You are probably familiar with the implicit cursor attributes available within PL/SQL; ROWCOUNT, FOUND, NOTFOUND, etc. As a quick review:

  • SQL%ROWCOUNT returns the number of rows modified by the previous SQL statement
  • SQL%FOUND returns TRUE if the previous SQL statement returned a result (SELECT) or if a DML statement modified any rows
  • SQL%NOTFOUND is the opposite of SQL%FOUND and returns TRUE if the previous SQL statement did not return any result or modify any data
  • SQL%ISOPEN always returns FALSE for the implicit cursor as the cursor is closed after the execution of the SQL statement

So, as a simple example, if I wanted to find out how many rows an UPDATE statement had modified I would code something like the following:

...
UPDATE emp
SET    sal = sal * 1.05
WHERE  deptno = 10;

dbms_output.put_line ('Number of employees updated: ' || TO_CHAR(SQL%ROWCOUNT));
...

I do not intend to dwell on the use of the attributes but rather focus on their placement within the code. The following sums up my opinion:

...
SQL statement (SELECT, INSERT, UPDATE, DELETE, MERGE)
<<<--- NOTHING EVER, EVER, EVER GOES HERE!!!!!!! --->>>
SQL%attribute use
...

My reasoning is quite simple: the implicit cursor attributes relate to the last SQL statement executed. If any code appears between the SQL statement and the use of the implicit cursor attribute then there is a risk that the implicit cursor attribute no longer reflects the SQL statement that you believe it does. For example, consider this code snippet:

...
UPDATE emp
SET    sal = sal * 1.05
WHERE  deptno = l_deptno;

dbms_output.put_line ('Number of employees updated: ' || TO_CHAR(SQL%ROWCOUNT));

IF (SQL%ROWCOUNT = 0) THEN
   RAISE_APPLICATION_ERROR (-20000,'ERROR: No employees updated!');
END IF;   
...

So, in the above code, we’re doing a simple update to the EMP table (l_deptno would be a variable), writing out a trace message using DBMS_OUTPUT and then raising an exception if we find that no rows got updated. Quite simple and nothing obviously amiss.

Suppose however that a developer later comes along and decides the DBMS_OUTPUT is not a suitable facility for writing trace messages. The developer has created a routine LOG.MSG and, since the call interface to this routine is the same as the PUT_LINE routine, a search and replace on the code should work nicely. Hence the code ends up as:

...
UPDATE emp
SET    sal = sal * 1.05
WHERE  deptno = l_deptno;

log.msg ('Number of employees updated: ' || TO_CHAR(SQL%ROWCOUNT));

IF (SQL%ROWCOUNT = 0) THEN
   RAISE_APPLICATION_ERROR (-20000,'ERROR: No employees updated!');
END IF;   
...

The bug lies in the reference to SQL%ROWCOUNT in the IF block. We might think that the reference to the implicit cursor relates to the UPDATE statement, but we’re now making a call to LOG.MSG between the SQL statement and the use of SQL%ROWCOUNT. If within LOG.MSG we execute something like:

INSERT INTO log_msg (...)
VALUES (...)

then the SQL%ROWCOUNT in the IF block will be based on the INSERT operation contained in LOG.MSG and not the UPDATE statement we think it refers to.

Initial examination of the code shows nothing obviously amiss and the code compiles. Furthermore, it’s likely that the code will pass a number of unit tests. It will only fail a test that deliberately updates no EMP rows and tries to ensure the correct exception is raised. Hopefully such a unit test exists…

If I were feeling particularly cautious, I would recommend the following:

SQL statement (SELECT, INSERT, UPDATE, DELETE, MERGE)
assign SQL%attribute to a local variable
use the local variable instead of the SQL%attribute from here on

By placing the implicit cursor attributes required into local variables the risk of anything going wrong is further lessened. Defensive programming techniques like this help make code more robust over a longer period of time.
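Applying that recommendation to the earlier example gives something like the following (LOG.MSG being the hypothetical logging routine from the scenario above):

```sql
DECLARE
   l_deptno        emp.deptno%TYPE := 10;
   l_rows_updated  PLS_INTEGER;
BEGIN
   UPDATE emp
   SET    sal = sal * 1.05
   WHERE  deptno = l_deptno;

   -- capture the attribute immediately, before any other statement can run
   l_rows_updated := SQL%ROWCOUNT;

   -- safe: even if log.msg runs its own SQL internally,
   -- our captured row count is unaffected
   log.msg ('Number of employees updated: ' || TO_CHAR(l_rows_updated));

   IF (l_rows_updated = 0) THEN
      RAISE_APPLICATION_ERROR (-20000,'ERROR: No employees updated!');
   END IF;
END;
```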

Okay, I’m sure you’re getting my point by now…

… but one further thing to note before I finish up. Let’s take a look at a slightly different scenario:

...
UPDATE emp
SET    sal = sal * 1.05
WHERE  deptno = 10;

COMMIT;

dbms_output.put_line ('Number of employees updated: ' || TO_CHAR(SQL%ROWCOUNT));
...

In this scenario I’ve added a COMMIT between the UPDATE statement and the DBMS_OUTPUT line. So, what does SQL%ROWCOUNT return?

The answer is 0. In the above scenario, SQL%ROWCOUNT reflects the row count of the COMMIT statement, not the UPDATE statement. The implicit cursor attributes reflect the operation of the last SQL statement and aren’t restricted to simple DML statements. COMMIT and ROLLBACK will return 0 for the ROWCOUNT attribute. A “CREATE TABLE … AS SELECT…” operation, if executed using EXECUTE IMMEDIATE, will return the number of rows inserted into the newly created table.

So, in summary, I would urge developers to be careful of where they place references to the implicit cursor attributes. If any code comes between the SQL statement and the implicit cursor attribute reference then the attribute value should be captured into a local variable and the code should then reference the variable.