Aggregate Function Basics

Most developers have a workable knowledge of SQL and are quite familiar with the aggregate functions SUM, COUNT, MIN, MAX and such. Yet I still see logic similar to the following in PL/SQL code:

DECLARE
   l_rows   NUMBER;
BEGIN
   SELECT COUNT(*)
   INTO   l_rows
   FROM   some_table
   WHERE  some_column = some_value;

   dbms_output.put_line ('Rows found: ' || TO_CHAR(l_rows));
   
EXCEPTION
   WHEN NO_DATA_FOUND THEN
      dbms_output.put_line ('No rows found');
END;

To an experienced Oracle developer the defect is quite clear: the SQL statement will never throw the NO_DATA_FOUND exception. If there are no rows in the table that meet the selection criteria then the COUNT(*) function will return a value 0 and so we will always get back a row from the above query.

However, what happens in other scenarios with other aggregate functions? This post explorers some of the basic aspects of aggregate functions that appear to trip up inexperienced application developers.

For the examples in this post I will use the following table and data:

CREATE TABLE agg_demo
   (id      NUMBER(6)    NOT NULL
   ,cat     NUMBER(2)    NOT NULL
   ,val     NUMBER(4)
   ,ts      TIMESTAMP
   )
/

INSERT INTO agg_demo
SELECT ROWNUM AS id
,      MOD(ROWNUM,5) AS cat
,      CASE WHEN TRUNC(ROWNUM/7) = 0
            THEN NULL
            ELSE TRUNC(ROWNUM/7)
       END AS val
FROM   dual
CONNECT BY level <= 100
/

COMMIT
/

So, we have a table with:

  • 100 rows with a unique ID value
  • a CAT column with values that range from 0 to 4
  • a VAL column with values in the range of 0 to 14 that also contains NULLs

Let’s start with a simple row count:

SELECT COUNT(*)
FROM   agg_demo
/

  COUNT(*)
----------
       100

The result is quite obvious: we have 100 rows in our table. Adding a WHERE clause that excludes all rows results in the following:

SELECT COUNT(*)
FROM   agg_demo
WHERE  cat > 20
/

  COUNT(*)
----------
         0

This is the scenario that is similar to the one that I noted at the start of this post. We still get back a result and that result is 0. So, this query will never throw the PL/SQL NO_DATA_FOUND exception. Furthermore, this type of query will never return NULL, i.e. you will always get a value and that value will be an integer of 0 or greater.

When it comes to the COUNT function, there are three common ways that developers use it:

  • COUNT(*)
  • COUNT(1), or some other literal constant
  • COUNT(column)

So, what are the differences in the above:

SELECT COUNT(*) AS "count_*"
,      COUNT(1) AS "count_1"
,      COUNT(val) AS "count_val"
FROM   agg_demo
/

   count_*    count_1  count_val
---------- ---------- ----------
       100        100         94

We can see that COUNT(*) and COUNT(1) returned the same value, 100. Both of these functions are returning a count of the number of rows. As an aside, I have heard people claim that one approach is more efficient than the other but there’s no truth in these claims.

The COUNT(val) returns a different value than the other two COUNT functions, which highlights an important difference in how the COUNT function works. When the COUNT function is applied to a specific column, or any expression, then NULLs are not included in the counted result. So, the result of 94 is the number of non-NULL VAL entries.

Regardless of the way COUNT is expressed, when the WHERE clause eliminates all rows from the table the various COUNT functions will always return a value of 0:

SELECT COUNT(*) AS "count_*"
,      COUNT(1) AS "count_1"
,      COUNT(val) AS "count_val"
FROM   agg_demo
WHERE  cat > 20
/

   count_*    count_1  count_val
---------- ---------- ----------
         0          0          0

It’s interesting to note that COUNT(val) will return a value even if all the rows that the function is applied to contain NULL for the VAL column, i.e. we don’t get back NULL:

SELECT COUNT(*) AS "count_*"
,      COUNT(1) AS "count_1"
,      COUNT(val) AS "count_val"
FROM   agg_demo
WHERE  val IS NULL
/

   count_*    count_1  count_val
---------- ---------- ----------
         6          6          0

The above illustrates that it’s not necessary to include an NVL function around the count of a column in order to force a non-NULL response, as I’ve seen some developers do. For example, the NVL in the following query is unnecessary:

SELECT NVL(COUNT(val),0) AS "count_val"
FROM   agg_demo
WHERE  val IS NULL
/

 count_val
----------
         0

It’s also important to note that the above is very different to the following query:

SELECT COUNT(NVL(val,0)) AS "count_val"
FROM   agg_demo
WHERE  val IS NULL
/

 count_val
----------
         6

In the above query, the NVL function is applied prior to the aggregate COUNT function. This means that the COUNT function is not counting 6 NULLs, which it would normally ignore and therefore report a result of 0. Instead it is counting 6 0’s and returns a count of 6.

Now let’s go back a few steps and add a GROUP BY into the query with no WHERE clause:

SELECT cat
,      COUNT(*) AS "count_*"
,      COUNT(1) AS "count_1"
,      COUNT(val) AS "count_val"
FROM   agg_demo
GROUP  BY
       cat
ORDER  BY
       cat
/

       CAT    count_*    count_1  count_val
---------- ---------- ---------- ----------
         0         20         20         19
         1         20         20         18
         2         20         20         19
         3         20         20         19
         4         20         20         19

Ther’s probably no surprises here, we end up with a count of the rows per CAT for the COUNT(*) and COUNT(1) functions and a count of the non-NULL VAL entries for the COUNT(val) function. Let’s repeat this query with a WHERE clause that eliminates all rows:

SELECT cat
,      COUNT(*) AS "count_*"
,      COUNT(1) AS "count_1"
,      COUNT(val) AS "count_val"
FROM   agg_demo
WHERE  cat > 20
GROUP  BY
       cat
ORDER  BY
       cat
/

no rows selected

Ah, now that’s different. This time we didn’t get any rows back, not even one indicating that the various counts are 0. This is probably where most developers go wrong in assuming that COUNT(*) can return no rows and therefore could potentially raise a NO_DATA_FOUND exception in PL/SQL. Once GROUP BY is included in the SQL then it is possible for the query to return no rows since all rows from the table can be excluded and therefore there is nothing to perform the GROUP BY on.

So far we’ve been looking at the COUNT function. What about other aggregate functions? Let’s give SUM, MIN and AVG a try.

SELECT SUM(val) AS sum_val
,      MIN(val) AS min_val
,      AVG(val) AS avg_val
FROM   agg_demo
/

   SUM_VAL    MIN_VAL    AVG_VAL
---------- ---------- ----------
       679          1 7.22340426

Again, probably no surprises here. For the other aggregate functions we have to specify the column (or expression) on which we are operating, i.e. it makes no sense to try to calculate the average of a row, only a given column (or expression) in the row.

However, on closer consideration, this query does highlight an important point with aggregate functions: NULLs are ignored for the purpose of the calculations performed. For most functions and calculations, the inclusion of NULL will result in a NULL answer. This doesn’t apply to aggregate functions. A simple test illustrates the difference:

SELECT 1 + 2 + NULL AS result
FROM   dual
/

    RESULT
----------


WITH vals AS
   (SELECT 1    AS val FROM dual UNION ALL
    SELECT 2    AS val FROM dual UNION ALL
    SELECT NULL AS val FROM dual)
SELECT SUM(val) AS result
FROM   vals
/

    RESULT
----------
         3
/

In the above test we added the values 1, 2 and NULL using the addition operator (+) and the result returned was NULL. When we used the SUM aggregate function on the same set of values we got a result of 2. So, aggregate functions ignore NULLs when performing their calculations.

Let’s see what happens when we have a WHERE clause that eliminates all rows:

SELECT SUM(val) AS sum_val
,      MIN(val) AS min_val
,      AVG(val) AS avg_val
FROM   agg_demo
WHERE  cat > 20
/

   SUM_VAL    MIN_VAL    AVG_VAL
---------- ---------- ----------

Similar to the COUNT function we get back a result but in this case the results were NULL. This makes sense. COUNT told us that there were no results, i.e. we had a count of 0, so what else should the sum, minimum and average of no values be? However, keep in mind that we did actually get a result so, again, this type of query will not raise a NO_DATA_FOUND exception.

Potentially you may decide that you do not want NULL and instead default to some other value using the NVL function:

SELECT NVL(SUM(val),0) AS sum_val
,      NVL(MIN(val),0) AS min_val
,      NVL(AVG(val),0) AS avg_val
FROM   agg_demo
WHERE  cat > 20
/

   SUM_VAL    MIN_VAL    AVG_VAL
---------- ---------- ----------
         0          0          0

However, this is definitely incorrect from a mathematical perspective and more than likely to be incorrect from a business logic perspective. In a code review I certainly would be questioning the use of this type of logic very, very closely. As an aside, I find it amusing that something that is definitely incorrect mathematically may only be unlikely to be incorrect from a business perspective.

Going back to the GROUP BY example:

SELECT cat
,      SUM(val) AS sum_val
,      MIN(val) AS min_val
,      AVG(val) AS avg_val
FROM   agg_demo
GROUP  BY
       cat
ORDER  BY
       cat
/

       CAT    SUM_VAL    MIN_VAL    AVG_VAL
---------- ---------- ---------- ----------
         0        141          1 7.42105263
         1        130          1 7.22222222
         2        133          1          7
         3        136          1 7.15789474
         4        139          1 7.31578947

We get the sum, minimum and average of the VAL column for each CAT value. There’s nothing unusual here. Likewise, when we add a WHERE clause that eliminates all rows from the table we get:

SELECT cat
,      SUM(val) AS sum_val
,      MIN(val) AS min_val
,      AVG(val) AS avg_val
FROM   agg_demo
WHERE  cat > 20
GROUP  BY
       cat
ORDER  BY
       cat
/

no rows selected

This is the same as for the COUNT function so there should be no surprises here.

Advertisements

Client Identifier

I have mentioned previously about the session variables MODULE and ACTION. Just to recap, these are simply string variables that you can set to reflect what activity it is performing. There’s another variable that I didn’t mention in the previous post; Client Identfier.

Like both MODULE and ACTION, the client identifier is a string variable that can be set to whatever value makes sense according to how you intend to use the variable. The maxmum length for the client identifier is 64 bytes and the value is set by calling the routine DBMS_SESSION.SET_IDENTIFIER, e.g.:

BEGIN
   dbms_session.set_identifier ('my identifier');
END;

The current client identifier value can be accessed via the function call:

SYS_CONTEXT('userenv','client_identifier')

So, how should the client identifier be used? Well, for starters it’s probably best to understand where it’s exposed and what functionality can be tied to it. We can easily query the data dictionary to determine where the value is exposed. The variable is either named CLIENT_ID or CLIENT_IDENTIFIER:

SELECT owner
,      table_name
,      column_name
FROM   dba_tab_columns
WHERE  column_name IN ('CLIENT_ID','CLIENT_IDENTIFIER')
ORDER  BY
       owner
,      table_name

On my 11.2 installation on Windows, this query displayed the following:

OWNER      TABLE_NAME                     COLUMN_NAME
---------- ------------------------------ ------------------------------
SYS        ALL_SCHEDULER_JOBS             CLIENT_ID
SYS        ALL_SCHEDULER_JOB_LOG          CLIENT_ID
SYS        ALL_SCHEDULER_WINDOW_LOG       CLIENT_ID
SYS        DBA_AUDIT_EXISTS               CLIENT_ID
SYS        DBA_AUDIT_OBJECT               CLIENT_ID
SYS        DBA_AUDIT_SESSION              CLIENT_ID
SYS        DBA_AUDIT_STATEMENT            CLIENT_ID
SYS        DBA_AUDIT_TRAIL                CLIENT_ID
SYS        DBA_COMMON_AUDIT_TRAIL         CLIENT_ID
SYS        DBA_FGA_AUDIT_TRAIL            CLIENT_ID
SYS        DBA_HIST_ACTIVE_SESS_HISTORY   CLIENT_ID
SYS        DBA_SCHEDULER_JOBS             CLIENT_ID
SYS        DBA_SCHEDULER_JOB_LOG          CLIENT_ID
SYS        DBA_SCHEDULER_WINDOW_LOG       CLIENT_ID
SYS        GLOBAL_CONTEXT                 CLIENT_IDENTIFIER
SYS        GV_$ACTIVE_SESSION_HISTORY     CLIENT_ID
SYS        GV_$CLIENT_STATS               CLIENT_IDENTIFIER
SYS        GV_$GLOBALCONTEXT              CLIENT_IDENTIFIER
SYS        GV_$LOGMNR_CONTENTS            CLIENT_ID
SYS        GV_$SESSION                    CLIENT_IDENTIFIER
SYS        GV_$SQL_MONITOR                CLIENT_IDENTIFIER
SYS        KET$_CLIENT_CONFIG             CLIENT_ID
SYS        KET$_CLIENT_TASKS              CLIENT_ID
SYS        SCHEDULER$_EVENT_LOG           CLIENT_ID
SYS        SCHEDULER$_JOB                 CLIENT_ID
SYS        SCHEDULER$_LIGHTWEIGHT_JOB     CLIENT_ID
SYS        USER_AUDIT_OBJECT              CLIENT_ID
SYS        USER_AUDIT_SESSION             CLIENT_ID
SYS        USER_AUDIT_STATEMENT           CLIENT_ID
SYS        USER_AUDIT_TRAIL               CLIENT_ID
SYS        USER_SCHEDULER_JOBS            CLIENT_ID
SYS        USER_SCHEDULER_JOB_LOG         CLIENT_ID
SYS        V_$ACTIVE_SESSION_HISTORY      CLIENT_ID
SYS        V_$CLIENT_STATS                CLIENT_IDENTIFIER
SYS        V_$DIAG_ALERT_EXT              CLIENT_ID
SYS        V_$GLOBALCONTEXT               CLIENT_IDENTIFIER
SYS        V_$LOGMNR_CONTENTS             CLIENT_ID
SYS        V_$SESSION                     CLIENT_IDENTIFIER
SYS        V_$SQL_MONITOR                 CLIENT_IDENTIFIER
SYS        WRH$_ACTIVE_SESSION_HISTORY    CLIENT_ID
SYS        WRH$_ACTIVE_SESSION_HISTORY_BL CLIENT_ID
SYSMAN     MGMT_USER_CONTEXT              CLIENT_IDENTIFIER

Note: in the above listing the objects with names starting GV_$ and V_$ are the actual objects names for the equivalent GV$ and V$ names that are commonly used. The commonly used GV$ / V$ are actually synonyms, e.g. V$SESSION is a synonym that references the V_$SESSION object.

Of particular interest is that client identifier appears in V$SESSION, V$ACTIVE_SESSION_HISTORY and V$SQL_MONITOR, which makes it very attractive for monitoring purposes.

So, what value should the client identifier be set to? On the assumption that you are using MODULE and ACTION to track the part of the code a process is executing then I would suggest that client identifier be used to denote the instance of execution. I have worked with systems that have set the client identifier to:

  • A concatenation of various client details, including the end user network username and workstation, to get a value that can be tied to a specific end-user’s activity
  • A GUID to get a unique business process execution that is propagated across the application tiers

Setting the client identifier to reflect the end user initiating the database process provides an easy mechanism to trace individual user activity within the database. Modern applications rarely have separate database accounts for each user, which means that when you are required to investigate the actions of a specific user then it can be difficult to figure out which database session the user was connected to. With the client identifier set to reflect the user then obtaining a history of user activity from Active Session History can be as simple as:

SELECT *
FROM   v$active_session_history
WHERE  client_id = <<user>>

Furthermore, SQL trace can be enabled on sessions according to the client identifier using the routine DBMS_MONITOR.CLIENT_ID_TRACE_ENABLE. This can be particularly useful in those situations where you have a user complaining that the system is consistently giving them trouble even though no one else seems to be experiencing any problems. Tracing can be disabled using DBMS_MONITOR.CLIENT_ID_TRACE_DISABLE.

Additionally, DBMS_MONITOR.CIENT_ID_STAT_ENABLE will start the collecting of statistics for the client identifier provided. These statistics are exposed in the view V$CLIENT_STATS. The statistics available are a bit of a mix of the details exposed in V$SESSSTAT and V$SESS_TIME_MODEL.This provides for an easy way to measure how busy specific users are keeping the database, potentially tying into why they may be experiencing performance troubles. By way of an illustration, we can measure the logical reads used by all activity under a specific client identifier as follows. First we enable statistics collection for the client identifier:

EXEC dbms_monitor.client_id_stat_enable ('session_abc')

The above can be done in any session at any time prior to when you’d like to collect statistics in any session with privileges to run DBMS_MONITOR. We then run a simple workload of creating a table under the client identifier:

SQL> EXEC dbms_session.set_identifier ('session_abc')

PL/SQL procedure successfully completed.

SQL> SELECT stat_name
  2  ,      value
  3  FROM   v$client_stats
  4  WHERE  client_identifier = SYS_CONTEXT('userenv','client_identifier')
  5  AND    stat_name         = 'session logical reads'
  6  /

STAT_NAME                      VALUE
------------------------- ----------
session logical reads              0

SQL> CREATE TABLE stats_test
  2     (id        NUMBER(10) NOT NULL
  3     ,cat       NUMBER(2)  NOT NULL
  4     ,padding   CHAR(30)   NOT NULL
  5     )
  6  /

Table created.

SQL> INSERT INTO stats_test
  2  WITH generator AS
  3     (SELECT --+materialize
  4             ROWNUM AS rn
  5      FROM   dual
  6      CONNECT BY ROWNUM <= 1000)
  7  SELECT ROWNUM
  8  ,      MOD(ROWNUM,10)
  9  ,      'X'
 10  FROM   generator g1
 11  ,      generator g2
 12  /

1000000 rows created.

SQL> EXEC dbms_stats.gather_table_stats ('','STATS_TEST')

PL/SQL procedure successfully completed.

SQL> SELECT stat_name
  2  ,      value
  3  FROM   v$client_stats
  4  WHERE  client_identifier = SYS_CONTEXT('userenv','client_identifier')
  5  AND    stat_name         = 'session logical reads'
  6  /

STAT_NAME                      VALUE
------------------------- ----------
session logical reads          51706

So, we can see that our table creation process performed 51,706 logical reads. You can also see that we can determine the current client identifier using the call to SYS_CONTEXT(‘userenv’,’client_identifier’). Lastly, we probably want to disable statistics collection on the client identifier:

EXEC dbms_monitor.client_id_stat_disable ('session_abc')

The client identifier provides for a particularly useful mechanism for tracking individual user activity within the database. It is easy to establish and exposed in many locations within the database. The hardest aspect of utilising it is coming up with a strategy for setting it. However, once an application has been configured to utilise client identifier then tracking individual activity becomes quite a trivial activity.