Package: pkg_delay

Working on a traditional OLTP system means dealing with lots of short transactions. Handling concurrency correctly is a significant concern and application design is tailored to avoid blocking where possible. Having said that, concurrency problems do occasionally arise. Identifying the source of such a problem can be tricky but the resolution is often simple; the biggest challenge usually lies in testing the solution.

In a system where the duration of transactions is measured in milliseconds it can be near impossible to deliberately recreate the scenario that gave rise to a concurrency issue. For example, a software tester recently encountered a deadlock whilst running a trivial test, unrelated to the functionality that they were testing. A look at the trace files generated by Oracle showed that two separate code paths obtained locks on resources in opposite orders, i.e. one process obtained locks on A and then B, whilst the other did B and then A. This problem had never occurred in production as the probability of it occurring was very, very small… but it had occurred by chance in the test environment. Fixing the code was a trivial task, but in order to test that the problem had been resolved we had to slow down the code to widen the window of opportunity for it to occur. This was done by inserting a sleep between the points where the two locks were obtained. Hence PKG_DELAY was born…

Hard coding a DBMS_LOCK.SLEEP command into the PL/SQL code is a crude but effective way of delaying code execution at some critical point. PKG_DELAY has a couple of advantages over a simple call to DBMS_LOCK.SLEEP:

  • The delay period is controlled via a global context. This permits the delay to be set by a separate session when the test is ready to start, rather than having it always on.
  • The delay period can be adjusted at run time, again because it is controlled via a global context. This has proved useful in tests that have sought to assess the impact of slow processes within the system.
  • Any code calling out to PKG_DELAY will have a dependency on this package. Since PKG_DELAY is never deployed to the production environment, if any test code containing a call out to PKG_DELAY is deployed then it will fail compilation. Not the best scenario perhaps but better than deploying code with a sleep in it…

PKG_DELAY has just two routines:

  • set_delay
    Takes a delay name and a duration for that delay
  • delay
    Sleeps for whatever duration set_delay has assigned to the given name

So, a delay call might be inserted into code as:

...
pkg_delay.delay ('BEFORE_OPERATION');
...

The code will now run as usual with virtually no overhead from the call to PKG_DELAY. Once the test is ready to be performed, requiring, say, a 2 second delay, the following is executed in a separate session:

EXEC pkg_delay.set_delay ('BEFORE_OPERATION',2)
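
For completeness, below is a minimal sketch of how such a package might be implemented. The context name and parameter names are my own assumptions; the downloadable version may well differ:

CREATE OR REPLACE CONTEXT delay_ctx USING pkg_delay ACCESSED GLOBALLY
/

CREATE OR REPLACE PACKAGE pkg_delay
AS
   -- assign a duration, in seconds, to a named delay point
   PROCEDURE set_delay (p_delay_name IN VARCHAR2
                       ,p_seconds    IN NUMBER);
   -- sleep for whatever duration is currently assigned to the named delay point
   PROCEDURE delay (p_delay_name IN VARCHAR2);
END pkg_delay;
/

CREATE OR REPLACE PACKAGE BODY pkg_delay
AS
   PROCEDURE set_delay (p_delay_name IN VARCHAR2
                       ,p_seconds    IN NUMBER)
   AS
   BEGIN
      dbms_session.set_context ('delay_ctx', p_delay_name, TO_CHAR(p_seconds));
   END set_delay;

   PROCEDURE delay (p_delay_name IN VARCHAR2)
   AS
      l_seconds NUMBER := TO_NUMBER(SYS_CONTEXT('delay_ctx', p_delay_name));
   BEGIN
      -- a no-op unless a delay has been set, so the overhead is negligible
      IF l_seconds > 0 THEN
         dbms_lock.sleep (l_seconds);
      END IF;
   END delay;
END pkg_delay;
/

The ACCESSED GLOBALLY clause is the important part: it makes a value set via DBMS_SESSION.SET_CONTEXT in one session visible to all sessions, which is what allows the delay to be switched on and adjusted from outside the process under test.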

Download PKG_DELAY

skip_unusable_indexes

Back in Oracle 10g the session-modifiable parameter SKIP_UNUSABLE_INDEXES changed its default value from FALSE to TRUE. This meant that operations on tables with an unusable index no longer failed with an error when the optimiser would otherwise have arrived at an execution plan utilising that index. For all intents and purposes Oracle ignores that the index is present… well, nearly ignores it, as we will see shortly.

First of all, let’s take a quick look at what SKIP_UNUSABLE_INDEXES does using a 1,000,000 row table named LOTS_OF_ROWS:

SQL>CREATE TABLE lots_of_rows
  2     (id         NUMBER (10) NOT NULL
  3     ,cat        NUMBER (4)  NOT NULL
  4     ,padding    CHAR(30)    NOT NULL)
  5  /

Table created.

SQL>ALTER TABLE lots_of_rows
  2     ADD CONSTRAINT lots_of_rows_pk
  3     PRIMARY KEY (id)
  4  /

Table altered.

SQL>CREATE INDEX lots_of_rows_ix1
  2     ON lots_of_rows (cat)
  3  /

Index created.

SQL>INSERT INTO lots_of_rows
  2  WITH row_src AS
  3     (SELECT ROWNUM r FROM dual CONNECT BY ROWNUM <= 1000)
  4  SELECT ROWNUM
  5  ,      MOD(ROWNUM,1000)
  6  ,      'X'
  7  FROM   row_src r1
  8  ,      row_src r2
  9  /

1000000 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>EXEC dbms_stats.gather_table_stats ('','lots_of_rows')

PL/SQL procedure successfully completed.

SQL>ALTER SESSION SET skip_unusable_indexes = TRUE
  2  /

Session altered.

SQL>SET AUTOTRACE TRACEONLY EXPLAIN

SQL>SELECT *
  2  FROM   lots_of_rows
  3  WHERE  cat = 42
  4  /

Execution Plan
----------------------------------------------------------
Plan hash value: 3680026695

--------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                  |  1000 | 40000 |  1005   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| LOTS_OF_ROWS     |  1000 | 40000 |  1005   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | LOTS_OF_ROWS_IX1 |  1000 |       |     5   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CAT"=42)

SQL>SET AUTOTRACE OFF

In the above we created a table, LOTS_OF_ROWS, with a non-unique index on the CAT column, which has 1,000 distinct values, and populated it with 1,000,000 rows. We then ran a query selecting the rows with a CAT value of 42. Not surprisingly, Oracle elected to use the index when running the query.

Now let’s see what happens if we make the index unusable and rerun the query:

SQL>ALTER INDEX lots_of_rows_ix1 UNUSABLE
  2  /

Index altered.

SQL>SET AUTOTRACE TRACEONLY EXPLAIN

SQL>SELECT *
  2  FROM   lots_of_rows
  3  WHERE  cat = 42
  4  /

Execution Plan
----------------------------------------------------------
Plan hash value: 3711500201

----------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |              |  1000 | 40000 |  1687   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| LOTS_OF_ROWS |  1000 | 40000 |  1687   (1)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("CAT"=42)

SQL>SET AUTOTRACE OFF

The previously generated execution plan for our query, using the index, clearly could not be used but Oracle did not fail the query. Instead Oracle came up with a new plan, using a table scan. Perhaps not the best of plans but our query did run without error.

Now let’s see what happens if SKIP_UNUSABLE_INDEXES is set to FALSE:

SQL>ALTER SESSION SET skip_unusable_indexes = FALSE
  2  /

Session altered.

SQL>SET AUTOTRACE TRACEONLY EXPLAIN

SQL>SELECT *
  2  FROM   lots_of_rows
  3  WHERE  cat = 42
  4  /
SELECT *
*
ERROR at line 1:
ORA-01502: index 'DEVELOPER.LOTS_OF_ROWS_IX1' or partition of such index is in unusable state

SQL>SET AUTOTRACE OFF

Not surprisingly, with SKIP_UNUSABLE_INDEXES set to FALSE Oracle failed the query.

However, can ORA-01502 be thrown even if SKIP_UNUSABLE_INDEXES is set to TRUE? The answer is yes, so let’s look at how this might happen:

SQL>ALTER SESSION SET skip_unusable_indexes = TRUE
  2  /

Session altered.

SQL>SET AUTOTRACE TRACEONLY EXPLAIN

SQL>SELECT /*+ index (lots_of_rows lots_of_rows_ix1) */
  2         *
  3  FROM   lots_of_rows
  4  WHERE  cat = 42
  5  /
SELECT /*+ index (lots_of_rows lots_of_rows_ix1) */
*
ERROR at line 1:
ORA-01502: index 'DEVELOPER.LOTS_OF_ROWS_IX1' or partition of such index is in unusable state

SQL>SET AUTOTRACE OFF

In the above example, we explicitly hinted the query to use the unusable index. In this situation Oracle will attempt to abide by the hint and will fail the statement when it finds the index is unusable. This also applies to the variation of the INDEX hint that specifies a column name rather than an index name:

SQL>SET AUTOTRACE TRACEONLY EXPLAIN

SQL>SELECT /*+ index (lots_of_rows (cat)) */
  2         *
  3  FROM   lots_of_rows
  4  WHERE  cat = 42
  5  /
SELECT /*+ index (lots_of_rows (cat)) */
*
ERROR at line 1:
ORA-01502: index 'DEVELOPER.LOTS_OF_ROWS_IX1' or partition of such index is in unusable state

SQL>SET AUTOTRACE OFF

To summarise this post then:

  • When the SKIP_UNUSABLE_INDEXES parameter is set to TRUE it allows Oracle to ignore any indexes in an unusable state, and this is the default setting
  • Skipping unusable indexes can lead to poor execution plans, which should not be surprising since those indexes were created for a reason
  • Queries specifically hinted to use an index will not ignore the unusable index and will throw an ORA-01502 error
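
One final point: skipping the index is only a stopgap. Once whatever caused the index to become unusable has been dealt with, the index must be rebuilt before the optimiser will consider it again:

ALTER INDEX lots_of_rows_ix1 REBUILD
/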

Existence checks using COUNT(*)

One thing guaranteed to draw my ire in a code review is the use of COUNT(*) to test for the existence of data. This anti-pattern crops up in a number of forms. For example, a simple count via a SELECT statement followed by a check that the result was 1 or greater:

...
SELECT COUNT(*)
INTO   l_rows_exist
FROM   some_table
WHERE  ... blah,blah,blah;

IF (l_rows_exist >= 1) THEN
   ... blah,blah,blah;
END IF;

or as some sort of logical sub-query construct inside a query:

SELECT ... blah,blah,blah
FROM   some_table
WHERE  (SELECT COUNT(*) FROM some_other_table WHERE ... blah,blah,blah) >= 1

Unless some constraint guarantees that the query will return at most one row, counting all entries simply to detect whether one exists is asking for trouble. Oracle will cheerfully table scan a billion row table looking for all rows that satisfy the query, even after it found a matching entry in the first block examined.

Let’s see that in action on a 1,000,000 row table, using AUTOTRACE to obtain some statistics:

SQL>CREATE TABLE lots_of_rows
  2     (id      NUMBER(10) NOT NULL
  3     ,cat     NUMBER(2)  NOT NULL
  4     ,padding CHAR(30)   NOT NULL)
  5  /

Table created.

SQL>-- generate 1,000,000 rows
SQL>INSERT INTO lots_of_rows
  2  WITH row_src AS
  3     (SELECT ROWNUM r FROM dual CONNECT BY ROWNUM <= 1000)
  4  SELECT ROWNUM
  5  ,      MOD(ROWNUM,100)
  6  ,      'X'
  7  FROM   row_src r1
  8  ,      row_src r2
  9  /

1000000 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>EXEC dbms_stats.gather_table_stats ('','lots_of_rows')

PL/SQL procedure successfully completed.

SQL>SET AUTOTRACE TRACEONLY

SQL>SELECT 'X'
  2  FROM   dual
  3  WHERE  (SELECT COUNT(*)
  4          FROM   lots_of_rows
  5          WHERE  cat = 42) >= 1
  6  /


Execution Plan
----------------------------------------------------------
Plan hash value: 3376271402

------------------------------------------------------------------------------------
| Id  | Operation           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |              |     1 |       |  1652   (1)| 00:00:01 |
|*  1 |  FILTER             |              |       |       |            |          |
|   2 |   FAST DUAL         |              |     1 |       |     2   (0)| 00:00:01 |
|   3 |   SORT AGGREGATE    |              |     1 |     3 |            |          |
|*  4 |    TABLE ACCESS FULL| LOTS_OF_ROWS | 10000 | 30000 |  1650   (1)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( (SELECT COUNT(*) FROM "LOTS_OF_ROWS" "LOTS_OF_ROWS" WHERE
              "CAT"=42)>=1)
   4 - filter("CAT"=42)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       6069  consistent gets
          0  physical reads
          0  redo size
        536  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

The execution plan above clearly shows a full table scan and the consistent gets statistic would confirm that the entire table was indeed scanned. The solution to this situation is trivial: rewrite the query to use EXISTS. In our example above, this would be to use:

SELECT ... blah,blah,blah
FROM   some_table
WHERE  EXISTS (SELECT 'X' FROM some_other_table WHERE ... blah,blah,blah)

The impact of this simple change can be seen using AUTOTRACE:

SQL>SELECT 'X'
  2  FROM   dual
  3  WHERE  EXISTS(SELECT 'X'
  4                FROM   lots_of_rows
  5                WHERE  cat = 42)
  6  /


Execution Plan
----------------------------------------------------------
Plan hash value: 4135382906

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |       |     4   (0)| 00:00:01 |
|*  1 |  FILTER            |              |       |       |            |          |
|   2 |   FAST DUAL        |              |     1 |       |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| LOTS_OF_ROWS |     2 |     6 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( EXISTS (SELECT 0 FROM "LOTS_OF_ROWS" "LOTS_OF_ROWS" WHERE
              "CAT"=42))
   3 - filter("CAT"=42)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads
          0  redo size
        536  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

While the execution plan still shows a full table scan, the consistent gets figure has dropped to just 3. This would be the number of gets Oracle performed before it found the first “42” entry and stopped scanning the table. The query also has the advantage of being self-documenting, as EXISTS describes perfectly what is being sought.
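
For the PL/SQL pattern shown at the top of this post, where an EXISTS condition cannot be used directly in an IF statement, a minimal sketch of an equivalent fix (reusing the hypothetical names from that example) is to cap the count at a single row:

SELECT COUNT(*)
INTO   l_rows_exist
FROM   some_table
WHERE  ... blah,blah,blah
AND    ROWNUM = 1;

IF (l_rows_exist = 1) THEN
   ... blah,blah,blah;
END IF;

The ROWNUM = 1 predicate lets Oracle stop as soon as the first matching row is found, giving the same early-termination behaviour as EXISTS.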

However, perhaps my ire at the use of COUNT(*) is somewhat misplaced. The Oracle optimiser is capable of detecting subqueries in the form of:

(SELECT COUNT(*) FROM ....) > 0

and automatically translating it into the equivalent EXISTS form:

EXISTS (SELECT 'X' FROM ....)

We can see this in action using our million row table:

SQL>SELECT 'X'
  2  FROM   dual
  3  WHERE  (SELECT COUNT(*)
  4          FROM   lots_of_rows
  5          WHERE  cat = 42) > 0
  6  /


Execution Plan
----------------------------------------------------------
Plan hash value: 4135382906

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |       |     4   (0)| 00:00:01 |
|*  1 |  FILTER            |              |       |       |            |          |
|   2 |   FAST DUAL        |              |     1 |       |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| LOTS_OF_ROWS |     2 |     6 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( EXISTS (SELECT 0 FROM "LOTS_OF_ROWS" "LOTS_OF_ROWS" WHERE
              "CAT"=42))
   3 - filter("CAT"=42)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads
          0  redo size
        536  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So, while we wrote the query using SELECT COUNT(*), the predicate information section in the above AUTOTRACE output shows that Oracle actually used EXISTS. Note that this rewrite only occurs if the test is “> 0”; it will not take place with a test of “>= 1”, which is what the original example above used.

I’m not sure in which version of Oracle this automatic rewrite appeared; it certainly works in 11.2 and above. It’s a nice feature but I would still argue that it is not good practice to rely on this kind of thing; explicitly write the query using EXISTS.

Oracle performance problem case study 2

I had just settled down for a bite to eat at lunch when two developers approached me with rather guilty expressions. Clearly something was up and, given that a significant deployment had taken place the night before for the part of the system they were responsible for, I immediately suspected a production performance problem.

This was confirmed when they told me that one of the production processes was taking far too long; 20 to 30 seconds instead of sub-second. Their suspicion was that they had some sort of query performance problem and they needed help to identify and resolve it.

They gave me the name of the service and the two PL/SQL routines that were taking too long. Each routine invoked multiple SQL statements. Once again I turned to Active Session History, which tends to be my first tool of choice for this type of problem. I ran a query similar to the following to see what long running statements had originated from the service over the past hour:

SELECT session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
,      COUNT(*) AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
AND    program = :service_name
GROUP  BY
       program
,      session_id
,      session_serial#
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
ORDER  BY
       COUNT(*) DESC

The query result showed a lot of 2 and 3 second queries with just two SQL_IDs listed. Taking these two SQL_IDs, I ran a query like:

SELECT *
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sql_id    = :sql_id
ORDER  BY
       sample_time

The event details showed that the queries were spending their time on CPU, so there was no contention or other wait problem that would point to some system-wide issue. More telling were the SQL_PLAN_OPERATION and SQL_PLAN_OPTIONS details, which showed that full table scans were being done.

With this information it was time to find out what the SQL statements were and where they resided, using the following query:

SELECT s.sql_text
,      (SELECT o.object_name FROM dba_objects o WHERE s.program_id = o.object_id) AS program_name
,      s.program_line#
FROM   v$sql s
WHERE  s.sql_id = :sql_id

and what their execution plans were:

SELECT *
FROM   TABLE(dbms_xplan.display_cursor(:sql_id,:sql_child_number))

As it turned out both SQL statements were relatively simple SELECTs along the lines of:

SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    (:param1 IS NULL OR t1.p1 = :param1)
AND    (:param2 IS NULL OR t2.p2 = :param2)

The tables had indexes on the ID, P1 and P2 columns. While the queries used two bind variables, the developers explained that only one would ever be specified; the other would be NULL. The execution plan Oracle had chosen, running under version 11.2.0.4, was to table scan both tables and hash join them.

The queries were rather nice examples of a developer trying to do too much in a single query, with performance suffering as a result. Each query was designed to satisfy two distinct scenarios. While those two scenarios result in very similar queries, combining them forced Oracle into a position where it had no choice but to scan the tables involved. The developer had hoped that Oracle would choose the relevant index, on P1 or P2, depending on which parameter was provided, i.e. that Oracle would separate out the two scenarios at run time.

The solution to this problem was to rewrite the queries separating out the different scenarios. One way was to use a UNION query:

SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t1.p1 = :param1
UNION
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t2.p2 = :param2

In the above, whichever parameter is NULL will cause its part of the query to return no rows, so the result will come only from the part where the parameter is specified. Best of all, Oracle is able to optimise each part to use the relevant index and so the performance problem is overcome. (UNION ALL would serve equally well here and avoids the duplicate-elimination step, since one branch always returns no rows.)

Once I pointed out the problem to the developers they elected to go with a slightly different solution though. The query was a SELECT INTO contained within a PL/SQL package so they simply split it using conditional logic:

IF (l_param1 IS NOT NULL) THEN
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t1.p1 = l_param1;
ELSE
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t2.p2 = l_param2;
END IF;   

The change was a simple rewrite to the two problematic routines. As the developers had heeded advice about avoiding session state the change was easily deployed into the production system without requiring an outage. Performance for both routines dropped from 2 to 3 seconds to around 1 millisecond.

I went back to lunch…

Oracle performance problem case study 1

I am regularly asked how to go about diagnosing and troubleshooting Oracle performance problems, which is then followed up with “how did you work that out?” after the problem has been resolved. These are the very same questions I asked when I started out working with Oracle. As a result I thought I’d start making posts outlining particular problems I encounter, the steps taken to diagnose the cause of the problem, and the solution. “Oracle Voodoo” is the term the developers use where I work: when first introduced to Oracle they lacked an understanding of how the database worked and so, jokingly, equated performance troubleshooting to magic.

The following problem started with a call from another developer about “a service that’s timing out”. The key bits of information provided were the name of the service, the business function it was trying to perform and a relatively small time period when the problem had occurred. The time-out for the service was 60 seconds. Since the business function was to simply retrieve some data from the database it was assumed that the problem lay inside Oracle as opposed to the services layer or network.

Given the above information I turned to Active Session History to help me out. The one second sampling used by ASH makes it a perfect source of data for this type of problem. The system is well instrumented, with the Client Identifier set for individual calls, which makes tracing calls from a service easy even when connection pooling is used. Of course, in order to use Active Session History you must be licensed for it (it is part of the Diagnostics Pack).
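
As an aside, the Client Identifier instrumentation referred to above is the sort of thing an application sets as it borrows a connection from the pool; a minimal sketch (the identifier value here is made up):

BEGIN
   -- tag the session so ASH samples and trace data can be tied back to the call
   dbms_session.set_identifier ('my_service:request_12345');
END;
/

The identifier is then cleared with DBMS_SESSION.CLEAR_IDENTIFIER when the connection is handed back to the pool.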

I ran a query like the following to determine if there were any long running SQL statements around the time of the reported problem:

SELECT session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
,      COUNT(*) AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
GROUP  BY
       program
,      session_id
,      session_serial#
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
ORDER  BY
       COUNT(*) DESC

As the above query groups by SQL_ID and SQL_EXEC_ID it reports the duration of each query invocation. As luck would have it the query did highlight sessions originating from the problem service that exceeded 60 seconds.

Taking one of the sessions I examined the ASH details for the long running query:

SELECT *
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    client_id = :client_id
AND    sql_id    = :sql_id
ORDER  BY
       sample_time

The EVENT details reported by the above query showed no contention, just normal IO operations. So we had a query that was running smoothly but waaaaay too long. My next step was to look at the SQL itself and where it originated from:

SELECT s.sql_text
,      (SELECT o.object_name FROM dba_objects o WHERE s.program_id = o.object_id) AS program_name
,      s.program_line#
FROM   v$sql s
WHERE  s.sql_id = :sql_id

All SQL statements in the application are contained within packages, so the above query identifies the package and the line number at which the statement resides. Normally in a situation like this I would also have looked at the query execution plan and possibly compared it to the plans stored in AWR to see if there had been a plan change. However, in this situation I was aware that a table referenced by the query had recently been partitioned. A quick check showed that the query had not been modified to correctly reference the partitioning key of the table and was therefore performing an index lookup against all the partitions of a local index. Problem identified.
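
For reference, had a plan comparison been needed, the plans captured in AWR for a given statement can be listed with a query along these lines (again, subject to the appropriate licensing):

SELECT *
FROM   TABLE(dbms_xplan.display_awr(:sql_id))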

However, the fix required a patch to the code, and the query resided in a package pivotal to the operation of the system. The system was mission critical to the business and could not be brought down. Thankfully, attention had been given to avoiding package state so, after fixing the query and running a quick test of the patched code in a test environment, a new version of the package was deployed straight into the live system without causing any trouble.

Problem solved. From taking the call to deploying the new code took just over 30 minutes, and half of that time was filling in the appropriate change control “paperwork”.

Qualified naming… again

Back in December 2012 I blogged about a range of problems that can occur if columns referenced in queries are not qualified with the name or alias of the table that they belong to. This post is mostly a rehash of just one of those scenarios as it seems to crop up time and time again…

Suppose you are presented with the following:

DELETE
FROM   emp
WHERE  dept_id IN (SELECT dept_id FROM dept WHERE dept_name = 'Dept_5')

You would look at the above statement and think that it will do something along the lines of “delete the entries in EMP that relate to a DEPT entry with the name ‘Dept_5’”. You might be right… but if the tables were defined as follows you would be wrong:

CREATE TABLE dept
   (id         NUMBER (4) NOT NULL   
   ,dept_name  VARCHAR2 (30) NOT NULL)
/

ALTER TABLE dept
   ADD CONSTRAINT dept_pk
   PRIMARY KEY (id)
/

CREATE TABLE emp
   (id         NUMBER (4)     NOT NULL
   ,first_name VARCHAR2 (30)  NOT NULL
   ,last_name  VARCHAR2 (30)  NOT NULL
   ,dept_id    NUMBER (4)     NOT NULL)
/
   
ALTER TABLE emp
   ADD CONSTRAINT  emp_pk
   PRIMARY KEY (id)
/

ALTER TABLE emp
   ADD CONSTRAINT emp_fk1
   FOREIGN KEY (dept_id)
   REFERENCES dept (id)
/

If we were to populate these tables with 10 DEPT entries and 100 EMP entries and run our delete:

SQL>INSERT INTO dept
  2  SELECT ROWNUM
  3  ,      'Dept_' || TO_CHAR(ROWNUM)
  4  FROM   dual
  5  CONNECT BY ROWNUM <= 10
  6  /

10 rows created.

SQL>INSERT INTO emp
  2  SELECT ROWNUM
  3  ,      'E_' || TO_CHAR(ROWNUM)
  4  ,      'Employee_' || TO_CHAR(ROWNUM)
  5  ,      MOD(ROWNUM,10)+1
  6  FROM   dual
  7  CONNECT BY ROWNUM <= 100
  8  /

100 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>DELETE
  2  FROM   emp
  3  WHERE  dept_id IN (SELECT dept_id
  4                     FROM   dept
  5                     WHERE  dept_name = 'Dept_5')
  6  /

100 rows deleted.

SQL>SELECT *
  2  FROM   emp
  3  /

no rows selected

Ummm… where’s my data gone?

Let’s rewrite the query to show what really happened:

DELETE
FROM   emp e
WHERE  e.dept_id IN (SELECT e.dept_id 
                     FROM   dept d
                     WHERE  d.dept_name = 'Dept_5')

Now we can see what went wrong. The DEPT table does not contain a column named DEPT_ID, so Oracle has resolved the reference to the column of that name in the EMP table. It can do this because identifiers in an outer query are visible inside a sub-query: when Oracle is unable to find an identifier local to the sub-query it expands its search to the outer query.

Our delete therefore removes any EMP entry whose DEPT_ID is not null, provided there exists an entry in DEPT with a DEPT_NAME of ‘Dept_5’. Since DEPT_ID is defined as NOT NULL and a DEPT entry named ‘Dept_5’ does exist, we end up deleting all entries from EMP. A rather nasty outcome.

The best way to prevent this type of a problem is to always qualify column names with the table name or table alias. If we had done that with our DELETE query we would have got the following:

SQL>DELETE
  2  FROM   emp e
  3  WHERE  e.dept_id IN (SELECT d.dept_id
  4                       FROM   dept d
  5                       WHERE  d.dept_name = 'Dept_5')
  6  /
WHERE  e.dept_id IN (SELECT d.dept_id
                            *
ERROR at line 3:
ORA-00904: "D"."DEPT_ID": invalid identifier

Our qualified reference would have resulted in an error. Even better, if the DELETE was included as part of a package then the package would have thrown a compilation error.

Echoing my advice from the previous post: always use qualified references in your queries. The effort is negligible so there’s no excuse not to.

Multi-versioning Read Consistency

Data concurrency is difficult. Oracle does such a good job at hiding the complexity that many developers simply don’t consider it. I’m going to focus this post on one of the aspects of Oracle’s multi-versioning read consistency that new developers really should know about.

You’re probably aware that Oracle does not allow the reading of dirty data, i.e. data that has been modified by another session but not yet committed. This allows a session to keep its changes private until it decides to commit or roll back those changes. On commit, all the changes become immediately available for other sessions to see. In light of this, have you ever considered what happens when a long running query accesses data that is being modified by other sessions? For example, session 1 runs a simple SELECT query that takes 1 minute to complete while, at the same time, session 2 is modifying and committing changes to the data. Unlike some other database systems, readers and writers don’t block each other in Oracle, so what will the query output for session 1 look like?

The answer is surprisingly simple: a query runs using a view of the data that was in effect as at the time the query started. Changes made to the data for the duration of the operation are ignored. In Oracle this is termed multi-versioning read consistency. The simplest and clearest way of illustrating this that I’ve seen is contained in Tom Kyte’s excellent book Expert Oracle Database Architecture:

SQL>CREATE TABLE t1
  2     (col_x   NUMBER(10) NOT NULL)
  3  /

Table created.

SQL>INSERT INTO t1
  2  SELECT ROWNUM
  3  FROM   dual
  4  CONNECT BY ROWNUM <= 5
  5  /

5 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>VARIABLE rc REFCURSOR

SQL>BEGIN
  2     OPEN :rc
  3     FOR
  4     SELECT *
  5     FROM   t1;
  6  END;
  7  /

PL/SQL procedure successfully completed.

SQL>DECLARE
  2     PRAGMA AUTONOMOUS_TRANSACTION;
  3  BEGIN
  4     DELETE
  5     FROM   t1;
  6     COMMIT;
  7  END;
  8  /

PL/SQL procedure successfully completed.

SQL>SELECT *
  2  FROM   t1
  3  /

no rows selected

SQL>PRINT rc

     COL_X
----------
         1
         2
         3
         4
         5

In the above we create a table and populate it with 5 rows. Next we open a cursor that simply selects the data from the table. In an autonomous transaction, which simulates another session, we delete the contents of the table and commit that change. We can see that the table has no rows in it by selecting from the table. However, when we finally get around to pulling the details from the cursor we opened before the deletion we see all the data. So, regardless of the changes that occurred to the table, the cursor displays the data as at the point in time it was opened.

Multi-versioning read consistency is also the reason why the following doesn’t result in an infinite loop (using the same T1 table, repopulated with 5 rows):

SQL>BEGIN
  2     FOR i IN (SELECT col_x FROM t1)
  3     LOOP
  4         dbms_output.put_line ('Inserting value ' || TO_CHAR(i.col_x));
  5        INSERT INTO t1 VALUES (i.col_x);
  6     END LOOP;
  7  END;
  8  /
Inserting value 1
Inserting value 2
Inserting value 3
Inserting value 4
Inserting value 5

PL/SQL procedure successfully completed.

Here we’re inserting into the same table that our cursor is reading from. If the data from the cursor was not fixed at the point when it was opened then the rows we’re inserting would be read by the cursor and our loop would never end… well, not until some resource on the server gives out.

One thing to note with multi-versioning read consistency is that it applies to a single SQL statement, not to an entire transaction of multiple statements. So, in the scenario of two SELECT statements, one reading parent entries and one child entries, the second SELECT may return child entries for which the first SELECT did not return the parent. This would occur if another session inserted and committed parent/child entries after the parent table SELECT was executed but before the child table SELECT commenced.
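
Where transaction-level consistency is genuinely required Oracle can provide it on request; a minimal sketch:

-- must be the first statement of the transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- or, for a multi-statement report that makes no changes:
SET TRANSACTION READ ONLY;

Both cause every query in the transaction to see the data as at the point the transaction began.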

It is also worth noting that read consistency does not extend to SQL statements contained in PL/SQL routines invoked by an SQL statement. To demonstrate this we’ll reuse our T1 table from above and also create another table, T2, with a single column and row. On insert into T1 we’ll fire an autonomous transaction trigger that increments the value in T2. We’ll also create a function that reads the T2 value and use that function to populate T1:

SQL>CREATE TABLE t2
  2     (col_y  NUMBER(10))
  3  /

Table created.

SQL>INSERT INTO t2
  2  VALUES (1)
  3  /

1 row created.

SQL>COMMIT
  2  /

Commit complete.

SQL>CREATE OR REPLACE TRIGGER trg_t1
  2     BEFORE INSERT ON t1
  3     FOR EACH ROW
  4  DECLARE
  5     PRAGMA AUTONOMOUS_TRANSACTION;
  6  BEGIN
  7     UPDATE t2
  8     SET col_y = col_y + 1;
  9     COMMIT;
 10  END trg_t1;
 11  /

Trigger created.

SQL>CREATE OR REPLACE FUNCTION fnc_1
  2  RETURN NUMBER
  3  AS
  4     l_return_value NUMBER;
  5  BEGIN
  6     SELECT col_y
  7     INTO   l_return_value
  8     FROM   t2;
  9     RETURN l_return_value;
 10  END fnc_1;
 11  /

Function created.

SQL>INSERT INTO t1
  2  SELECT fnc_1
  3  FROM   dual
  4  CONNECT BY ROWNUM <= 5
  5  /

5 rows created.

SQL>SELECT *
  2  FROM   t1
  3  /

     COL_X
----------
         1
         2
         3
         4
         5

SQL>SELECT *
  2  FROM   t2
  3  /

     COL_Y
----------
         6

Here we can see that the values in T1 reflect the value stored in T2 at the time each row was inserted into T1, i.e. the value returned by the SELECT in function FNC_1 was not fixed to what the data looked like when the overall INSERT was executed. Developers who like to call out to PL/SQL functions inside SQL are likely to be causing themselves concurrency headaches.