Oracle performance problem case study 1

I am regularly asked how to go about diagnosing and troubleshooting Oracle performance problems, usually followed by “how did you work that out?” once the problem has been resolved. These are the very same questions I asked when I started out working with Oracle, so I thought I’d start writing posts outlining particular problems I encounter, the steps taken to diagnose the cause and the solution. “Oracle Voodoo” is the term the developers use where I work: when first introduced to Oracle they had little understanding of how the database worked and so, jokingly, equated performance troubleshooting to magic.

The following problem started with a call from another developer about “a service that’s timing out”. The key bits of information provided were the name of the service, the business function it was trying to perform and a relatively small time period when the problem had occurred. The time-out for the service was 60 seconds. Since the business function was to simply retrieve some data from the database it was assumed that the problem lay inside Oracle as opposed to the services layer or network.

Given the above information I turned to Active Session History (ASH) to help me out. The one second sampling used by ASH made it a perfect source of data for this type of problem. The system is well instrumented, with the Client Identifier set for individual calls, which makes tracing calls from a service easy even when connection pooling is used. Of course, in order to use Active Session History you must be licensed for it (it falls under the Diagnostics Pack).
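
For anyone unfamiliar with this style of instrumentation: each service call simply tags its session on the way in, along the lines of the sketch below (the identifier format is whatever convention suits your application; this one is made up):

BEGIN
   -- tag the pooled session with the caller's identity before doing any work;
   -- the value used here is purely illustrative
   DBMS_SESSION.SET_IDENTIFIER('order_service:request_42');
END;
/

The value set surfaces in the CLIENT_ID column of the ASH views (and CLIENT_IDENTIFIER in V$SESSION), so individual service calls can be picked out regardless of which pooled connection happened to serve them.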

I ran a query like the following to determine if there were any long running SQL statements around the time of the reported problem:

SELECT session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
,      COUNT(*) AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
GROUP  BY
       session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
ORDER  BY
       COUNT(*) DESC

As the above query groups by SQL_ID and SQL_EXEC_ID, the sample count approximates the duration in seconds of each query invocation. As luck would have it the query did highlight sessions originating from the problem service that exceeded 60 seconds.
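
Since the service time-out was 60 seconds, the output can also be cut down to just the invocations that breached it. A minimal variation on the above query (the 60 second threshold is specific to this problem):

SELECT sql_id
,      sql_exec_id
,      client_id
,      COUNT(*) AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
GROUP  BY
       sql_id
,      sql_exec_id
,      client_id
HAVING COUNT(*) >= 60
ORDER  BY
       COUNT(*) DESC

If the window of interest has already aged out of V$ACTIVE_SESSION_HISTORY the same approach works against DBA_HIST_ACTIVE_SESS_HISTORY, remembering that its samples are kept roughly every ten seconds so each row represents about ten seconds rather than one.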

Taking one of the sessions I examined the ASH details for the long running query:

SELECT *
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    client_id = :client_id
AND    sql_id    = :sql_id
ORDER  BY
       sample_time
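
Eyeballing raw ASH rows works for a short execution, but a quick aggregation over the same samples gives the wait profile at a glance. A small sketch using the same bind variables (EVENT is null for samples on CPU, hence the NVL):

SELECT NVL(event, 'ON CPU') AS event
,      COUNT(*)             AS secs
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    client_id = :client_id
AND    sql_id    = :sql_id
GROUP  BY
       NVL(event, 'ON CPU')
ORDER  BY
       COUNT(*) DESC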

The EVENT details from both the raw samples and the summary showed no contention, just normal IO operations. So we had a query that was running smoothly but waaaaay too long. My next step was to look at the SQL itself and where it originated from:

SELECT s.sql_text
,      (SELECT o.object_name FROM dba_objects o WHERE s.program_id = o.object_id) AS program_name
,      s.program_line#
FROM   v$sql s
WHERE  sql_id = :sql_id

All SQL statements in the application are contained within packages, so the above query identifies the package and the line number where the SQL resides. In this situation I was aware that a table referenced by the query had recently been partitioned. A quick check showed that the query had not been modified to reference the partitioning key of the table and was therefore performing an index lookup against all the partitions of a local index. Problem identified. Normally I would also have looked at the query execution plan and compared it to the plans stored in AWR to see if there had been a plan change.
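
For reference, that plan check is only a couple of queries; a sketch (DBMS_XPLAN.DISPLAY_AWR, like ASH, requires the Diagnostics Pack licence):

SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(:sql_id, NULL, 'TYPICAL'))
/

SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_AWR(:sql_id))
/

In this case the plan would have given the game away too: the Pstart/Pstop columns for the local index access would have spanned every partition instead of being pruned to a single one.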

Identifying the problem was only half the battle though: the fix required a code change, and the query resided in a package pivotal to the operation of the system. The system was mission critical to the business and could not be brought down. Thankfully attention had been given to avoiding package state so, after fixing the query and running a quick test of the patched code in a test environment, a new version of the package was deployed straight into the live system without causing any trouble.

Problem solved. From taking the call to deploying the new code took just over 30 minutes, and half of that time was filling in the appropriate change control “paperwork”.

Qualified naming… again

Back in December 2012 I blogged about a range of problems that can occur if columns referenced in queries are not qualified with the name or alias of the table that they belong to. This post is mostly a rehash of just one of those scenarios as it seems to crop up time and time again…

Suppose you are presented with the following:

DELETE
FROM   emp
WHERE  dept_id IN (SELECT dept_id FROM dept WHERE dept_name = 'Dept_5')

You would look at the above statement and think that it will do something along the lines of “delete the entries in EMP where they relate to a DEPT entry with the name of ‘Dept_5’”. You might be right… but if the tables were defined as follows you would be wrong:

CREATE TABLE dept
   (id         NUMBER (4) NOT NULL   
   ,dept_name  VARCHAR2 (30) NOT NULL)
/

ALTER TABLE dept
   ADD CONSTRAINT dept_pk
   PRIMARY KEY (id)
/

CREATE TABLE emp
   (id         NUMBER (4)     NOT NULL
   ,first_name VARCHAR2 (30)  NOT NULL
   ,last_name  VARCHAR2 (30)  NOT NULL
   ,dept_id    NUMBER (4)     NOT NULL)
/
   
ALTER TABLE emp
   ADD CONSTRAINT  emp_pk
   PRIMARY KEY (id)
/

ALTER TABLE emp
   ADD CONSTRAINT emp_fk1
   FOREIGN KEY (dept_id)
   REFERENCES dept (id)
/

If we were to populate these tables with 10 DEPT entries and 100 EMP entries and run our delete:

SQL>INSERT INTO dept
  2  SELECT ROWNUM
  3  ,      'Dept_' || TO_CHAR(ROWNUM)
  4  FROM   dual
  5  CONNECT BY ROWNUM <= 10
  6  /

10 rows created.

SQL>INSERT INTO emp
  2  SELECT ROWNUM
  3  ,      'E_' || TO_CHAR(ROWNUM)
  4  ,      'Employee_' || TO_CHAR(ROWNUM)
  5  ,      MOD(ROWNUM,10)+1
  6  FROM   dual
  7  CONNECT BY ROWNUM <= 100
  8  /

100 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>DELETE
  2  FROM   emp
  3  WHERE  dept_id IN (SELECT dept_id
  4                     FROM   dept
  5                     WHERE  dept_name = 'Dept_5')
  6  /

100 rows deleted.

SQL>SELECT *
  2  FROM   emp
  3  /

no rows selected

Ummm… where’s my data gone?

Let’s rewrite the query to show what really happened:

DELETE
FROM   emp e
WHERE  e.dept_id IN (SELECT e.dept_id 
                     FROM   dept d
                     WHERE  d.dept_name = 'Dept_5')

Now we can see what went wrong. The DEPT table does not contain a column named DEPT_ID, so Oracle has resolved the reference to the column in the EMP table. It can do this because identifiers in an outer query are visible inside a sub-query: when Oracle cannot find an identifier local to the sub-query it expands its search to the outer query.

Our delete therefore removes every EMP entry whose DEPT_ID is not null, provided at least one DEPT entry named ‘Dept_5’ exists. Since DEPT_ID is defined as NOT NULL and a DEPT entry named ‘Dept_5’ does exist, we end up deleting every row in EMP. A rather nasty outcome.
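
You can see the scoping rule for yourself by running the sub-query on its own. Stripped of the outer query there is nothing for DEPT_ID to resolve against, so Oracle rejects it:

SQL>SELECT dept_id
  2  FROM   dept
  3  WHERE  dept_name = 'Dept_5'
  4  /
SELECT dept_id
       *
ERROR at line 1:
ORA-00904: "DEPT_ID": invalid identifier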

The best way to prevent this type of problem is to always qualify column names with the table name or a table alias. If we had done that with our DELETE query we would have got the following:

SQL>DELETE
  2  FROM   emp e
  3  WHERE  e.dept_id IN (SELECT d.dept_id
  4                       FROM   dept d
  5                       WHERE  d.dept_name = 'Dept_5')
  6  /
WHERE  e.dept_id IN (SELECT d.dept_id
                            *
ERROR at line 3:
ORA-00904: "D"."DEPT_ID": invalid identifier

Our qualified reference would have resulted in an error. Even better, if the DELETE had been embedded in a package then the package would have failed to compile, surfacing the mistake before it could ever run.
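
As a quick illustration (a standalone procedure stands in for the full package here, and the name is made up, but the principle is identical), with qualified references the bad DELETE never makes it past compilation:

CREATE OR REPLACE PROCEDURE remove_dept_emps (p_dept_name IN VARCHAR2)
AS
BEGIN
   -- fails compilation with PL/SQL: ORA-00904: "D"."DEPT_ID": invalid identifier
   DELETE
   FROM   emp e
   WHERE  e.dept_id IN (SELECT d.dept_id
                        FROM   dept d
                        WHERE  d.dept_name = p_dept_name);
END remove_dept_emps;
/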

Echoing my advice from the previous post: always use qualified references in your queries. The effort is negligible so there’s no excuse not to.