Oracle performance problem case study 2

I had just settled down for a bite to eat at lunch when two developers approached me with rather guilty expressions. Clearly something was up and, given that there had been a significant deployment the night before for the part of the system they were responsible for, I immediately suspected a production performance problem.

This was confirmed when they told me that one of the production processes was taking far too long: 20 to 30 seconds instead of sub-second. Their suspicion was that they had some sort of query performance problem, and they needed help to identify and resolve it.

They gave me the name of the service and the two PL/SQL routines that were taking too long. Each routine invoked multiple SQL statements. Once again I turned to Active Session History, which tends to be my first tool of choice for this type of problem. I ran a query similar to the following to see what long-running statements had originated from the service over the past hour:

SELECT session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
,      COUNT(*) AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
AND    program = :service_name
GROUP  BY
       program
,      session_id
,      session_serial#
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
ORDER  BY
       COUNT(*) DESC

The query results showed a lot of 2 and 3 second executions with just two SQL_IDs listed. Taking these two SQL_IDs, I ran a query like:

SELECT *
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sql_id    = :sql_id
ORDER  BY
       sample_time

The event details showed that the queries were spending their time on CPU, so there was no contention or other wait problem pointing to a system-wide issue. More telling, though, were the SQL_PLAN_OPERATION and SQL_PLAN_OPTIONS columns: they showed that full table scans were being done.
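As an aside, when hunting for this kind of detail it can be handy to aggregate the ASH samples by plan line rather than eyeball the raw rows. A sketch along these lines (SQL_PLAN_OPERATION and SQL_PLAN_OPTIONS are populated in the 11g ASH views) shows at a glance which plan operations the time is being spent on:

SELECT sql_plan_operation
,      sql_plan_options
,      event
,      COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sql_id = :sql_id
GROUP  BY
       sql_plan_operation
,      sql_plan_options
,      event
ORDER  BY
       COUNT(*) DESC

A TABLE ACCESS / FULL line dominating the sample count is a strong hint that the statement is scanning rather than using an index.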

With this information it was time to find out where the SQL statements resided, using the following query:

SELECT s.sql_text
,      (SELECT o.object_name FROM dba_objects o WHERE s.program_id = o.object_id) AS program_name
,      s.program_line#
FROM   v$sql s
WHERE  sql_id = :sql_id

and what their execution plans were:

SELECT *
FROM   TABLE(dbms_xplan.display_cursor(:sql_id,:sql_child_number))

As it turned out both SQL statements were relatively simple SELECTs along the lines of:

SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    (:param1 IS NULL OR t1.p1 = :param1)
AND    (:param2 IS NULL OR t2.p2 = :param2)

The tables had indexes on the ID, P1 and P2 columns. While two bind variables were used by the queries, the developers explained that only one would ever be specified; the other would be NULL. The execution plan Oracle had chosen, on version 11.2.0.4, was to full scan both tables and hash join them.

The queries were rather nice examples of a developer trying to do too much in a single query, with performance suffering as a result. Each query was designed to satisfy two distinct scenarios. While those two scenarios result in very similar queries, combining them forced Oracle into a position where it had no choice but to scan the tables involved. The developer had hoped that Oracle would choose the relevant index, on P1 or P2, depending on which parameter was provided, i.e. that Oracle would separate out the two scenarios at run time.

The solution to this problem was to rewrite the queries, separating out the different scenarios. One way was to use a UNION query:

SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t1.p1 = :param1
UNION
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t2.p2 = :param2

In the above, whichever parameter is NULL will result in no rows being returned from that part of the query, so the result will only come from the part where the parameter is specified. Best of all, Oracle is able to optimise each part of the query to use the relevant index, overcoming the performance problem.
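A small refinement worth noting: given the rule that only one parameter is ever specified, the two branches can never both return rows, so UNION ALL could be used in place of UNION to avoid the sort that UNION performs to eliminate duplicates. This sketch assumes the "only one parameter" rule always holds; if both could ever be supplied, UNION ALL might return duplicate rows:

SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t1.p1 = :param1
UNION ALL
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t2.p2 = :param2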

Once I pointed out the problem to the developers, they elected to go with a slightly different solution though. The query was a SELECT ... INTO query contained within a PL/SQL package, so they simply split it using conditional logic:

IF l_param1 IS NOT NULL THEN
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t1.p1 = l_param1;
ELSE
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t2.p2 = l_param2;
END IF;

The change was a simple rewrite of the two problematic routines. As the developers had heeded earlier advice about avoiding session state, the change was easily deployed into the production system without requiring an outage. Performance for both routines dropped from 2 to 3 seconds to around 1 millisecond.

I went back to lunch…
