I had just settled down for lunch when two developers approached me with rather guilty expressions. Clearly something was up and, given that there had been a significant deployment the night before for the part of the system they were responsible for, I immediately suspected a production performance problem.
This was confirmed when they told me that one of the production processes was taking far too long: 20 to 30 seconds instead of sub-second. Their suspicion was that they had some sort of query performance problem, and they needed help to identify and resolve it.
They gave me the name of the service and the two PL/SQL routines that were taking too long. Each routine invoked multiple SQL statements. Once again I turned to Active Session History, which tends to be my first tool of choice for this type of problem. I ran a query similar to the following to see what long-running statements had originated from the service over the past hour:
SELECT session_id
,      session_serial#
,      program
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
,      COUNT(*)         AS duration_secs
,      MIN(sample_time) AS start_exec
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    session_type = 'FOREGROUND'
AND    program = :service_name
GROUP  BY
       program
,      session_id
,      session_serial#
,      machine
,      module
,      action
,      client_id
,      sql_id
,      sql_exec_id
ORDER  BY COUNT(*) DESC
The query result showed a lot of 2 and 3 second queries with just two SQL_IDs listed. Taking these two SQL_IDs, I ran a query like:
SELECT *
FROM   v$active_session_history
WHERE  sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss')
AND    sql_id = :sql_id
ORDER  BY sample_time
The event details showed that the queries were spending their time on CPU, so there was no contention or other wait activity pointing to a system-wide issue. More telling, though, were the SQL_PLAN_OPERATION and SQL_PLAN_OPTIONS columns: full table scans were being done.
With this information it was time to find out where the SQL statements resided, using the following query:
SELECT s.sql_text
,      (SELECT o.object_name
        FROM   dba_objects o
        WHERE  o.object_id = s.program_id) AS program_name
,      s.program_line#
FROM   v$sql s
WHERE  s.sql_id = :sql_id
and what their execution plans were:
SELECT * FROM TABLE(dbms_xplan.display_cursor(:sql_id,:sql_child_number))
As it turned out both SQL statements were relatively simple SELECTs along the lines of:
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    (:param1 IS NULL OR t1.p1 = :param1)
AND    (:param2 IS NULL OR t2.p2 = :param2)
The tables had indexes on the ID, P1 and P2 columns. While two bind variables were used by the queries, the developers explained that only one would ever be specified and the other would be NULL. The execution plan Oracle had chosen was to full scan both tables and hash join them.
The queries were rather nice examples of a developer trying to do too much in a single query, with performance suffering as a result. Each query was designed to satisfy two distinct scenarios. While those two scenarios result in very similar queries, combining them forced Oracle into a position where it had no choice but to scan the tables involved. The developer had hoped that Oracle would choose the relevant index, on P1 or P2, depending on which parameter was provided, i.e. that Oracle would separate out the two scenarios at run time.
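The planning limitation is easy to reproduce outside Oracle. The sketch below uses SQLite via Python's sqlite3 module as a stand-in (an assumption for illustration only; the table and column names are hypothetical): the catch-all `(? IS NULL OR p1 = ?)` predicate defeats the index, while the single-scenario predicate uses it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER, p1 INTEGER, col_1 TEXT);
    CREATE INDEX t1_p1 ON t1 (p1);
""")

def plan(sql, params):
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable plan step
    rows = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(r[3] for r in rows)

# Catch-all predicate: the optimiser cannot assume the bind is non-NULL
# at plan time, so the index on p1 is unusable and the table is scanned.
print(plan("SELECT col_1 FROM t1 WHERE (? IS NULL OR p1 = ?)", (5, 5)))

# Single-scenario predicate: the index on p1 is used.
print(plan("SELECT col_1 FROM t1 WHERE p1 = ?", (5,)))
```

The first plan reports a scan of t1, the second an index search, mirroring the full scans Active Session History had revealed.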
The solution to this problem was to rewrite the queries separating out the different scenarios. One way was to use a UNION query:
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t1.p1 = :param1
UNION
SELECT t1.col_1
,      t1.col_2
,      t2.col_1
,      t2.col_2
FROM   t1
,      t2
WHERE  t1.id = t2.id
AND    t2.p2 = :param2
In the above, whichever parameter is NULL results in no rows being returned from that part of the query, so the result comes only from the part where the parameter is specified. Best of all, Oracle is able to optimise each part to use the relevant index, overcoming the performance problem.
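Again using SQLite via Python's sqlite3 module as a hedged stand-in (hypothetical tables and indexes matching the example), each branch of the UNION rewrite is planned independently, so both arms can get index access:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER, p1 INTEGER, col_1 TEXT, col_2 TEXT);
    CREATE TABLE t2 (id INTEGER, p2 INTEGER, col_1 TEXT, col_2 TEXT);
    CREATE INDEX t1_id ON t1 (id);  CREATE INDEX t1_p1 ON t1 (p1);
    CREATE INDEX t2_id ON t2 (id);  CREATE INDEX t2_p2 ON t2 (p2);
""")

union_sql = """
    SELECT t1.col_1, t1.col_2, t2.col_1, t2.col_2
    FROM   t1, t2
    WHERE  t1.id = t2.id AND t1.p1 = ?
    UNION
    SELECT t1.col_1, t1.col_2, t2.col_1, t2.col_2
    FROM   t1, t2
    WHERE  t1.id = t2.id AND t2.p2 = ?
"""
# Each branch now has a plain equality predicate, so the planner can
# drive each branch from the index on p1 or p2 respectively.
steps = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + union_sql, (1, None))]
for s in steps:
    print(s)
```

The plan steps show index searches in both branches rather than the full scans the combined predicate forced.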
Once I pointed out the problem, the developers elected to go with a slightly different solution though. The query was a SELECT INTO contained within a PL/SQL package, so they simply split it using conditional logic:
IF l_param1 IS NOT NULL THEN
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t1.p1 = l_param1;
ELSE
   SELECT t1.col_1
   ,      t1.col_2
   ,      t2.col_1
   ,      t2.col_2
   INTO   l_var_1
   ,      l_var_2
   ,      l_var_3
   ,      l_var_4
   FROM   t1
   ,      t2
   WHERE  t1.id = t2.id
   AND    t2.p2 = l_param2;
END IF;
The change was a simple rewrite to the two problematic routines. As the developers had heeded advice about avoiding session state the change was easily deployed into the production system without requiring an outage. Performance for both routines dropped from 2 to 3 seconds to around 1 millisecond.
I went back to lunch…