I am regularly asked how to go about diagnosing and troubleshooting Oracle performance problems, which is then followed up with “how did you work that out?” after the problem has been resolved. These are the very same questions I asked when I started out working with Oracle. As a result I thought I’d start making posts outlining particular problems I encounter, the steps taken to diagnose cause of the problem and the solution. “Oracle Voodoo” is the term the developers use where I work as, when first introduced to Oracle, they lacked understanding of how the database worked and so, jokingly, equated performance troubleshooting to magic.
The following problem started with a call from another developer about “a service that’s timing out”. The key bits of information provided were the name of the service, the business function it was trying to perform and a relatively small time period when the problem had occurred. The time-out for the service was 60 seconds. Since the business function was to simply retrieve some data from the database it was assumed that the problem lay inside Oracle as opposed to the services layer or network.
Given the above information I turned to Active Session History to help me out. The one second sampling used by ASH made it a perfect source of data for this type of problem. The system is well instrumented with the Client Identifier set for individual calls, which makes tracing calls by a service easy even when connection pooling is used. Of course, in order to use Active Session History you must be licensed for it.
I ran a query like the following to determine if there were any long running SQL statements around the time of the reported problem:
SELECT session_id , session_serial# , program , machine , module , action , client_id , sql_id , sql_exec_id , COUNT(*) AS duration_secs , MIN(sample_time) AS start_exec FROM v$active_session_history WHERE sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss') AND sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss') AND session_type = 'FOREGROUND' GROUP BY program , session_id , session_serial# , machine , module , action , client_id , sql_id , sql_exec_id ORDER BY COUNT(*) DESC
As the above query groups by SQL_ID and SQL_EXEC_ID it reports the duration of each query invocation. As luck would have it the query did highlight sessions originating from the problem service that exceeded 60 seconds.
Taking one of the sessions I examined the ASH details for the long running query:
SELECT * FROM v$active_session_history WHERE sample_time >= TO_TIMESTAMP(:start_date_time,'dd/mm/yyyy hh24:mi:ss') AND sample_time <= TO_TIMESTAMP(:end_date_time,'dd/mm/yyyy hh24:mi:ss') AND client_id = :client_id AND sql_id = :sql_id ORDER BY sample_time
The EVENT details reported by the above query showed no contention, just normal IO operations. So we had a query that was running smoothly but waaaaay too long. My next step was to look at the SQL itself and where it originated from:
SELECT s.sql_text , (SELECT o.object_name FROM dba_objects o WHERE s.program_id = o.object_id) AS program_name , s.program_line# FROM v$sql WHERE sql_id = :sql_id
All SQL statements in the application are contained within packages so the above query identifies which package and the line number where it resides. Normally in a situation like this I would also have looked at the query execution plan and possibly compared it to the plans stored in AWR to see if there had been a plan change. However in this situation I was aware that a table referenced by the query had recently been partitioned. A quick check showed that the query had not been modified to correctly reference the partitioning key of the table and was therefore performing an index lookup against all the partitions of a local index. Problem identified.
However a patch was required to the code and the query resided in a package that was pivotal to the operation of the system. The system was mission critical to the business and could not be brought down. Thankfully attention had been given to avoiding package state so, after fixing the query and running a quick test of the patched code in a test environment, a new version of the package was deployed straight into the live system without causing any trouble.
Problem solved. From taking the call to deploying the new code took just over 30 minutes, and half of that time was filling in the appropriate change control “paperwork”.