Data Outliers

One of the most common questions a DBA will receive from developers is “Why isn’t Oracle using an index for this query?”. Sometimes the underlying reason is complex but quite often it’s fairly simple. It tends to come down to either a misunderstanding by the developer as to when indexes should be used or a misunderstanding by Oracle as to what the data “looks like”. Let’s take a look at the latter situation…

When optimising a query Oracle works out a cost for the various operations it could perform and picks the operation with the lowest cost. The critical part from Oracle’s perspective is determining the notion of cost and central to the cost calculations are the various statistics Oracle stores regarding the data. Oracle could dynamically assess the data profile when optimising a query and in some situations will do exactly that (see dynamic sampling) but in the interests of efficiency Oracle generally relies on statistics that have previously been collected.

Rather than rambling on about Oracle’s query optimiser I’ll outline a very simple situation encountered recently whereby Oracle chose to do a full table scan on a rather large table when logic would indicate that an index access path would be more efficient. Oracle made its decision based on a simplified understanding of the data, which was easy to diagnose when presented with details from the proposed execution plan. The situation presented was:

  • The table in question contained “events”. There were 40 to 50 million events dating back 60 years or so.
  • The table was not partitioned and an index existed on the event date.
  • The query was a simple selection of the events from the last 7 days, i.e.
    SELECT * FROM table_x WHERE event_date >= SYSDATE - 7
  • According to the cardinality details in the execution plan, the query was going to return over 40% of the data from the table.

The last point was clearly where Oracle was making its cost calculation mistake, which resulted in its preference for a table scan. Unless the data is massively skewed, selecting the last 7 days of data from a table where the data goes back 60 years should return only a small fraction of the total data.

Histograms relating to the data distribution are what inform Oracle about the expected cardinalities of query predicates. Without a histogram Oracle resorts to more simplistic cardinality calculations based on its understanding of the minimum and maximum values, assuming a linear (uniform) distribution of the data between them.

Knowing the above, it was fairly simple to diagnose the problem as the lack of a histogram on the event date column of the table, combined with one or more events that were erroneously dated far into the future. Both were quickly confirmed and another problem was solved.

Let’s illustrate what went wrong with a simple test case. First we’ll create a table with an indexed date column, which we’ll populate with one entry for every day since 1950. After gathering statistics on the table, without a histogram on the date column, we’ll generate the execution plan for a query that selects the rows from the table for the past 7 days (I’m running this example on an 11.2 database in mid-July 2013 so don’t be surprised if you try things out and get slightly different cost/cardinality results):

SQL>CREATE TABLE date_tab
  2     (date_val    DATE     NOT NULL
  3     ,padding     CHAR(30) NOT NULL)
  4  /

Table created.

SQL>INSERT INTO date_tab
  2     (date_val, padding)
  3  SELECT TO_DATE('01/01/1950','dd/mm/yyyy') + (ROWNUM-1)
  4  ,      'X'
  5  FROM   dual
  6  CONNECT BY ROWNUM <= TRUNC(SYSDATE) - TO_DATE('01/01/1950','dd/mm/yyyy')
  7  /

23202 rows created.

SQL>COMMIT
  2  /

Commit complete.

SQL>EXEC dbms_stats.gather_table_stats ('','date_tab',method_opt=>'FOR ALL COLUMNS SIZE 1')

PL/SQL procedure successfully completed.

SQL>CREATE INDEX date_tab_i1
  2     ON date_tab (date_val)
  3  /

Index created.

SQL>EXPLAIN PLAN
  2  FOR
  3  SELECT *
  4  FROM   date_tab
  5  WHERE  date_val >= TRUNC(SYSDATE) - 7
  6  /

Explained.

SQL>SELECT *
  2  FROM   TABLE (dbms_xplan.display())
  3  /

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
Plan hash value: 1270297285

-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     7 |   273 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATE_TAB    |     7 |   273 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | DATE_TAB_I1 |     7 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DATE_VAL">=TRUNC(SYSDATE@!)-7)

14 rows selected.

Oracle has estimated that it will return 7 rows for our SELECT query, which given the way we populated the table is exactly right. Let’s see what happens when we add a single entry for the year 3000 into the table and gather stats again without a histogram:

SQL>INSERT INTO date_tab
  2  VALUES (TO_DATE('31/12/3000','dd/mm/yyyy'),'X')
  3  /

1 row created.

SQL>COMMIT
  2  /

Commit complete.

SQL>EXEC dbms_stats.gather_table_stats ('','date_tab',method_opt=>'FOR ALL COLUMNS SIZE 1')

PL/SQL procedure successfully completed.

SQL>EXPLAIN PLAN
  2  FOR
  3  SELECT *
  4  FROM   date_tab
  5  WHERE  date_val >= TRUNC(SYSDATE) - 7
  6  /

Explained.

SQL>SELECT *
  2  FROM   TABLE (dbms_xplan.display())
  3  /

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------
Plan hash value: 342710661

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          | 21802 |   830K|    41   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| DATE_TAB | 21802 |   830K|    41   (3)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("DATE_VAL">=TRUNC(SYSDATE@!)-7)

13 rows selected.

Oh. While our query will return just one extra row, Oracle has elected to do a full table scan. The proposed execution plan shows why: Oracle thinks the query will return 21,802 rows out of the total of 23,203! We can dig into the statistics to see how Oracle arrived at this conclusion. First off, we can get the number of rows Oracle thinks are in the table:

SQL>SELECT num_rows
  2  FROM   user_tables
  3  WHERE  table_name = 'DATE_TAB'
  4  /

  NUM_ROWS
----------
     23203

Next we can look at some statistics that describe the DATE_VAL column:

SQL>SELECT num_distinct
  2  ,      num_nulls
  3  ,      high_value
  4  ,      low_value
  5  ,      histogram
  6  FROM   user_tab_col_statistics
  7  WHERE  table_name = 'DATE_TAB'
  8  AND    column_name = 'DATE_VAL'
  9  /

NUM_DISTINCT  NUM_NULLS HIGH_VALUE     LOW_VALUE      HISTOGRAM
------------ ---------- -------------- -------------- ----------
       23154          0 82640C1F010101 77960101010101 NONE

Oracle thinks that there are 23,154 unique date values spread across the 23,203 rows that it thinks are in the table. Oracle also believes there are 0 NULL values, which is obvious given that the column is defined as NOT NULL. The HISTOGRAM setting tells us that Oracle has no understanding of how the data is distributed between the high and low column values. This is the critical flaw in how Oracle arrived at its decision to do a full table scan. The HIGH_VALUE and LOW_VALUE entries are hexadecimal representations of the maximum and minimum date values in the table. We can confirm this by using the DUMP function on the minimum and maximum table values themselves:

SQL>SELECT DUMP(MIN(date_val)) AS min_date_dump
  2  ,      DUMP(MAX(date_val)) AS max_date_dump
  3  FROM   date_tab
  4  /

MIN_DATE_DUMP                       MAX_DATE_DUMP
----------------------------------- -----------------------------------
Typ=12 Len=7: 119,150,1,1,1,1,1     Typ=12 Len=7: 130,100,12,31,1,1,1

Convert the comma-separated list of numbers to hexadecimal and you’ll see that they match the LOW_VALUE and HIGH_VALUE values from the USER_TAB_COL_STATISTICS query.
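For the curious, the 7-byte internal DATE values can be decoded by hand rather than via DUMP. The following Python sketch (not part of the original test case) applies the documented encoding: century and year are stored in excess-100 notation, and the time components in excess-1:

```python
def decode_oracle_date(hex_value):
    """Decode the hex HIGH_VALUE/LOW_VALUE of a DATE column into a tuple."""
    b = bytes.fromhex(hex_value)
    year = (b[0] - 100) * 100 + (b[1] - 100)   # century and year, excess-100
    return (year, b[2], b[3], b[4] - 1, b[5] - 1, b[6] - 1)

print(decode_oracle_date('77960101010101'))  # LOW_VALUE  -> (1950, 1, 1, 0, 0, 0)
print(decode_oracle_date('82640C1F010101'))  # HIGH_VALUE -> (3000, 12, 31, 0, 0, 0)
```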

So, Oracle’s understanding of our data is that we’ve got 23,203 rows with dates spread evenly from 1 January 1950 to 31 December 3000. When we run a query that asks for all data from July 2013 onwards, it calculates that to be most of the table.
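As a rough cross-check, we can reproduce that cardinality estimate ourselves. The sketch below assumes the commonly documented no-histogram estimate for a `col >= :c` predicate, i.e. (high − c)/(high − low) with a 1/num_distinct adjustment for the closed bound, and assumes a run date of 15 July 2013 (the exact mid-July date shifts the result only marginally):

```python
from datetime import date

def jdn(d):
    """Julian day number, as produced by Oracle's TO_CHAR(date_col, 'J')."""
    return d.toordinal() + 1721425

low  = jdn(date(1950, 1, 1))     # LOW_VALUE  -> 2433283
high = jdn(date(3000, 12, 31))   # HIGH_VALUE -> 2817152
pred = jdn(date(2013, 7, 8))     # TRUNC(SYSDATE) - 7, assuming a 15-Jul-2013 run

num_rows, num_distinct = 23203, 23154

# linear interpolation between the min and max, plus the closed-bound term
selectivity = (high - pred) / (high - low) + 1 / num_distinct
print(round(selectivity * num_rows))  # close to the plan's estimate of 21,802
```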

What can we do to help Oracle make better execution plan decisions? The obvious answer here is to provide it with some understanding of how the dates in the table are distributed, i.e. a histogram. Here’s what happens when we create a histogram on the date value and get Oracle to re-evaluate the execution plan:

SQL>EXEC dbms_stats.gather_table_stats ('','date_tab',method_opt=>'FOR ALL INDEXED COLUMNS SIZE AUTO')

PL/SQL procedure successfully completed.

SQL>EXPLAIN PLAN
  2  FOR
  3  SELECT *
  4  FROM   date_tab
  5  WHERE  date_val >= TRUNC(SYSDATE) - 7
  6  /

Explained.

SQL>SELECT *
  2  FROM   TABLE (dbms_xplan.display())
  3  /

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------
Plan hash value: 1270297285

-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |    91 |  3549 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATE_TAB    |    91 |  3549 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | DATE_TAB_I1 |    91 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DATE_VAL">=TRUNC(SYSDATE@!)-7)

14 rows selected.

Oracle is now electing to do the index scan that we would hope for. The execution plan reports that it’s expecting 91 rows back, which we know is inaccurate. We could stop the analysis here since Oracle is now choosing the plan we want. However, let’s see if we can understand why it now thinks the query will return 91 rows.

Rerunning our query against USER_TAB_COL_STATISTICS, we see that Oracle is now reporting a height balanced histogram on the DATE_VAL column:

SQL>SELECT num_distinct
  2  ,      num_nulls
  3  ,      high_value
  4  ,      low_value
  5  ,      histogram
  6  FROM   user_tab_col_statistics
  7  WHERE  table_name = 'DATE_TAB'
  8  AND    column_name = 'DATE_VAL'
  9  /

NUM_DISTINCT  NUM_NULLS HIGH_VALUE     LOW_VALUE      HISTOGRAM
------------ ---------- -------------- -------------- ---------------
       23154          0 82640C1F010101 77960101010101 HEIGHT BALANCED

We can see the histogram buckets by querying USER_TAB_COL_HISTOGRAMS:

SQL>SELECT endpoint_number
  2  ,      endpoint_value
  3  ,      TO_DATE(endpoint_value,'J') AS endpoint_date
  4  FROM   user_tab_histograms
  5  WHERE  table_name = 'DATE_TAB'
  6  AND    column_name = 'DATE_VAL'
  7  ORDER  BY
  8         endpoint_number
  9  /

  ENDPOINT_NUMBER    ENDPOINT_VALUE ENDPOINT_DATE
----------------- ----------------- ----------------
                0           2433286 04-Jan-1950
                1           2433371 30-Mar-1950
                2           2433488 25-Jul-1950
                3           2433604 18-Nov-1950
-- snip lots of rows unnecessary for a blog post --
              250           2456159 19-Aug-2012
              251           2456216 15-Oct-2012
              252           2456304 11-Jan-2013
              253           2456387 04-Apr-2013
              254           2817152 31-Dec-3000

255 rows selected.

Given that this query was run in July 2013, Oracle can deduce from the histogram that our original query will eliminate approximately 254/255 of the rows in the table. If we take the number of distinct values Oracle thinks DATE_VAL contains, 23,154, and multiply it by 1/255 we get 90.8, which rounded to the nearest whole number gives us the 91 that Oracle reports in the explain plan output.
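That arithmetic is easy to sanity-check outside the database. A minimal sketch, where the 255 comes from the number of histogram endpoints shown above:

```python
num_distinct = 23154   # NUM_DISTINCT for DATE_VAL
endpoints = 255        # histogram endpoints 0..254 listed above

# roughly one endpoint's worth of distinct values satisfies the predicate
estimate = round(num_distinct / endpoints)
print(estimate)  # -> 91, matching the explain plan output
```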

In summary, when generating an execution plan Oracle makes a series of mathematical calculations to determine the relative cost of the various operations it could potentially perform and selects the option that it believes to be the lowest cost. These calculations are based on an understanding of the characteristics of the data as specified by various statistics. If the statistics are wrong or simply missing then Oracle stands a good chance of making incorrect cost calculations and arriving at an inefficient execution plan. For further reading I would strongly recommend the book Cost Based Oracle Fundamentals by Jonathan Lewis.

So, next time you encounter a query that does not use the index you would expect, take a bit of time to look into why Oracle chose the execution plan it did. It might be that Oracle needs some extra information made available to it by way of statistics. If nothing else, you will gain a better understanding of how the Oracle query optimiser operates, which can never be a bad thing for an Oracle developer.

Lastly, I hope this long-winded example has helped show that just one single bad data entry has the potential to cause havoc for your database!


Oracle lies!

“Everybody lies” was the common phrase used by Dr Gregory House in the TV show House MD. It seems Oracle is no different…

I was recently asked to help a developer tune a change they had made to a query. The query was relatively lengthy given that it was for an OLTP system, hence the need to ensure it ran as efficiently as possible.

Running the original query through Autotrace generated approximately 1,700 logical IO operations, which is one of the critical metrics I use for quickly judging efficiency. Autotrace on the modified query initially reported twice the logical IO, which was quickly identified as a problem with the way Oracle was ordering the tables. A simple application of the LEADING hint brought the logical IO down… but only to around 1,800, which was 100 higher than the original query.

For the next hour I struggled to work out why the modified query was reporting a higher logical IO figure than the original query. Yes, the query had been modified, but the change was the replacement of a subquery that returned a single value with a bind variable holding that value. My expectation was that the modified query should require 3 fewer IO operations than the original (just to note, the change to the query really wasn’t worth it from a query tuning perspective but was primarily being done for other reasons).

After applying hints that forced the modified query to behave as closely as possible to the original query, the logical IO was still 100 greater. Checking the result-sets showed that both queries did return the same results, which is another thing to confirm when performing any sort of query tuning.

My next step was to examine the execution of both queries using SQL monitor and this was when I discovered that Oracle had been lying to me all along. At the core of the query were about half a dozen tables. The execution plan reported by Autotrace joined them in a particular order and the hints I had applied to the modified query mimicked this order. However, the SQL plan details as revealed by SQL monitor showed that Oracle joined the tables in a different order, choosing to lead with a different table.

I had fallen into the one trap with the Autotrace execution plan that I had noted in my earlier blog post:

The main point to note is that the execution plan is simply the proposed plan and in certain situations the database may choose to execute the query differently. For example, the execution plan may be altered as a result of bind variable peeking.

Another way of uncovering the difference in the predicted and actual plans would have been to extract the real plan from the SGA, using dbms_xplan.display_cursor.

It’s not common for the execution plan reported by Explain Plan or Autotrace to differ from the one that Oracle actually uses when executing the query, but it is possible. Perhaps I should have been open to this possibility earlier and saved myself a bit of angst. Simply hinting the modified query with the different table order resolved the IO difference. I went home that day happy but mentally cursing Oracle for lying to me and leading me astray in the first place.