One of our developers called me recently and said that they had an interesting problem on their hands.
The essence of the problem was that a schema upgrade script misfired and did not create a very important index on one of the largest tables. As a result, critical queries that were supposed to hit that index were doing full table scans and dragging on for minutes.
The weird thing was that the developer, realizing her mistake, connected to the schema and created the index manually. Yet even after that was done, the target queries still did full table scans, completely ignoring the index.
I actually like these types of problems. To the uninitiated, they seem puzzling and illogical, but in truth, the perennial “Why is my index being ignored by ORACLE“ question is probably as old as ORACLE itself. Almost always, there are obvious (and well known) reasons, such as a function being (mis)applied to the indexed column or incorrect statistics. And to top it off, it usually does not take much time or effort to fix (or, at least, explain) the problem, and that makes the DBA look like a magician in the developer’s eyes … and establishes a great foundation for healthy DBA/developer relations 🙂
This particular problem, however, turned out to be a real puzzle.
Problem Investigation …
Here is what we discovered after a little bit of analysis:
- The query was actually a very simple lookup SQL searching for a small range of very close index values. A bit simplified, the query was as simple as this: SELECT … FROM target_table WHERE index_column BETWEEN :v1 AND :v2
- The query had no hints (Stupid things like that are usually the first thing to check 🙂 )
- The target index was unique (Discard: skewed data as a potential cause of the index not being used)
- Table data was ordered in roughly the same way as the target index (Discard: index clustering factor as a potential cause)
- Column being searched for was the leftmost column in the index (Discard another well known potential cause)
- The table was fairly large (a few hundred GBs) and much larger than the index. Both table and index statistics had just been collected and represented the sizes adequately (Discard: bad statistics as a potential cause)
- Bind variable data type was the same as column data type (Discard: data type mismatch as a cause)
- Index was valid and visible (Discard: the new 11g potential cause)
- And finally, the optimizer environment was normal (database default, actually) and database parameters were NOT set to heavily favor indexes, etc.
- Oh, and, yes, there were no active outlines
In other words, there was nothing at all preventing ORACLE from using the index, and yet the query was still stubbornly choosing a full table scan. The even weirder thing was that if the query was modified ever so slightly, such as when a meaningless hint was added, it used the index (which confirmed once again that the index was good after all).
The first hint of what was wrong came after analyzing DBMS_XPLAN output (yes, this new package rocks). In particular, its Note section contained the following message:
- SQL plan baseline "SQL_PLAN_5fsqeed348ds451ecae3d" used for this statement
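For reference, this is the kind of output where that Note section shows up. A minimal sketch, assuming you have the statement’s sql_id from V$SQL (the sql_id below is a placeholder, not from our actual system):

```sql
-- Show the actual runtime plan for a cursor in the shared pool;
-- the Note section at the bottom reports any SQL plan baseline used.
SELECT *
  FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('4dsqeed348ds4', NULL, 'TYPICAL'));
```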
Digging a bit deeper, we found that there was an active SPM baseline for our SQL with two plans: older (and accepted) plan for a full table scan and newer (and, based on the cost, better) but not accepted plan that used the index.
```sql
SELECT sql_handle, plan_name, optimizer_cost, accepted
  FROM dba_sql_plan_baselines
 ORDER BY signature, optimizer_cost;

SQL_HANDLE                PLAN_NAME                      OPTIMIZER_COST ACCEPTED
------------------------- ------------------------------ -------------- --------
SYS_SQL_369ec5af2a139ed8  SQL_PLAN_3d7q5pwp177qs349583b6              3 NO
SYS_SQL_369ec5af2a139ed8  SQL_PLAN_3d7q5pwp177qs94ecae5c            512 YES
```
But where exactly did this come from? We just ported our software to 11gR2 and I was sure we did not create any baselines (at least, not manually) …
Finally, it became clear – one of the DBAs had set the new 11g optimizer_capture_sql_plan_baselines parameter to TRUE, which caused all repeated SQLs to generate (and save) SPM baselines. And once an SPM baseline was generated (and since the parameter optimizer_use_sql_plan_baselines was also TRUE by default), our SQL stuck to it, refusing to accept any new (and, in this case, much better) execution plans.
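If you find yourself in the same spot, something along these lines (a sketch; adjust for your environment and licensing) shows the two parameters involved and turns automatic capture back off:

```sql
-- Check the current settings (in 11g, capture defaults to FALSE
-- and use defaults to TRUE).
SHOW PARAMETER optimizer_capture_sql_plan_baselines
SHOW PARAMETER optimizer_use_sql_plan_baselines

-- Stop capturing new baselines automatically. Note that existing
-- baselines are still enforced as long as
-- optimizer_use_sql_plan_baselines remains TRUE.
ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = FALSE;
```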
What the … or Why should we even need SPM ?
This is not a trivial question. Think about it slowly:
- A database generates an execution plan for a SQL statement. Let’s say its cost is: 25,000
- Some time after that the statement is re-parsed and (for whatever reason) the database generates a better execution plan. Let’s say its cost is: 5
- The optimizer knows that the second plan is better, as 25,000 >> 5
- And yet, it sticks with the original sucker refusing to change
How ridiculous is that ? So why would ORACLE (not to mention, anyone in their right mind) even consider it ?
Well, there are a couple of reasons why SPM does what it does and they are best explained by looking at a couple of underlying assumptions that many people make about ORACLE.
Let’s start with something that everyone can, hopefully, agree on.
The goal of ORACLE is to execute SQL statements in the fastest and most efficient way possible
This sounds like a good goal but in order for it to be practical we need to make one underlying assumption here:
ORACLE database knows the best way to execute SQL statements BEFORE THEY ARE EXECUTED
In other words, we are expecting the database to predict future performance. But is it realistic ?
Most people would say: of course, it is realistic! What you are talking about here is done by the ORACLE SQL Optimizer, which can not only choose the best method to execute any particular SQL statement (among many potential methods) but can also assign this method a very precise and specific number (a.k.a. Optimizer Cost) that serves as a measure of its future performance. After all, they do not call this database “ORACLE” just for the looks, you know … 😉
To put it directly:
For all intents and purposes, Optimizer Cost is the measure of future SQL performance
Therefore, when comparing execution methods for the same SQL statement, we can easily see that the method (or, in optimizer lingo: plan) with cost=5,000 will perform better than the one with cost=8,000, and, of course, we will choose the one with the smaller cost.
However, just to keep things honest, let’s modify our second assumption slightly:
Optimizer Cost is the measure of (what Optimizer thinks) future SQL performance (would be)
The THINKS part here is important. Remember that, as with any predictions, SQL optimizer does not really know what the future performance will be. The cost is calculated based on some mathematical modeling and some abstracted (read: simplified) data descriptions that we know as statistics.
So, let’s take the optimizer for what it really is – a mathematical machine – and agree that a machine’s decision making (I’m thinking of my Roomba here) is sometimes, eh … questionable …
Keeping that in mind, the “best” execution plan that optimizer comes up with, might indeed result in the best possible SQL performance (and it really should in most cases)… OR it might result in a horrible performance because critical statistics were stale … or, perhaps, because the data distribution was so complex that the optimizer simply could not cut it …
The bottom line here is that:
Optimizer cost is NOT a 100% reliable measure of future SQL performance
And we are faced with a dilemma here:
How to guarantee that SQLs will be executed in a most efficient way knowing that we cannot predict their future performance with certainty ?
Well, as most politicians would attest to, sometimes the best way to deal with the problem is to “redefine” it to something that can actually be solved …
We will do the same here and try a different approach.
Instead of predicting what SQL performance would be (which is inherently unreliable), why don’t we simply run the SQL in question and measure its actual performance? In this case, we are, in a way, moving from guessing to knowing, and this knowledge, by definition, will be more reliable.
And this is, very simply, what new ORACLE 11g feature SQL Plan Management (or SPM) is designed to do.
The goal of SPM is to only allow SQL plans that have been PROVEN WORTHY by REAL execution statistics
The end result is that, with SPM in place, for an execution plan to be considered (or ACCEPTED, in SPM lingo), it has to actually be executed and proven to PERFORM better than the other known plans for the statement.
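In practice, that “proving” step is driven through the DBMS_SPM package. A minimal sketch that asks ORACLE to test-execute and, if verified better, accept the non-accepted plan (the sql_handle here is the one from our earlier baseline output):

```sql
-- Ask SPM to evolve the baseline: non-accepted plans for this SQL
-- handle are executed, compared against the accepted plan, and
-- accepted if they are verified to perform better.
SET SERVEROUTPUT ON
DECLARE
  report CLOB;
BEGIN
  report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
              sql_handle => 'SYS_SQL_369ec5af2a139ed8',
              verify     => 'YES',   -- actually execute and compare plans
              commit     => 'YES');  -- accept plans that pass verification
  DBMS_OUTPUT.PUT_LINE(report);      -- the returned CLOB is a text report
END;
/
```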
In other words, while the SQL optimizer is still free to philosophize about what future SQL performance will look like, SPM, in a way, requires the optimizer to “put its money where its mouth is”:
| | Component | Role | Responsibility |
|---|---|---|---|
| Without SPM | Optimizer | Sole decision maker | Optimizer decides how SQL will be executed |
| With SPM | Optimizer | An advisor only | Optimizer finds new execution plans that may potentially be better |
| | SPM | A censor / final decision maker | SPM only allows SQL plans that have been proven to perform better for real |
SPM changes the overall ORACLE “SQL execution” behavior from “Hey, I found a new cool way to run this statement, let’s try it now on a production system …” to more in line with, “If it works well – do not screw with it! Unless you are really really sure it is going to help …”.
Is SPM a foolproof solution? Not really (no solution is) … SPM can still badly misfire and choose a horrible execution plan over a good one (as we saw in the beginning of this post). Plus, it means additional work for DBAs.
However, the really good thing about SPM is that it makes the ORACLE database more logical and predictable with its “Thou shalt do no harm” philosophy. It does so by giving a lot more control over SQL execution back to the operator.
And this will ultimately lead to fewer nasty surprises in application performance down the road (well, if application performance started to suck AFTER you explicitly changed something, then it is not really a surprise, is it? At least, not a nasty one … 🙂 )
I guess we can see now that the new ORACLE SPM feature is useful as it makes database performance more stable. But how exactly does it work ? Can we control it ? … Stay tuned as we will discuss all of it in Part 2.