Let’s say that you have a very important SQL statement in your system.
How do you know if it is performing well? Is it sometimes running slow, giving users a bad experience? If so, when? How many users? And just how bad does the user experience get when it does get “bad”?
In this series of articles I’m going to talk about tracking and measuring the performance of individual SQL statements. We’ll look at the standard metrics that ORACLE provides for this and introduce some new (and hopefully better) metrics that we can use.
The goal here is to be able to answer the questions above with as much precision as possible.
Quantifying user experience
Performance measurement is not useful unless it is specific.
With that in mind, let me start with a concrete example to illustrate the problem better. Let’s say we need to track performance for the following UPSERT statement:
MERGE INTO orders_table USING dual
ON (dual.dummy IS NOT NULL AND id = :1 AND p_id = :2
AND order_id = :3 AND relevance = :4 AND …
We know (from previous measurements) that this statement typically runs in under 100 milliseconds, and we want to be alerted if it takes more than 200 milliseconds to run (this is how we define a bad experience).
If we have a way to measure this statement’s performance (a.k.a. “latency”) periodically and plot it on screen, we (and, more importantly, our automatic tools) should easily achieve that goal.
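As a sketch of what such a periodic measurement could look like, the query below derives the average elapsed time per execution from V$SQLSTATS (where ELAPSED_TIME is reported in microseconds). The sql_id here is the one from the AWR report shown later; in practice you would sample this query on a schedule and compare deltas between samples:

```sql
-- Sketch: current average elapsed time per execution for one statement.
-- ELAPSED_TIME in V$SQLSTATS is cumulative and in microseconds.
SELECT sql_id,
       executions,
       ROUND(elapsed_time / NULLIF(executions, 0) / 1000, 1) AS avg_elapsed_ms
FROM   v$sqlstats
WHERE  sql_id = 'fskp2vz7qrza2';
```

Note that, just like AWR, this gives only an average per execution – which, as we are about to see, is exactly the problem.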
Measuring SQL performance – a straightforward (but naive) way
So how exactly do we measure SQL latency?
At first blush, this doesn’t sound like a very difficult problem to solve. ORACLE is a very well instrumented database and, of course, it already collects SQL latency metrics (“elapsed time”) and exposes them through the V$SQL and V$SQLSTATS views. There are also historical SQL performance records, coming from AWR, available in the DBA_HIST_SQLSTAT view.
Say that we look at one of the AWR reports (which queries DBA_HIST_SQLSTAT) for our UPSERT query.
Time (s) Executions per Exec (s) %Total %CPU %IO SQL Id
---------------- -------------- ------------- ------ ------ ------ -------------
635.5 10,090 0.1 31.5 16.5 77.6 fskp2vz7qrza2
merge into orders_table using dual on (dual.dummy is not null and id = :1
and p_id = :2 and order_id = :3 and relevance = :4 and …
The average elapsed time for this query (in this AWR sample) is 0.1 seconds, or 100 milliseconds. Our “bad experience” target is 200 milliseconds, and we are way below that. So, at the moment, our SQL must be performing well, right?
Not so fast. The devil, of course, is always in the details.
AWR only records the total elapsed time for the period, which means that 100 milliseconds is an “average” (a.k.a. “mean”) elapsed time across all 10,090 executions.
Ask yourself this:
- Is it possible that at least some of those 10,090 executions ran longer than 200 milliseconds? How about longer than 200 seconds?
- If the answer is YES, how many executions are we talking about? Could the share of such “slow” statements be non-trivial, say 10%? 20%? What about 40%?
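The average alone cannot answer these questions. A quick, purely hypothetical R sketch (all numbers invented for illustration) shows two sets of 10,000 elapsed times with the same 100-millisecond average but wildly different shares of slow executions:

```r
set.seed(1)
# Tight around the mean vs. a 10% heavy "slow" tail --
# both constructed to average roughly 100 ms
tight <- rnorm(10000, mean = 100, sd = 10)
heavy <- c(rexp(9000, rate = 1/50), rnorm(1000, mean = 550, sd = 100))

mean(tight)          # ~100 ms
mean(heavy)          # ~100 ms as well
mean(tight > 200)    # share over 200 ms: essentially zero
mean(heavy > 200)    # share over 200 ms: roughly 10%
```

Same “average”, completely different user experience – which is the whole point of the questions above.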
Deep diving into “averages” – are they really that bad?
Because of these considerations, the “average” metric often gets a bad rap. Its usefulness is often sarcastically illustrated by this old joke:
It’s like measuring an “average” temperature of all patients in a hospital …
clearly a wrong thing to do!
But is the “average” metric really that bad?
The idea of a summarized metric (such as the “average”) is to choose one point out of many that best characterizes the data distribution (so that we do not need to store and process all of the raw data for our performance analysis).
Let’s build a model for our example. Let’s assume that elapsed times are random (a very reasonable assumption) and model the elapsed time of each individual SQL execution. If we plot our times on the screen, we are likely to see a point cloud that looks like this:
If you were to choose a single “elapsed time” point from this cloud – what would it be?
Believe it or not, nature has an answer for that!
Notice that the points in our plot are not distributed haphazardly – the point cloud gets denser towards the center.
This is not a fluke – random processes quite often produce value distributions that are not completely random. Instead, they tend to form patterns and congregate their values around certain objects (points, lines, etc.) that mathematicians call attractors.
One of the most common distribution patterns is to have just one attractor, right in the center. In fact, it is so typical that statisticians call this pattern a normal distribution.
I will transform this pattern slightly to show a more familiar shape by plotting the frequencies (or probabilities) of the different values in the cloud:
And lo and behold: the famous bell curve of a normal distribution.
Notice the red line in the center, at 100 milliseconds? It’s the attractor that I’ve been talking about: the most common value in our distribution and, as it turns out, it is also the “average” value.
In other words, in our model (which, Wikipedia claims, covers a great many natural random processes), the “average” metric turns out to be very meaningful and certainly a good summary point to describe the data.
Average = “most typical value” (when normal)
Can we say anything about how many queries are crossing the 200-millisecond threshold?
It turns out that we can, pretty easily, if we know (or can estimate) one additional metric: the standard deviation (how “wide” the bell shape is). Assuming stdev = 40 milliseconds (40% of the “average” value – a pretty generous number), the probability of crossing the 200-millisecond threshold turns out to be about 0.6%.
So, ~60 (10,000 * 0.006) of our queries are expected to result in a bad experience for our users.
This probability can be easily visualized using the normal curve:
R code to reproduce this example:

library(ggplot2)

# Generate normally distributed elapsed times
d <- data.frame(ETIME = rnorm(n=10000, mean=100, sd=40))
d$N <- 1:nrow(d)
d$PROB <- dnorm(d$ETIME, mean=100, sd=40)

# Point cloud
ggplot(d, aes(N, ETIME)) + geom_point(color="grey") +
  theme_minimal() + theme(axis.text.x=element_blank()) +
  xlab("Time") + ylab("Elapsed time (milliseconds)")

# Bell curve, with the mean marked in red and the > 200 ms tail shaded red
ggplot(d, aes(ETIME, PROB)) + geom_line() +
  theme_minimal() + scale_y_continuous(limits = c(0, max(d$PROB))) +
  ylab("Density") + xlab("Elapsed time (ms)") +
  geom_area(aes(x=ifelse(ETIME >= mean(ETIME)-2*sd(ETIME) &
                         ETIME <= mean(ETIME)+2*sd(ETIME), ETIME, 0)),
            fill="blue", alpha=0.5) +
  geom_vline(xintercept=mean(d$ETIME), color="red", alpha=0.5, size=1) +
  geom_area(aes(x=ifelse(ETIME >= 200, ETIME, 0)), fill="red", alpha=0.5) +
  geom_vline(xintercept=200, color="red", alpha=0.5, size=1)

# Probability of a 200+ millisecond run
# with average=100 milliseconds, stddev=40 milliseconds
1 - pnorm(200, mean=100, sd=40)
Back to reality – why “averages” are misleading
Our model seems to indicate that the “average” metric is a pretty good proxy for overall SQL performance.
But the devil is, again, in the details. We assumed (eh … modeled) that the elapsed times of our executions are normally distributed. But are they really?
To answer this question, let’s eschew modeling and actually measure each individual elapsed time to see where the results lead us.
And big surprise! The shape of the data (we are looking at elapsed time frequency histogram) looks nothing like the bell curve!
Now for the interesting part – look at the “average” value (represented by the red line). It is still 100 milliseconds, but it no longer seems to “describe” the shape, does it? The “average” is neither the most popular value, nor the longest, nor does it seem to hold any special place in the overall shape of the data.
To put it more clearly: knowing that the “average” value is 100 milliseconds tells us almost nothing about real query performance. In fact, it is very misleading, as most real times are either a lot smaller or a lot larger than the “average”.
For example, remember our estimate that only 0.6% of queries would cross the 200-millisecond threshold under the normal model? You can probably already see from the plot that this estimate is way off. In this sample, the actual measured share of such queries is 14% – certainly not a trivial number, and a big difference from the model.
R code to reproduce this example:
d <- read.table("http://intermediatesql.com/wp-content/uploads/2014/01/upsert_1hr.txt",
ggplot(d, aes(round(ETIME/1000))) + geom_histogram(binwidth=2) +
coord_cartesian(xlim=c(0, 600)) +
theme_minimal() + xlab("Elapsed time (milliseconds)") + ylab("Frequency") +
geom_vline(xintercept=mean(d$ETIME)/1000, size=1, color="red")
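With the same data frame d loaded, the measured share of slow executions can be checked directly (still assuming, as in the plot above, that ETIME is recorded in microseconds):

```r
# Measured fraction of executions slower than 200 milliseconds --
# compare with the 0.6% predicted by the normal model
mean(d$ETIME / 1000 > 200)
```

This one-liner is, in effect, the metric we actually care about: the share of executions that cross our “bad experience” threshold.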
So as not to base my conclusions on “just one example” (a dangerous thing to do), I timed and measured a few different queries of various complexity. The results are below and you can judge for yourself.
The bottom line is:
Without “normality” of the data distribution, the “average” metric can be “just a random number”.
It’s not very useful for performance tracking, at least in the general case.
Is there a better metric? Stay tuned – I will be talking about it in the next article.