Intermediate SQL Color Coded SQL, UNIX and Database Essays

2Feb/142

How to track SQL performance. Part 3: Measuring percentiles

In the previous article, I introduced the concept of percentiles and you can hopefully see that measuring percentiles is important to track SQL performance as it makes performance metrics a lot more precise and meaningful.

Unfortunately, ORACLE does not collect percentile data out of the box, so we need to figure out a way to do it ourselves.

A couple of general notes on collection process

In order to come up with a precise summary metric (such as a percentile), we need to have precise and complete RAW data, in other words:

To be 100% precise, we need to collect ALL data points

Keep in mind, however, that while collecting all data points should be our goal, practical reasons sometimes might prevent it from happening.

I.e. RAW data can be huge and keeping and processing it might require unreasonable resources. Alternatively, the collection method might be very intrusive and could adversely affect performance.

So, practically speaking we might find ourselves trading some precision for a “lighter resource use” by i.e. resorting to random sampling of individual runs instead of collecting them all.

We might also, at times, trade precision for convenience i.e. if alternative methods give us a more usable data.

An important warning: if we start collecting partial RAW data, it is important to watch ourselves so that we do not introduce any systematic bias into data “selection” process.

Collecting partial raw data ? Watch for systematic “selection” bias

Now that you have been warned, let’s look at a few methods how the data for percentile calculations can be collected:

Client side tracing

Client side tracing is probably the most useful method, but it also is the most difficult to achieve as you need to (have developers) instrument your application code.

The idea here is pretty simple. Whenever client app executes a SQL, i.e.

exec_sql("MERGE INTO orders_table USING dual ON ...", bind_list)

it should surround that call with time measurement instrumentation, something like:

start_time = date()
exec_sql("MERGE INTO orders_table USING dual ON ...", client_id=..., bind_list)
end_time = date()

log_raw_exec(timestamp=date(), sql_id=..., client_id=...,
  bind_list=..., elapsed=end_time-start_time)

You might initially think, wow, this is going to need a lot of storage! But the storage requirements are actually pretty modest.

I.e., assuming, 120 executions per second and 1000 bytes to store a log record (usually enough to record all relevant information, including bind data), we are talking about:

120 * 1000 * 3600*24 = ~ 10 Gb per day

10 Gb is not a lot of storage and storage is cheap these days.

A bigger issue is how to handle raw data collections and percentile calculations from multiple application hosts, but solving it is beyond the scope of this article (suffice it to say, there are multiple ways to do it).

The great benefit of client side tracing is that you get a TRUE performance metric: i.e. query latency that the client actually experiences. It also has a potential to capture all RAW data points (or do the non-biased random sampling) with relative ease.

Another major benefit is that you can capture and store performance metrics (“percentiles”) continuously – if required infrastructure is in place, it is fairly simple to extend it to capture all the (important) sqls and do it all the time.

On the downside, implementing client side metrics requires a lot of initial effort (albeit, an effort well spent). It is something that data administrators can not directly control and have to rely on other people to implement.

Finally, client side metrics measure overall latency, not necessarily related to what happens in the database itself (i.e. measured SQL latency may be caused by network or by application box running with 100% CPU utilization).

Server side (10046) tracing

Even if you collect client side SQL latency metrics, server side metrics might still be useful as they can help answer the question:

Is SQL latency caused by the database ?

The idea of ORACLE server side metrics is very similar to client side metrics that we discussed before. ORACLE uses similar mechanism to time important system and database calls and can expose these timings in its trace files.

Unfortunately (or fortunately, depending on what you are doing), server side traces will not show times for individual SQL executions, but rather for sub components of them.

I.e. an individual UPSERT run can go through a parsing stage, then read 100 index and 50 table blocks (incurring 150 “db file sequential read” events in the process), then have 7 separate concurrency waits while updating the data. The client side metric in this case will record just one overall time for this statement, but in a server trace you will see 158 individual timed events and you will need to apply some effort to “merge” them to see “overall execution time”.

Server traces are also spread out between multiple trace files and you need to work (and be very careful) to congregate and filter them correctly.

Collecting 10046 trace data for a specific sql_id is pretty easy. Starting with (ORACLE 11.1, I believe), you can run the following commands:

define SQL_ID=...
ALTER system SET events 'sql_trace [sql:&SQL_ID] wait=true, bind=true';
EXEC dbms_lock.sleep(&SLEEP_SECONDS);
ALTER system SET events 'sql_trace [sql:&SQL_ID] off';

run_10046_for_sql.sql script attached in the tools section can do it for you.

SQL> @run_10046_for_sql.SQL AA 60 fdcz4kx11era5

== TRACE DIR
/u01/diag/rdbms/db1/db1/trace

== NOW file
db1_ora_9049_MRTRACE_AA.trc

== RELEVANT TRACE FILES
db1_ora_16197.trc db1_ora_16743.trc db1_ora_16762.trc db1_ora_17147.trc ...

Organizing 10046 trace data and calculating percentiles on the other hand is very complex and is (sic) beyond the scope of this article. If you are interested in pursuing it, I highly recommend Cary Millsap’s MrTools package that make this process a lot easier.

V$SQL sampling

Client and server side tracing are both great tools, but unfortunately, they either require extensive infrastructure (that somebody has to build and maintain) or highly specialized tools.

Wouldn’t it be nice to be able to calculate percentiles using generic, out of the box tools that are available everywhere and are easy to use ? After all, ORACLE is an extremely well instrumented database.

It turns out, there is a number of ways to do it, if we approach the problem creatively.

Let’s start with V$SQL view.

As you know, it has both EXECUTIONS and ELAPSED_TIME columns, which makes it easy to calculate “elapsed time” per execution:

SELECT elapsed_time, executions, elapsed_time/executions
FROM v$sql
WHERE sql_id='&SQL_ID' AND child_number=&CHILD_NO
/

           ELAPSED_TIME EXECUTIONS ELAPSED_TIME/EXECUTIONS
----------------------- ---------- -----------------------
           229982503701    1673471              137428.437

Of course, the problem is, as we discussed, is that “elapsed_time/executions” averages data since the start of the instance and does not represent the “actual” execution times well.

However, what if we could query V$SQL with very high frequency ?

I.e. what if the first query against V$SQL produces these numbers:

           ELAPSED_TIME EXECUTIONS
----------------------- ----------
           229982503701    1673471

and the next query a few microseconds later produces these numbers:

           ELAPSED_TIME EXECUTIONS
----------------------- ----------
           229983016048    1673472

Notice that 1673472-1673471 = 1 execution, which means that we just collected elapsed_time (as well as other metrics) for a single run of this statement, and its elapsed time was: 229983016048-229982503701=512347 (microseconds).

You might think that you need to write a complex PL/SQL procedure to do the sampling, but there is a much better way, which Tanel Poder pioneered in his waitprof.sql script.

The trick here is that V$SQL is not a “real” table – it is a memory array, which does not have read consistency. As a result, sampling mechanism can be implemented by a single SQL that queries V$SQL over and over using nested loops and each nested loop iteration (potentially) sees a new data point.

Since this method does not need to continuously switch between SQL and PL/SQL engines, it can query v$SQL with a much higher frequency, generally resulting in more data points collected and better precision.

Ok, so how can we do that ?

I attached a few scripts that implement this method. Let me show you how to use them:

First of all, let’s check out the current “average” elapsed time since the start of the instance:

SQL> @sqlc fdcz4kx11era5

                                     Gets    Ela (ms) LAST
  C#   Plan hash   EXECUTIONS       pExec       pExec Active
---- ----------- ------------ ----------- ----------- ------------
   2   245875337    1,700,541      444.62      137.57 +0 00:00:01
   7   245875337            2       23.50       21.39 +0 01:15:16
   3   245875337            1       26.00       10.38 +27 04:42:52

Now, let’s sample the same sql (child cursor 2) 1000 times and see the raw times for (mostly) individual runs of this SQL:

SQL> @ssql fdcz4kx11era5 2 1000

           Elapsed      CPU           IO      App       CC
S  Ex         TIME     TIME         TIME     TIME     TIME   Pct

- --- ------------ -------- ------------ -------- -------- -----
    1          330        0            0        0        0     0
    1          340    1,000            0        0        0  3.33
    1          786      999            0        0        0  6.67
    1        1,518    2,000          188        0        0    10
*   2       11,963    1,999       11,103        0        0 13.33
    1       14,851    4,999       10,908        0        0 16.67
    1       15,724    2,000       14,780        0        0    20
    1       16,471    2,000       15,163        0        0 23.33
*   2       19,187    1,000       18,521        0        0 26.67
*   2       22,613    1,500       20,679        0        0    30
    1       23,958    1,999       22,920        0        0 33.33
    1       25,594    3,999       21,342        0        0 36.67
    1       26,490    1,999       25,817        0        0    40
    1       28,663    6,999       23,057        0        0 43.33
*   2       36,138    1,000       35,450        0        0 46.67
    1       43,029    3,000       41,486        0        0    50
    1       46,247    1,000       45,130        0        0 53.33
    1       49,952    2,000       48,060        0        0 56.67
    1       55,507    1,999       54,435        0        0    60
    1       59,578        0       58,358        0        0 63.33
    1       63,472    6,999       58,031        0        0 66.67
*   2       66,607    1,000       65,632        0        0    70
    1       68,861      999       67,735        0        0 73.33
    1       71,369    4,999       67,908        0        0 76.67
    1       77,812    5,999       72,675        0        0    80
    1       88,753    6,999       82,838        0        0 83.33
    1       90,256    5,999       87,365        0        0 86.67
    1       97,171    2,000       93,585        0       27    90
    1      120,635    1,999      117,660        0        0 93.33
    1      142,201    6,999      138,853        0        0 96.67
    1      167,552    4,998      165,333        0        0   100

Notice that for the the most part, we were able to isolate single executions.

The nice part about this data is that we can see what the actual distribution of execution times look like. And check out “pct” columns – these are the “elapsed time” percentiles. I.e., based on this sample, the worst 10% of executions are at least as bad as 97,171 microseconds or 97 milliseconds, which is not bad at all.

We can also do percentile analysis directly, by bucketing and summarizing the data.

I.e. let’s calculate p10, p20, …, p90 percentiles by splitting the range of values into 10 buckets and assigning the same number of executions in each bucket.

SQL> @ssql2 fdcz4kx11era5 2 50000 avg 10

=======================================================================================
Sampling sql_id=fdcz4kx11era5:2
MERGE INTO orders_table USING dual ON ...

GOAL: Calculate "elapsed time" percentiles (10 buckets, ~ same # OF executions IN each)

50000 times. Running elapsed_time average=    136.36 ms

Displaying: "avg" metric IN a bucket (ALL times are IN milliseconds)
=======================================================================================

             Elapsed                                CPU          IO
Pct    Execs TIME                                  TIME        TIME
--- -------- ------------------------------ ----------- -----------
p0       148 .23-7.11                               .89        2.30
p10      148 7.18-14.03                            1.11        9.44
p20      146 14.03-20.26                           1.48       15.82
p30      143 20.39-29.01                           1.86       22.92
p40      146 29.1-40.73                            1.91       32.63
p50      143 40.77-55.21                           2.37       45.50
p60      142 55.22-77.92                           3.15       63.09
p70      145 77.99-113.33                          3.58       90.72
p80      141 113.41-173.64                         4.46      136.22
p90      138 174.34-634.15                         6.83      245.30

Another way to look at the same data is to split the range into 10 buckets, but now making buckets cover equal time ranges and tracking how many executions fall into each bucket. This was, we can produce a frequency histogram and see if our data is, i.e. normally distributed or not:

SQL> @ssql3 fdcz4kx11era5 2 50000 avg 10

===================================================================================
Sampling sql_id= fdcz4kx11era5:2
MERGE INTO orders_table USING dual ON ...

GOAL: Calculate "elapsed time" histogram (10 buckets, ~ same TIME range IN each)

50000 times. Running elapsed_time average=    136.28 ms

Displaying: "avg" metric IN a bucket (ALL times are IN milliseconds)
===================================================================================


                                                    Elapsed         CPU          IO
Bucket Range (ms)              Execs Graph             TIME        TIME        TIME
------ -------------------- -------- ---------- ----------- ----------- -----------
     1 .19-51.81                 686 ##########       22.39        1.51       20.91
     2 51.81-103.44              303 ####             76.37        2.89       73.75
     3 103.44-155.07             198 ##              127.59        3.55      124.23
     4 155.07-206.69              91 #               174.25        4.68      169.82
     5 206.69-258.32              46                 224.91        5.47      220.11
     6 258.32-309.95              22                 267.26        6.90      261.46
     7 309.95-361.57               7                 339.04        9.00      331.30
     8 361.57-413.2                8                 264.19        6.90      258.24
     9 413.2-464.83                3                 318.62        6.00      311.41
    10 464.83-516.45               2                 492.26       10.00      483.53

All in all, there is a lot of flexibility in this approach – it should work for every SQL in every database, without the need for any infrastructure or additional database objects. We can also join V$SQL with other dictionary views, i.e. V$SESSION to collect even more data about the runs, which machines they are coming from, what wait events were active at the time of collection etc

There is a couple of downsides to be aware of.

First of all, sampling may not be precise enough to capture every single sql execution, some runs are so fast that single nested loop captures “many” executions (in which case, you are still dealing with average metrics, but on a much smaller scale).

Also, while V$SQL captures “server side” times, i.e. IO time, CPU time or APP wait time (waits on “enq: TX row lock contention”) etc, client/server network communication times are not captured and they could be significant in some cases, i.e. when a SQL performs multiple fetches.

There is also a nagging suspicion that V$SQL, being not read consistent, might not present accurate data in the first place. I.e. if “elapsed time” is updated independently from “executions”, our sample might capture them at different time points. In my tests however, I got a pretty accurate mapping of client elapsed times to elapsed times sampled from V$SQL, so this does not (so far) appears to be a major deal breaker for me.

SQL monitoring

Another way to time individual sql runs is to use ORACLE sql monitoring infrastructure. ORACLE, by default, already tracks all queries that run for more than 5 seconds and exposes their metrics in v$SQL_MONITOR view.

However, what if we are interested in sub second queries – is there anything we can do to capture them ?

ORACLE allows some flexibility in sql monitoring thresholds, i.e. _sqlmon_threshold parameter controls how long a query needs to run (in seconds), before ORACLE makes it “monitored”. Unfortunately its granularity is in seconds, so if your query runs in under a second, you seem to be out of luck.

SQL> @hidden sqlmon

NAME                     VALUE   DESCRIPTION
------------------------ ------- ----------------------------------------------------
_sqlmon_binds_xml_format default format of column binds_xml in [G]V$SQL_MONITOR
_sqlmon_max_plan             480 Maximum number of plans entry that can be monitored.
                                 Defaults to 20 per CPU
_sqlmon_max_planlines        300 Number of plan lines beyond which a plan cannot be
                                 monitored
_sqlmon_recycle_time          60 Minimum time (in s) to wait before a plan entry can
                                 be recycled
_sqlmon_threshold              5 CPU/IO time threshold before a statement is monitored.
                                 0 is disabled

But all is not lost. ORACLE also allows you to supplant queries with /*+ MONITOR */ hint, in which case, it will be monitored regardless of how much (or, more precisely: how little) time it takes.

There are multiple ways to dynamically add hints to queries without modifying their text. My favorite tool to do that is SQL profiles but SPM baselines or “patches” can do the job equally well.

Since the logic here is fairly involved (add /*+ MONITOR */ hint, wait until new cursor appears, stop capturing when V$SQL_MONITOR is “full”), I encapsulated it inside sqltrce_pkg package, which you can find attached.

Once the hint is added to your query and capture is enabled, you’ll start seeing individual runs (SQL_EXEC_IDs) appearing in V$SQL_MONITOR view. When you get enough “runs”, you can perform the same type of percentile or frequency analysis that we discussed in the previous section on V$SQL sampling.

Unfortunately, default capture size with SQL monitoring is pretty small. I.e. in my case it is maxed out at 480 individual executions (_sqlmon_max_plan) and while you can increase it, it requires a database bounce.

Also, the space to hold monitoring data gets allocated out of the shared pool (thanks, Doug), so it probably is more suitable to capture hundreds of executions anyway, rather than hundreds of thousands.

For that reason, I prefer V$SQL sampling to V$SQL_MONITOR as it can capture hours worth of data, even for high frequency OLTP queries without any issues. Capturing individual runs by SQL monitoring on the other hand will likely max out in minutes (or, even, seconds).

There are a couple of benefits that SQL monitoring gives you over V$SQL sampling.

First of all, it’s low resource utilization. ORACLE does not need to do a lot of extra stuff when monitoring is enabled and, as a result, monitoring adds no perceptible additional load to the system.

monitor_cpu_utilization

Sampling, on the other hand, is fairly intrusive. It will essentially lock one of the CPU cores entirely for itself during the time when it runs.

sample_cpu_utilization

However, this “CPU grabbing” effect is not as bad as it sounds. I.e. on a system with 24 CPU cores (which is not top of the line nowadays by any means), at max 1/24 or 4.17% of CPU capacity will be taken away from the system by the sampling process. Unless your system is redlining on CPU it is pretty negligent.

If you are running ORACLE 11.2+, SQL monitoring also captures bind data, which could come very useful in further performance analysis.

Conclusions

I hope this presentation showed a few viable ways to track and measure percentile metrics.

What is the best way ?

My personal preference is to use client side tracing as a primary method and supplement it with V$SQL sampling to weed out non-database related waits. But your mileage can vary, of course.

The bottom line is that all of these methods will give you a lot more precise and relevant performance data than relying on out of the box “averages” and more knowledge can never hurt.

So, happy tracing (or sampling and monitoring), my friends :-)

Tools

If you are interested in researching it further on your own, I attached a number of tools that I’ve been using to capture SQL performance data.

Let me give you a brief overview of them.

The workhorse of the toolset are ssql*.sql (“sample sql”) scripts – which sample V$SQL view with high frequency.

  • ssql.sql will capture and display “individual runs”
  • ssql2.sql will capture “individual runs” and bucket them into percentile ranges (the same number of executions in each bucket), so that you can see timings separately for p10, p90, p99 etc
  • ssql3.sql will capture “individual runs” and bucket them into equal “time ranges” to produce frequency histograms

These scripts do not need any objects created and do not store anything in a database.

If you want to store sampling capture results in the database (i.e. for further analysis), sqltrce_pkg PL/SQL can help you with that.

sqltrce_pkg.sample_sql() routine will sample SQL for a specified time duration by running a continuous sequence of samples against V$SQL and dumping results into sqltrce_results table. Results can be either kept in memory during the entire runtime or dumped to the table periodically (which let’s you run sampling for “hours” without jeopardizing the system). Routine also has an ability to find and save relevant ASH entries (for the SQL that you are sampling) as well as find optimal sample size by exponentially scaling “seed size” when necessary.

Once results are captured in sqltrce_results, you can query them with tsql.sql (“raw data”), tsql2.sql (“percentiles”) and tsql3.sql (“frequency histograms”).

If you want to collect data through SQL monitoring, sqltrce_pkg.monitor_sql() routine can help with that. Routine will add /*+ MONITOR */ hint by a SQL profile, wait until the first execution of the new cursor and start collecting executions until enough of them is there (“MAX executions to collect” is a parameter). It will then save executions to sqltrce_sql_monitor table for further analysis and can also save relevant ASH entries.

Once the data is in sqltrce_sql_monitor, you can query it with msql.sql (“raw data”), msql2.sql (“percentiles”) and msql3.sql (“frequency histograms”).

Comments (2) Trackbacks (0)
  1. HI,

    I came to know about your site today , and i cannot stop myself reading your blogs. I am going to use your analysis to find SQL’s which today are running fine (under 5 secs ) and tomorrow can cause a problem when data in the table increases.
    Devendra

  2. Hello Devendra,

    I’m glad that you found my notes useful :-) Good luck running your queries and be sure to share if you find any interesting use cases or problems with the approach or the scripts.

    Cheers,
    Maxym Kharchenko


Leave a comment

No trackbacks yet.