Intermediate SQL Color Coded SQL, UNIX and Database Essays

18Feb/112

How to freeze ORACLE database by strace

Sometimes when you trace things, you can discover some really interesting (and unexpected) stuff.

For example, here is a simple way to “freeze” your ORACLE database, which I “discovered” while tracing system calls in LGWR process (ORACLE 11.2.0.2 on Linux 2.6.18 ×64).

Here goes … To freeze your ORACLE database start with the following unassuming strace command on your LGWR:

# Start STRACE for your LGWR
Linux> strace -p $(pgrep -fx ora_lgwr_$ORACLE_SID) -o /tmp/l.out -T &
[1] 2402
Process 19998 attached - interrupt to quit

At this point, nothing bad happens yet, trace data is flowing into /tmp/l.out and the database is, in general, happy.

Weirdness starts when you detach (or kill) your strace process.

Linux> kill %1
Process 19998 detached

At the first glance, nothing seems to be wrong, the database is still running, sessions are connected and executing queries etc.

But when you try to COMMIT (or execute DDL which commits implicitly)

SQL> CREATE TABLE t (n NUMBER);

Your command will wait … and wait … and wait …

v$session will show that your session (along with all others trying to do commits) is waiting for log file sync wait event. But why ?

A quick look at LGWR process state clarifies the issue:

Linux> ps $(pgrep -fx ora_lgwr_$ORACLE_SID)
  PID TTY      STAT   TIME COMMAND
19998 ?        Ts     0:00 ora_lgwr_test11

According to ps manual, T state means that the process is “Stopped, either by a job control signal or because it is being traced” (the latter is a lie because tracing has just stopped). And, unfortunately in this case, unless something happens, LGWR will be in this state forever stalling all database activity.

So, the question becomes, how can we “unstop” the LGWR ?

Surprisingly, the most straightforward method is to attach strace to LGWR again. This unfreezes the database and let transactions flow. Unfortunately, that also means that trace information is also collected, which generally slows things down.

To avoid it, a better way is to send LGWR (POSIX) SIGCONT signal to nudge it into “ready to run” state.

Linux> kill -SIGCONT 19998
Linux> ps $(pgrep -fx ora_lgwr_${ORACLE_SID})
  PID TTY      STAT   TIME COMMAND
19998 ?        Ss     0:00 ora_lgwr_test11

which solves the issue.

I’ve seen it on a couple of my systems and it “works” with both 11.1.0.7 and 11.2.0.2 databases. This seems like a bug and I do not know whether it only applies on my particular strace/Linux/ORACLE version(s), but, well, sometimes you just have to deal with the problems like that …

Cheers!

Comments (2) Trackbacks (0)
  1. I have similar experience with 11g database running on HP-UX Itanuim, when running truss command to trace dbconsole, which freezes the command “emctl start dbconsole”

  2. Check version of strace being used.

    $ rpm -qp –changelog strace-4.5.18-5.el5_5.5.x86_64.rpm | grep -i 594616
    – Handle special ptrace signal during detach (#594616)

    This update fixes the following bug:

    • when detaching from a process, a misinterpreted status caused strace to always
      leave the process being traced in a stopped state. With this update, the process
      is left in the correct state after detaching. (BZ#594616)

Leave a comment

No trackbacks yet.