How to freeze ORACLE database by strace
Sometimes when you trace things, you can discover some really interesting (and unexpected) stuff.
For example, here is a simple way to “freeze” your ORACLE database, which I “discovered” while tracing system calls in LGWR process (ORACLE 11.2.0.2 on Linux 2.6.18 ×64).
Here goes … To freeze your ORACLE database start with the following unassuming strace command on your LGWR:
Linux> strace -p $(pgrep -fx ora_lgwr_$ORACLE_SID) -o /tmp/l.out -T &
[1] 2402
Process 19998 attached - interrupt to quit
At this point, nothing bad happens yet, trace data is flowing into /tmp/l.out and the database is, in general, happy.
Weirdness starts when you detach (or kill) your strace process.
Process 19998 detached
At the first glance, nothing seems to be wrong, the database is still running, sessions are connected and executing queries etc.
But when you try to COMMIT (or execute DDL which commits implicitly)
Your command will wait … and wait … and wait …
v$session will show that your session (along with all others trying to do commits) is waiting for log file sync wait event. But why ?
A quick look at LGWR process state clarifies the issue:
PID TTY STAT TIME COMMAND
19998 ? Ts 0:00 ora_lgwr_test11
According to ps manual, T state means that the process is “Stopped, either by a job control signal or because it is being traced” (the latter is a lie because tracing has just stopped). And, unfortunately in this case, unless something happens, LGWR will be in this state forever stalling all database activity.
So, the question becomes, how can we “unstop” the LGWR ?
Surprisingly, the most straightforward method is to attach strace to LGWR again. This unfreezes the database and let transactions flow. Unfortunately, that also means that trace information is also collected, which generally slows things down.
To avoid it, a better way is to send LGWR (POSIX) SIGCONT signal to nudge it into “ready to run” state.
Linux> ps $(pgrep -fx ora_lgwr_${ORACLE_SID})
PID TTY STAT TIME COMMAND
19998 ? Ss 0:00 ora_lgwr_test11
which solves the issue.
I’ve seen it on a couple of my systems and it “works” with both 11.1.0.7 and 11.2.0.2 databases. This seems like a bug and I do not know whether it only applies on my particular strace/Linux/ORACLE version(s), but, well, sometimes you just have to deal with the problems like that …
Cheers!
February 21st, 2011 - 02:20
I have similar experience with 11g database running on HP-UX Itanuim, when running truss command to trace dbconsole, which freezes the command “emctl start dbconsole”