[svlug] Processes that don't die with -9

Rick Moen rick at svlug.org
Thu Dec 4 13:45:56 PST 2014


Sarah Newman wrote:

> I've had times in the past where kill -9 did not work on a process.
> [...]  But now I'm wondering if killing a different thread (LWP) other
> than the primary *could* have worked, which as I learned last night
> are shown by adding '-L' to 'ps'.

Ooh, I think I can help with this!  First, a bit about threads, then about
signals.


I. Threads:

The -L option tells ps to show threads: POSIX threads aka pthreads aka
'LWPs' = light-weight processes.  The 'LWP' column shows the thread ID
('tid') of each thread, and the 'NLWP' column shows the number of threads in
its thread group.  The Linux kernel scheduler actually treats threads (LWPs)
and regular processes pretty much exactly the same way; the tid is just a
means for the scheduler to treat them as having unique identities for
scheduling purposes, even though the threads in a group all share the same
PID.

When you observe such a thread group, the first of them will have its
PID in both the PID column and the LWP one (its tid is the same number
as the PID).  That is the original process, the thread group leader; the
threads it then spawns get individual tids and all share its PID.
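
If you'd like to see the PID/tid split without squinting at ps output, here
is a minimal Python sketch.  The function name collect_ids is made up for
illustration; threading.get_native_id() (Python 3.8+) returns the
kernel-assigned thread ID, which on Linux should be the same number ps shows
in the LWP column.

```python
import os
import threading

def collect_ids(n_workers=3):
    """Return (pid, tid) for the calling thread, plus a list of (pid, tid) per worker."""
    results = []
    lock = threading.Lock()
    start = threading.Barrier(n_workers)

    def worker():
        start.wait()  # hold every worker here so they all exist at the same time
        with lock:
            results.append((os.getpid(), threading.get_native_id()))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (os.getpid(), threading.get_native_id()), results

if __name__ == "__main__":
    caller_ids, worker_ids = collect_ids()
    print("caller: pid=%d tid=%d" % caller_ids)
    for pid, tid in worker_ids:
        print("worker: pid=%d tid=%d" % (pid, tid))
```

Every line should report the same pid, while each worker reports its own
tid.  Run as a script on Linux, the caller is the program's first thread,
so I'd expect its tid to equal the pid.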

I wrote the above to lay the groundwork about threads.  They aren't 
hidden from the kill command.  If you have (say) mysql running as PID 
1527 and you say 'kill -9 1527'[0], you are going to send SIGKILL[1] to the
specified mysql process plus any POSIX threads it might have spawned --
because, you see, they will be running with PID 1527, too.
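
To check that for yourself, here is a rough Python sketch that starts a
child process with a couple of extra threads, sends SIGKILL to its PID, and
looks at how it ended.  The names CHILD and kill_threaded_child are
hypothetical, and the child program is only a stand-in for something like
mysql.

```python
import signal
import subprocess
import sys

# Stand-in child program: starts two extra threads, says so, then sleeps.
CHILD = r"""
import threading, time
for _ in range(2):
    threading.Thread(target=time.sleep, args=(60,), daemon=True).start()
print("ready", flush=True)
time.sleep(60)
"""

def kill_threaded_child():
    """Start the threaded child, SIGKILL it by PID, and return its exit status."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()            # block until the child has started its threads
    proc.send_signal(signal.SIGKILL)  # equivalent to 'kill -9 <pid>'
    status = proc.wait()              # collect the child's exit status
    proc.stdout.close()
    return status

if __name__ == "__main__":
    print(kill_threaded_child())
```

subprocess reports death-by-signal as a negative number, so the result
should be -9.  No per-thread signalling was needed: one signal aimed at the
PID took the threads down with it.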


II.  Signals:

I'm pretty sure your mystery is a more-general one, related not specifically
to threads but rather to how processes work and how signals work.  To begin
with, you are just sending a POSIX signal to the process.  What then happens
depends on several complicating factors that might stand in the way:

1. PID number:   If the PID number is 1, then the KILL signal is discarded
as a special case, and TERM gets through only if init has installed a
handler for it (because the system can't afford to lose init, which has
several core jobs to do).

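2. permissions:  _either_ your real or effective user ID must match the
target process's real or saved set-user-ID (roughly: the process is
running as you), _or_ you must be the root user.  If those conditions
aren't met, the signal is never sent at all: kill fails with 'Operation
not permitted' (EPERM).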
 
3. blockage:  In some circumstances, the target process may have blocked
some signals (though never KILL or STOP, which a process cannot block,
catch, or ignore), or delivery is being held off by the kernel on its
behalf.  Kernel code holds off all signals in any situation where
interrupting the system call would result in a badly formed data
structure somewhere in the kernel, or more generally in some kernel
invariant being violated.  So, if (due to a bug or misdesign) a system
call blocks indefinitely, there may effectively be no way to kill the
process.  (But the process _will_ be killed if it ever completes the
system call.)

So, for example, if the process is stuck in I/O and thus is executing
a kernel call for that purpose (for, say, a pending disk or NFS write),
signals will be blocked for it until it leaves the I/O state.  The ps or
top command will show such a process as being in state D (uninterruptible
sleep; the letter was originally for 'disk').

4. nonexistence:  entries marked Z (short for 'zombie') in ps/top output
are not actually processes, but rather are leftover entries in the
process table from a process that _did_ die but whose parent has not yet
collected its exit status with wait().  So, 'zombie' is actually a
misleading concept.  These aren't undead processes precisely; they're
more like mirages.  That's why you can't kill them.
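
The zombie case is easy to reproduce.  Below is a hedged, Linux-only Python
sketch (it reads /proc, and the names proc_state and zombie_demo are made up
for illustration): it starts a child that exits at once, deliberately does
not wait() for it, sends SIGKILL to the leftover entry, and only then reaps
it.

```python
import os
import signal
import subprocess
import sys
import time

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the entry is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Layout is "pid (comm) state ..."; comm itself may contain ')' or spaces.
    return data.rsplit(")", 1)[1].split()[0]

def zombie_demo(timeout=5.0):
    """Create a zombie, SIGKILL it, then reap it; return the three observed states."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])  # exits immediately
    deadline = time.monotonic() + timeout
    while proc_state(child.pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)                # poll; we have not called wait() yet
    before = proc_state(child.pid)
    os.kill(child.pid, signal.SIGKILL)  # accepted without error, changes nothing
    time.sleep(0.05)
    after = proc_state(child.pid)
    child.wait()                        # the parent reaps; the entry goes away
    return before, after, proc_state(child.pid)

if __name__ == "__main__":
    print(zombie_demo())
```

On a Linux box I'd expect ('Z', 'Z', None): the entry sits in Z state before
and after the SIGKILL, and vanishes only once the parent calls wait().
That is also the practical cure for a real zombie: get its parent to reap
it, or get rid of the parent so that init inherits and reaps it.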


I'll bet you _most_ of the 'processes' you were not able to kill with
'kill -9' would, upon examination, turn out to be in 'Z' state, i.e.,
nonexistent.  The rest would have been D-state processes on which
signals were blocked.
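
As a first step in that sort of examination, it helps to find out what
kill(2) itself reports, which separates 'the signal was never sent'
(point 2) from 'the signal was sent and nothing happened'.  A small Python
sketch, with probe being a hypothetical helper name; signal 0 is the null
signal, which performs the existence and permission checks without
delivering anything.

```python
import os
import subprocess
import sys

def probe(pid):
    """Send the null signal (0) to pid and describe what kill(2) reported."""
    try:
        os.kill(pid, 0)               # checks only; no signal is delivered
    except ProcessLookupError:
        return "no such process"      # ESRCH: nothing has that PID
    except PermissionError:
        return "not permitted"        # EPERM: a real signal would not be sent
    return "ok"                       # the PID exists and you may signal it

if __name__ == "__main__":
    print(probe(os.getpid()))         # our own PID
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()                      # reaped, so its PID is released
    print(probe(child.pid))           # normally "no such process"
```

Note that "ok" says only that the PID exists and that you're allowed to
signal it.  A zombie or a D-state process will also answer "ok", which is
why the state column in ps or top is the next thing to look at.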


[0] Which, by the way, you can alternatively state as 'kill -s KILL 1527'.

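[1] Using the kill(2) syscall to send a POSIX KILL signal to the cited
process number(s).  The signal is aimed at the process as a whole, and
since KILL cannot be caught, the kernel ends every POSIX thread (pthread)
in that thread group along with it.  (pthread_kill(3) is for a different
job: signalling one particular thread from inside the same process.)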


