[svlug] Protecting and recovering from high system load?
DzM
svlug at dzm.com
Mon Aug 28 13:27:51 PDT 2006
Over the last several months my colo server has had periods where its
load randomly explodes. It goes from around 0.75 to well over 200
very quickly. When this happens the box becomes, obviously, almost
entirely unresponsive (SSH login attempts go nowhere, HTTP requests go
unanswered, etc.), but pings continue to be answered with no
slowdown or problem.
I have sometimes been fortunate enough to be logged into the box when
this happens and been able to immediately start running kill -9 on the
PIDs at the top of top's output. More often than not, though, I've had
to call the ISP and have the power cycled in order to recover the machine.
So - the question - Is there a way to configure the kernel (or set up a
daemon, or something) that will monitor for explosive system load and
then take corrective action (even if that corrective action is as brute
force as forcing the kernel to bounce the system)?
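For illustration, the kind of monitoring daemon described above could
look roughly like the sketch below. It is written in Python (assuming a
reasonably recent interpreter is available on the box), polls
/proc/loadavg, and calls a corrective action only after the load has
stayed above a threshold for several samples in a row. The threshold,
strike count, poll interval, and the log-only placeholder action are all
assumptions made for the example, not tested values. Note also that a
userspace process like this may itself be starved when the load is
already over 200.

```python
import time

LOADAVG_PATH = "/proc/loadavg"  # standard procfs file on Linux
THRESHOLD = 50.0       # assumed: 1-minute load that counts as "exploding"
STRIKES_NEEDED = 3     # assumed: consecutive bad samples before acting
POLL_SECONDS = 10      # assumed: seconds between samples


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])


def read_load(path=LOADAVG_PATH):
    """Read the current 1-minute load average from procfs."""
    with open(path) as f:
        return parse_loadavg(f.read())


def update_strikes(strikes, load, threshold=THRESHOLD):
    """Count consecutive over-threshold samples; reset on a normal one."""
    return strikes + 1 if load > threshold else 0


def log_only_action():
    """Placeholder action: only reports. Swap in a restart or reboot."""
    print("load stayed above threshold; corrective action would run here")


def watch(get_load, act, strikes_needed=STRIKES_NEEDED,
          poll_seconds=POLL_SECONDS, sleep=time.sleep, max_polls=None):
    """Poll the load; call act() after strikes_needed bad samples in a row."""
    strikes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        strikes = update_strikes(strikes, get_load())
        if strikes >= strikes_needed:
            act()
            strikes = 0  # start counting again after acting
        sleep(poll_seconds)
        polls += 1


# To run it as a long-lived monitor:
#     watch(read_load, log_only_action)
```

Requiring several consecutive bad samples keeps a single short spike
from triggering the action; the action itself is passed in as a
function so that something gentler than a reboot can be tried first.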
I'm desperate here. I've got new hardware on order, and I've got plans
to migrate off some of the older software running on the machine, but
for now I have to deal with the problem. I need some kind of band-aid to
put on this.
The kernel is:
Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386 GNU/Linux
I appreciate any guidance.
Thanks!