[svlug] Protecting and recovering from high system load?

DzM svlug at dzm.com
Mon Aug 28 13:27:51 PDT 2006


Over the last several months, my colo server has had periods where its 
load randomly explodes. It goes from around 0.75 to well over 200 
very quickly. When this happens the box obviously becomes almost entirely 
unresponsive (SSH login attempts go nowhere, HTTP requests go 
unanswered, etc.), but pings continue to be answered with no 
slowdown or problem.

I have sometimes been fortunate enough to be logged into the box when 
this happens and have been able to immediately start running kill -9 on 
the PIDs at the top of top. More often than not, though, I've had to 
call the ISP and have the power cycled in order to recover the machine.

So - the question - Is there a way to configure the kernel (or set up a 
daemon, or something) that will monitor for explosive system load and 
then take corrective action (even if that corrective action is as brute 
force as forcing the kernel to bounce the system)?
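
To make the question concrete, the kind of watchdog I have in mind might 
look roughly like the sketch below. This is only an illustration in 
Python, not something running on the box. The threshold, the poll 
interval, the function names (read_load1, watch, reboot) and the choice 
of "shutdown -r now" as the corrective action are all assumptions made 
for the example. It reads the 1-minute load average from /proc/loadavg 
and acts only after the load has stayed above the threshold for several 
consecutive polls.

```python
import subprocess
import time


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average (first field of /proc/loadavg)."""
    with open(path) as f:
        return float(f.read().split()[0])


def reboot():
    """Hypothetical corrective action: ask init for an immediate reboot.

    Assumption: the process runs as root. Under extreme load, starting a
    new process may itself be slow, so this is not guaranteed to help.
    """
    subprocess.call(["/sbin/shutdown", "-r", "now"])


def watch(threshold, action, poll_seconds=5, sustained_polls=3,
          path="/proc/loadavg", max_polls=None, sleep=time.sleep):
    """Poll the load; call action() once it stays high long enough.

    threshold       -- load value treated as "explosive" (an assumption)
    sustained_polls -- consecutive high readings required before acting
    max_polls       -- stop after this many polls (None = run forever)
    Returns True if action() was called, False if max_polls ran out.
    """
    over = 0    # consecutive polls at or above the threshold
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if read_load1(path) >= threshold:
            over += 1
        else:
            over = 0  # load recovered, start counting again
        if over >= sustained_polls:
            action()
            return True
        sleep(poll_seconds)
    return False
```

Something like watch(50.0, reboot) started at boot as root would be the 
brute-force version. Requiring several consecutive high readings is 
meant to avoid rebooting on a short spike. Whether a user-space process 
like this would even get scheduled at a load of 200 is part of what I am 
asking.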

I'm desperate here. I've got new hardware on order, and I've got plans 
to migrate off of some of the older software running on the machine, but 
for now I have to deal with the problem. I need some kind of bandaid to 
put on this.

The kernel is:

Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386 GNU/Linux

I appreciate any guidance.

Thanks!



