[svlug] Protecting and recovering from high system load?

Tin Le tin at le.org
Mon Aug 28 22:03:00 PDT 2006


Yeah, sometimes sh*t happens and you just don't have the time to chase
down a root cause... especially when engineering is breathing down your
neck about "their" network :-), the veeps want to know why their
email is taking too long, etc.

I've used fallback-reboot successfully on 2.4 and 2.6 kernels.  I used to
have a problem server in a remote colo that I couldn't get to easily (it
literally took days to schedule a visit).  Similar problem to yours:
only a hard reset would fix it.

I've been using fallback-reboot for a while.  Works great for me.

http://stromberg.dnsalias.org/~strombrg/fallback-reboot/

Read the install doc _VERY_ carefully, as you could leave your server
open for anyone to bounce.  I keep the key on my USB fob...

Once you have time, I'd suggest following the others' advice and
collecting logs around the times when it seems to have problems the most.
The more logs you have, the better the chance of finding the root cause.
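
For the log collecting part, something along these lines might be a
starting point.  It is only a rough sketch, not a tested tool: it assumes
Python 3 and a procps-style ps are on the box, and the threshold, polling
interval, log path and function names are all made up for illustration.
It polls /proc/loadavg and appends a process listing to a log file
whenever the 1-minute load crosses the threshold, so you have something
to look at after the next blow-up.

```python
import subprocess
import time

LOAD_THRESHOLD = 5.0                 # assumed trigger level; tune for your box
POLL_SECONDS = 10                    # assumed polling interval
LOG_PATH = "/var/tmp/loadwatch.log"  # hypothetical log location


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def should_snapshot(load1, threshold=LOAD_THRESHOLD):
    """True when the 1-minute load average has reached the threshold."""
    return load1 >= threshold


def process_snapshot():
    """Return a process listing from ps, including the state column.

    Processes in state D (uninterruptible sleep) count toward the Linux
    load average, so they are worth seeing alongside the CPU hogs.
    """
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,stat,pcpu,pmem,args"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def watch(max_polls=None):
    """Poll the load average; append a snapshot to LOG_PATH when it is high."""
    polls = 0
    while max_polls is None or polls < max_polls:
        with open("/proc/loadavg") as f:
            loads = parse_loadavg(f.read())
        if should_snapshot(loads[0]):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            with open(LOG_PATH, "a") as log:
                log.write("=== %s load %.2f %.2f %.2f ===\n" % ((stamp,) + loads))
                log.write(process_snapshot())
        polls += 1
        time.sleep(POLL_SECONDS)
```

If you save it as loadwatch.py (again, just a name I picked), you could
start it with: python3 -c "import loadwatch; loadwatch.watch()"
Keep the threshold low enough that the snapshot is taken while the box
can still fork ps; once load is in the hundreds it may be too late.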

Cheers,
Tin Le
-- 
"Never continue in a job you don't enjoy. If you're happy in what you're
doing, you'll like yourself, you'll have inner peace. And if you have
that, along with physical health, you will have had more success than you
could possibly have imagined." - Johnny Carson (1925-2005)

> My colo server box over the last several months has periods where it
> randomly explodes in load. It goes from around 0.75 to well over 200
> very quickly. When this happens it becomes, obviously, almost entirely
> unresponsive (SSH login attempts go nowhere, HTTP requests go
> unanswered, etc.), but pings continue to be answered with no
> slowdown or problem.
>
> I have sometimes been fortunate enough to be logged into the box when
> this happens and be able to immediately begin kill -9 the PIDs that seem
> to be on top of top. More often than not though I've had to call the ISP
> and have the power cycled in order to recover the machine.
>
> So - the question - Is there a way to configure the kernel (or set up a
> daemon, or something) that will monitor for explosive system load and
> then take corrective action (even if that corrective action is as brute
> force as forcing the kernel to bounce the system)?
>
> I'm desperate here. I've got new hardware on order, and I've got plans
> to migrate off of some of the older software running on the machine, but
> for now I have to deal with the problem. I need some kind of bandaid to
> put on this.
>
> The kernel is:
>
> Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386
> GNU/Linux
>
> I appreciate any guidance.
>
> Thanks!





