[svlug] Protecting and recovering from high system load?

Sargun Dhillon xbmodder at gmail.com
Mon Aug 28 22:39:17 PDT 2006


Try a server distro such as Debian. I can personally look at the boxen if
you would like; please send me an e-mail with your direct contact info. I
would think that you could figure out which process is crashing the box
and then set resource limits on it. Also, give ssh, bash, etc. better
(lower) nice values so that you can still log in when the load spikes.
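
A minimal sketch of the limits-plus-nice idea, assuming Python 3 is
available on the box (it would not be on a stock 2.4.20-era install, so
treat this as an illustration only). The names run_limited and boost, and
the default numbers, are made up for this example; the right limits depend
on whichever process turns out to be the culprit.

```python
import os
import resource
import subprocess


def _cap(kind, value):
    """Lower both the soft and hard limit for `kind` to `value`."""
    _, hard = resource.getrlimit(kind)
    # An unprivileged process cannot exceed its current hard limit.
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(argv, cpu_seconds=60, max_bytes=512 * 1024 * 1024,
                niceness=10, **kwargs):
    """Run argv with a CPU-time cap, an address-space cap, and lower priority.

    Hypothetical helper: the defaults are placeholders, not recommendations.
    """
    def apply_limits():
        # Runs in the child after fork() and before exec().
        _cap(resource.RLIMIT_CPU, cpu_seconds)   # total CPU seconds
        _cap(resource.RLIMIT_AS, max_bytes)      # virtual address space
        os.nice(niceness)                        # higher nice = lower priority

    return subprocess.run(argv, preexec_fn=apply_limits, **kwargs)


def boost(pid, niceness=-5):
    """Give an existing process (e.g. sshd) a better nice value.

    Negative nice values require root.
    """
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

Something like run_limited(["/path/to/suspect-daemon"]) would then start
the suspect under those caps, and boost(<pid of sshd>) run as root would
keep logins responsive. The same effect is available without Python via
the shell's ulimit builtin plus nice and renice.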

On 8/28/06, Tin Le <tin at le.org> wrote:
>
> Yeah, sometimes sh*t happens and you just don't have the time to chase
> down a root cause, especially when engineering is breathing down your
> neck about "their" network :-), the veeps want to know why their
> email is taking too long, etc.
>
> I've used fallback-reboot successfully on 2.4 and 2.6 kernels.  I used to
> have a problem server in a remote colo that I couldn't get to easily (e.g.
> it took literally days to schedule a visit).  Similar problem to yours:
> only a hard reset would fix it.
>
> I've been using fallback-reboot for a while.  Works great for me.
>
> http://stromberg.dnsalias.org/~strombrg/fallback-reboot/
>
> Read the install doc _VERY_ carefully, as you could leave your server
> open for anyone to bounce.  I keep the key on my usb fob...
>
> Once you have time, I'd suggest following the others' suggestions and
> collecting logs around the times when it seems to have problems the most.
> The more logs you have, the better the chance of finding the root cause.
>
> Cheers,
> Tin Le
> --
> "Never continue in a job you don't enjoy. If you're happy in what you're
> doing, you'll like yourself, you'll have inner peace. And if you have
> that, along with physical health, you will have had more success than you
> could possibly have imagined." - Johnny Carson (1925-2005)
>
> > My colo server box has, over the last several months, had periods where
> > it randomly explodes in load. It goes from around 0.75 to well over 200
> > very quickly. When this happens it becomes, obviously, almost entirely
> > unresponsive (SSH login attempts go nowhere, HTTP requests go
> > unanswered, etc.), but pings continue to be answered with no
> > slowdown or problem.
> >
> > I have sometimes been fortunate enough to be logged into the box when
> > this happens and been able to immediately start running kill -9 on the
> > PIDs at the top of top's list. More often than not, though, I've had to
> > call the ISP and have the power cycled in order to recover the machine.
> >
> > So - the question - Is there a way to configure the kernel (or set up a
> > daemon, or something) to monitor for explosive system load and then
> > take corrective action (even if that corrective action is as brute
> > force as forcing the kernel to bounce the system)?
> >
> > I'm desperate here. I've got new hardware on order, and I've got plans
> > to migrate off of some of the older software running on the machine, but
> > for now I have to deal with the problem. I need some kind of bandaid to
> > put on this.
> >
> > The kernel is:
> >
> > Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386
> > GNU/Linux
> >
> > I appreciate any guidance.
> >
> > Thanks!
> >
> > _______________________________________________
> > svlug mailing list
> > svlug at lists.svlug.org
> > http://lists.svlug.org/lists/listinfo/svlug
> >



-- 
Sargun Dhillon
President
Atarack Communications, Inc.
(925)-202-9485