[svlug] Protecting and recovering from high system load?

Tin Le tin at le.org
Mon Aug 28 23:26:22 PDT 2006


You must have a nice job, with lots of spare time on your hands... ;-)

I certainly appreciate the offer, really, but I fixed it years ago.  I
mentioned the problem as an example of a case where one has too many
more important things on one's plate, there's a pesky server that won't
stay up, and one simply doesn't have the time to fix it properly.  In my
case, it would have taken a minimum of 3 days to schedule physical
access to where the server was located.  3 days of downtime was not an
option.  So fallback-reboot saved my bacon until I could get there
(after 3 days) and fix it properly.

And let's not get into distro wars... it's so yesterday.  There's a
distro for everyone.  Personally, I don't think any of them is perfect. 
You use what you like, and I'll use what I like. :-)

Tin Le
-- 
"Never continue in a job you don't enjoy. If you're happy in what you're
doing, you'll like yourself, you'll have inner peace. And if you have
that, along with physical health, you will have had more success than you
could possibly have imagined." - Johnny Carson (1925-2005)


> Try a server distro, like Debian. I can personally look at the boxen
> if you would like; please send me an e-mail with your direct contact
> info. I would think that you could figure out which process is
> crashing the box, then set limits on it. Also, give ssh, bash, etc.
> better nice values.
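
On the limits and nice values idea, here is a minimal sketch in Python 3, not a tested recipe. It takes the inverse approach to the one suggested: instead of raising the priority of ssh and bash (which needs root), it starts the suspect program at a lower priority and with a CPU-time limit. The names make_limiter and run_limited and the default numbers are made up for illustration; os.nice, resource.setrlimit and subprocess.run are standard library calls.

```python
import os
import resource
import subprocess


def make_limiter(nice_increment, cpu_seconds):
    """Build a function that runs in the child process just before exec."""
    def limiter():
        # Raise the niceness (lower the priority) of the child.
        os.nice(nice_increment)
        # Set the soft CPU-time limit; the kernel sends SIGXCPU to the
        # process once it has used this many seconds of CPU time.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        if hard == resource.RLIM_INFINITY:
            soft = cpu_seconds
        else:
            soft = min(cpu_seconds, hard)  # the soft limit may not exceed the hard one
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
    return limiter


def run_limited(argv, nice_increment=10, cpu_seconds=60):
    """Run argv with lowered priority and a CPU-time cap (illustrative values)."""
    return subprocess.run(
        argv,
        preexec_fn=make_limiter(nice_increment, cpu_seconds),
        capture_output=True,
        text=True,
    )
```

Something like run_limited(["/path/to/suspect-daemon"]) would then start the program under those limits; the path is a placeholder. A CPU-time cap only helps if the runaway process is burning CPU, which the thread does not establish.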
>
> On 8/28/06, Tin Le <tin at le.org> wrote:
>>
>> Yeah, sometimes sh*t happens and you just don't have the time to chase
>> down a root cause... especially when engineering is breathing down
>> your neck about "their" network :-), a veep wants to know why his
>> email is taking too long, etc....
>>
>> I've used fallback-reboot successfully on 2.4 and 2.6 kernels.  I used to
>> have a problem server in a remote colo that I couldn't get to easily
>> (e.g. it took literally days to schedule a visit).  Similar problem to
>> yours... only a hard reset would fix it.
>>
>> I've been using fallback-reboot for a while.  Works great for me.
>>
>> http://stromberg.dnsalias.org/~strombrg/fallback-reboot/
>>
>> Read the install doc _VERY_ carefully, as you could leave your server
>> open for anyone to bounce.  I keep the key on my USB fob...
>>
>> Once you have time, I'd suggest following the others' advice and
>> collecting logs around the times when it seems to have problems most
>> often. The more logs you have, the better the chance of finding the
>> root cause.
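
For the log collection, a minimal sketch in Python 3 of one way to do it: append a timestamped record of the load averages plus the busiest processes to a file, e.g. from cron. The function names, the log path and the choice of ps columns are assumptions for illustration, and the --sort option assumes the procps version of ps.

```python
import subprocess
import time


def format_snapshot(now, loadavg_text, ps_text, top_n=10):
    """Build one log record: a timestamped load line, then the ps header
    and the first top_n process lines."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    ps_lines = ps_text.splitlines()[: top_n + 1]  # header line + top_n rows
    head = "=== %s load: %s" % (stamp, loadavg_text.strip())
    return "\n".join([head] + ps_lines) + "\n"


def take_snapshot(log_path):
    """Append one snapshot to log_path (the path is the caller's choice)."""
    with open("/proc/loadavg") as f:
        loadavg_text = f.read()
    # Sort by CPU usage, highest first; the stat column shows processes
    # stuck in uninterruptible sleep (state D), which also drive load up.
    ps_text = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,stat,args", "--sort=-pcpu"],
        capture_output=True,
        text=True,
    ).stdout
    with open(log_path, "a") as log:
        log.write(format_snapshot(time.time(), loadavg_text, ps_text))
```

Calling take_snapshot("/var/log/load-snapshots.log") every minute would leave a trail leading up to each incident; that file name is only an example.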
>>
>> Cheers,
>> Tin Le
>> --
>> "Never continue in a job you don't enjoy. If you're happy in what you're
>> doing, you'll like yourself, you'll have inner peace. And if you have
>> that, along with physical health, you will have had more success than
>> you could possibly have imagined." - Johnny Carson (1925-2005)
>>
>> > My colo server box over the last several months has had periods where
>> > it randomly explodes in load. It goes from around 0.75 to well over
>> > 200 very quickly. When this happens it becomes, obviously, almost
>> > entirely unresponsive (SSH login attempts go nowhere, HTTP requests
>> > go unanswered, etc.), but pings continue to be answered with no
>> > slowdown or problem.
>> >
>> > I have sometimes been fortunate enough to be logged into the box
>> > when this happens and been able to immediately begin to kill -9 the
>> > PIDs that seem to be on top of top. More often than not, though, I've
>> > had to call the ISP and have the power cycled in order to recover the
>> > machine.
>> >
>> > So - the question - Is there a way to configure the kernel (or set up
>> > a daemon, or something) that will monitor for explosive system load
>> > and then take corrective action (even if that corrective action is
>> > as brute force as forcing the kernel to bounce the system)?
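
To make the "set up a daemon" option concrete, here is a minimal sketch in Python 3 of a load watcher, offered under assumptions rather than as a known-good fix. It reads /proc/loadavg and trips only after the 1-minute load stays above a threshold for several samples in a row, so a brief spike does not trigger anything. The names, the threshold of 50 and the 10-second interval are invented for illustration, and the corrective action is deliberately left as a placeholder.

```python
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


class LoadWatch:
    """Trips once the load stays above `threshold` for `needed` samples in a row."""

    def __init__(self, threshold, needed):
        self.threshold = threshold
        self.needed = needed
        self.streak = 0

    def update(self, load1):
        # Count consecutive samples over the threshold; any sample at or
        # below it resets the count.
        if load1 > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.needed


def watch_forever(threshold=50.0, needed=3, interval=10):
    """Poll /proc/loadavg forever; all defaults are illustrative."""
    watch = LoadWatch(threshold, needed)
    while True:
        with open("/proc/loadavg") as f:
            load1, _, _ = parse_loadavg(f.read())
        if watch.update(load1):
            # Placeholder only: whatever corrective action suits the box
            # (killing a known offender, triggering a reboot) would go here.
            print("load %.2f above %.1f for %d samples in a row"
                  % (load1, threshold, needed))
            watch.streak = 0
        time.sleep(interval)
```

One caveat: at a load of 200 a userspace watcher may itself be starved of CPU, so the threshold would have to sit well below the point where the box stops responding.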
>> >
>> > I'm desperate here. I've got new hardware on order, and I've got
>> > plans to migrate off of some of the older software running on the
>> > machine, but for now I have to deal with the problem. I need some
>> > kind of bandaid to put on this.
>> >
>> > The kernel is:
>> >
>> > Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386 GNU/Linux
>> >
>> > I appreciate any guidance.
>> >
>> > Thanks!
>> >
>> > _______________________________________________
>> > svlug mailing list
>> > svlug at lists.svlug.org
>> > http://lists.svlug.org/lists/listinfo/svlug
>> >
>>
>
> --
> Sargun Dhillon
> President
> Atarack Communications, Inc.
> (925)-202-9485
>