[svlug] Protecting and recovering from high system load?

Erich Proudfit eproudfit at ICSolutions.com
Mon Aug 28 17:03:35 PDT 2006


I would be very interested in hearing more about this issue.  

We have a few servers that are experiencing similar behavior.  We might
be having a completely different problem, but the symptoms are similar
enough.

So far, we have seen some very strange behavior.  We've seen a couple
thousand pam(sshd) session open/close events in a short time span (about
4 per second for roughly 20 minutes) before the system stops responding;
the date command not updating the date; tail -f /var/log/messages
freezing the ssh session it runs in; killall <command> not killing any
of the processes associated with <command>; ssh connections failing or
hanging while ping works fine, as if nothing were wrong; and a few
others.

We are running FC3 with the 2.6.9-1.667smp kernel.  This has occurred on
a couple of different hardware platforms, all of them Intel-based.  We
are limited to this specific kernel because of extra kernel modules
needed for some hardware that's added to these systems.

So, if anyone has additional insight into the original poster's problem,
I'm confident that it will help us as well.

Thank you.



> Sent: Monday, August 28, 2006 1:28 PM
> To: svlug at lists.svlug.org
> Subject: [svlug] Protecting and recovering from high system load?
> 
> 
> My colo server box over the last several months has periods where it
> randomly explodes in load. It goes from around 0.75 to well over 200
> very quickly. When this happens it becomes, obviously, almost entirely
> unresponsive (SSH login attempts go nowhere, HTTP requests go
> unanswered, etc.), but pings continue to be answered with no slowdown
> or problem.
> 
> I have sometimes been fortunate enough to be logged into the box when
> this happens and been able to immediately begin kill -9'ing the PIDs
> that seem to be on top of top. More often than not, though, I've had to
> call the ISP and have the power cycled in order to recover the machine.
> 
> So - the question - Is there a way to configure the kernel (or set up a
> daemon, or something) that will monitor for explosive system load and
> then take corrective action (even if that corrective action is as brute
> force as forcing the kernel to bounce the system)?
> 
> I'm desperate here. I've got new hardware on order, and I've got plans
> to migrate off of some of the older software running on the machine, but
> for now I have to deal with the problem. I need some kind of band aid to
> put on this.
> 
> The kernel is:
> 
> Linux foo.bar 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 athlon i386 GNU/Linux
> 
> I appreciate any guidance.
> 
> Thanks!
> 
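
The kind of daemon the original poster asks about (watch for explosive
load, then take corrective action) could be sketched roughly as below.
This is only a minimal illustration, not something taken from either
system: the names should_act and watch, the threshold of 50, the
three-sample rule and the 10 second interval are all assumptions, and
the corrective action is left as a callable you supply.  It assumes a
Python with os.getloadavg(), which reports the same 1-minute figure as
/proc/loadavg.

```python
import os
import time


def should_act(samples, threshold, required):
    """Return True when the last `required` samples all exceed `threshold`."""
    if required <= 0 or len(samples) < required:
        return False
    return all(load > threshold for load in samples[-required:])


def watch(threshold=50.0, required=3, interval=10.0, action=None,
          read_load=lambda: os.getloadavg()[0], sleep=time.sleep,
          max_polls=None):
    """Poll the 1-minute load average; call `action` once it stays high.

    All defaults are illustrative assumptions, not tuned values.
    Returns True if the action fired, False if max_polls ran out first.
    """
    samples = []
    polls = 0
    while max_polls is None or polls < max_polls:
        samples.append(read_load())
        samples = samples[-required:]  # keep only the recent window
        if should_act(samples, threshold, required):
            if action is not None:
                # Hypothetical hook: restart a service, or as a last
                # resort trigger a reboot.
                action(samples[-1])
            return True
        polls += 1
        sleep(interval)
    return False
```

Requiring several consecutive high samples keeps a single brief spike
from triggering the action.  One caveat: at a load of 200 a user-space
poller like this may itself be starved of CPU, so it is a band aid at
best and may need to run at elevated priority to be of any use.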



