[svlug] System Panic Makes My Life Easier

Rick Moen rick at linuxmafia.com
Thu Jul 28 01:42:33 PDT 2016

Quoting Joseph Brenner (doomvox at gmail.com):

> I've been puzzling over a sick linux box for a little while lately.
> It's a dual-Opteron box (over ten years old now? wow...) that I've been
> upgrading off-and-on (bigger disks, a new video card...),
> but after a recent round of software upgrades it had been
> acting incredibly flaky, with uptimes of only a few days.  It would
> totally lock-up and require a hard reboot... couldn't even ssh into it.
> I was trying to get an idea of what software change could've caused
> this problem-- the list of possibilities was long-- but of late the
> problem has gotten far worse, and it throws system panics and won't
> boot at all, so it's almost certainly a hardware problem. The cpu fan
> has been in bad shape for some time... it's got some cooling now, but
> I can easily imagine its lifespan was shortened by overheating in the
> past.
> But, I've been wondering about how one would narrow down one's
> suspicions about flaky software, and I thought I would ask how one
> would go about it, even though it's just academic now.  Look through
> system logs?  Play with dtrace?

I kind of like the problem scenario you've just posed, because one could
spend a lifetime on it.  ;->

Well, actually, let's get serious.  First thing would be to attempt to
bifurcate the problem space:  Test to determine - might it be software,
or can that possibility be eliminated?

Get yourself your favourite live-CD distro.  Put it onto your choice of
CD/DVD/Blu-Ray or a USB thumb drive.  Boot it.  I personally would use
one of the (several) DE flavours of Siduction for this purpose.  I'd
probably go for the Fluxbox flavour, but you can indulge what you like:

Booting that, you are now running a full-fledged Linux distribution
entirely from the boot media and a big RAMdisk.  You are -not- using any
of the code on the installed system.  Now... let's see...  You said you
were getting lockups and/or 'system panics' (kernel panics?) every few
days.  So, _maybe_ just having it sitting there running is enough.  Or
maybe not.  Maybe you need to start a few services, load up some RAM.
If you want to torture-test the hardware while running the live-CD load,
you have to get a little creative and fire up the software brass band.  
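
As a crude first pass at "loading things up", something like the
following sketch keeps every core busy for a fixed time.  It is only
an illustration: burn_cpu is a made-up helper name, timeout and nproc
are assumed to come from GNU coreutils on the live image, and it
exercises the CPUs (and therefore cooling) only, not RAM or disk.

```shell
#!/bin/sh
# Crude CPU load generator: one busy loop per worker, each stopped
# after a time limit.  burn_cpu is a hypothetical helper name.
# Assumes "timeout" from GNU coreutils is on the PATH.
burn_cpu() {
    secs=$1       # how long each worker should spin, in seconds
    workers=$2    # how many busy loops to start
    i=0
    while [ "$i" -lt "$workers" ]; do
        # Each worker spins in an empty loop until timeout kills it.
        timeout "$secs" sh -c 'while :; do :; done' &
        i=$((i + 1))
    done
    wait          # block until every worker has been stopped
    echo "ran $workers workers for $secs seconds"
}

# Example (not run here): one worker per core for an hour.
#   burn_cpu 3600 "$(nproc)"
```

A box with marginal cooling may fall over under this sort of load well
before "a few days" are up, which is itself a useful data point.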

With a _little_ effort, the VA Linux Systems Cerberus Test
Control System (va-ctcs, or CTCS, or Cerberus) can be made to run from a
system running on a RAMdisk.  CTCS normally expects the ability to write
to local disk, so that's pretty much the only gotcha you have to work
around.  CTCS _seriously_ torture-tests any hardware set, doing a great
many simultaneous tasks including iterative kernel compiles.  Stuff at:

'Cerberus FAQ' at http://linuxmafia.com/kb/Hardware/ .
Link is to the same SourceForge repo where CTCS is semi-maintained.
(Looks like the last code check-in was 11 years ago, but one person's
'unmaintained' is sometimes another person's 'feature-complete and

After a few days, then, to quote an old boss's expression, 'What do we
know?'  We know either that the system fell over, in which case it
definitely wasn't your software and is pretty definitively your
hardware, or that the system didn't fall over, in which case it's almost
certainly your software.

I am curious about the exact symptom.  You say in different parts of the
narrative 'hard lock-up', 'totally lock-up', and 'system panics'.  Not a
complaint, but do you have any more-specific indicators or description
you could give?  That could be any of several system-not-responding
scenarios.  Here are some tips to narrow that down:

Imagine a system that normally runs X11 (which based on mention of
Firefox seems likely the case, here).  Let's say you are doing
that and you either do or do not normally run a screensaver / X11
locking application.

If you do, you would probably want to disable your screensaver /
screenlock-thingie so as to be able to gather data better.  You're
running that for several days and then... what specifically?  Is there
no longer even the ability to move the mouse pointer?  Or does the mouse
pointer move around but you cannot get it to activate menus and do
things to screen objects?  Does Ctrl-Alt-F1 no longer transport you to
the text console on console #1?  Knowing the answer to those questions
helps determine how much of your system is non-responding.
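
The same "how much is still alive?" question can be asked from a second
machine on the LAN.  The sketch below is an assumption-laden
illustration, not a diagnosis tool: probe_host and classify_liveness
are made-up names, ping's -c/-W options are those of Linux iputils,
/dev/tcp is a bash feature, and sshd is assumed to listen on port 22.
It reads the SSH banner rather than merely connecting, because the
kernel can complete a TCP handshake on a listening port even when the
daemon itself is wedged.

```shell
#!/bin/sh
# Hypothetical helpers for probing a possibly hung box over the network.

# classify_liveness PING_STATUS BANNER_STATUS
# Each argument is an exit status: 0 means that probe succeeded.
# The labels are hints, not proof (e.g. sshd may simply not be running).
classify_liveness() {
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo "alive"
    elif [ "$1" -eq 0 ]; then
        echo "kernel-answers-userland-suspect"
    elif [ "$2" -eq 0 ]; then
        echo "icmp-filtered"
    else
        echo "dead-or-unreachable"
    fi
}

# probe_host HOST: ping once, then try to read sshd's greeting line.
probe_host() {
    host=$1
    ping -c 1 -W 2 "$host" >/dev/null 2>&1
    ping_status=$?
    # Inside bash -c, $0 is the host name passed as the next argument.
    banner=$(timeout 5 bash -c \
        'exec 3<>"/dev/tcp/$0/22" && IFS= read -r line <&3 && printf "%s" "$line"' \
        "$host" 2>/dev/null)
    case $banner in
        SSH-*) banner_status=0 ;;   # sshd sent its version greeting
        *)     banner_status=1 ;;
    esac
    classify_liveness "$ping_status" "$banner_status"
}

# Example (not run here; the host name is a placeholder):
#   probe_host sickbox.example.com
```

If ping still answers but no banner arrives, the kernel is at least
servicing the network while userland is not keeping up; no answer at
all is consistent with a hard lock-up or panic (or just a dead network
link).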

Or imagine a server-type system or some other system that normally does
_not_ run X11.  You connect a monitor to it (so you have a functional
local console).  You log in and leave yourself logged in.  Also, you do
this at a bash prompt:

   setterm -blank 0

This disables the console screen blanker so that you can see what most
recently happened on the console (if anything) if/when the system
freezes, or whatever the heck it does.  Again, you wait a few days.  
After whatever-it-is occurs, you look and see if there are clues.  You
also of course try to type some new shell commands; see if anything's

Short of that, looking in logfiles is always a really good place to
start looking for trouble symptoms, as processes in trouble have a strong
tendency to mutter in logfiles.
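
A minimal sketch of that sort of log trawl, assuming GNU grep: scan_log
is a made-up helper name, and the pattern list is just a starting set
of common kernel complaint strings, not an exhaustive one.  Which file
holds kernel messages depends on the distribution (/var/log/kern.log
and /var/log/messages are typical candidates).

```shell
#!/bin/sh
# scan_log [FILE...]: print lines (with line numbers) that match some
# common kernel trouble signatures.  With no FILE, reads standard input.
# scan_log is a hypothetical helper; the patterns are a starting point.
scan_log() {
    grep -n -i -E \
        'kernel panic|oops|machine check|mce:|out of memory|soft lockup|blocked for more than|call trace|i/o error' \
        "$@"
}

# Examples (not run here):
#   scan_log /var/log/kern.log
#   journalctl -k -b -1 | scan_log   # previous boot, if the journal is persistent
```

Bear in mind that a hard lock-up often leaves nothing in the logs at
all, since the last messages may never reach the disk; an empty result
does not clear the hardware.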

> In the old days, my first guess actually would've been that linux
> itself is rock solid, but I'm afraid linux has seemed increasingly
> flaky of late.  I've seen this hard lock-up symptom on a number of
> thinkpads, particularly when running a media player like vlc or totem.

Might be bad RAM, you know.  About that:

That's my tutorial of how to _thoroughly_ test RAM.

'Hope that helps!  Good meaty problem, thanks!

Cheers,                                "Why struggle to open a door between us,
Rick Moen                              when the whole wall is an illusion?"
rick at linuxmafia.com                                                     -- Rumi
McQ! (4x80)
