[Volunteers] "Foo of the Month"

Karsten M. Self kmself at ix.netcom.com
Tue Jul 5 21:58:03 PDT 2005

on Tue, Jul 05, 2005 at 11:52:56AM -0700, Chris Verges (chverges at cisco.com) wrote:
> J. Paul Reed wrote:
> >I will, of course, be happy to help; we basically need someone to just introduce someone else to speak about a particular topic for 10-15 minutes; a new application or a small HOWTO or something like that.
> >  
> >
> Same here, let me know if I can do anything.
> >Suggestions for "Foos of the month" included screen, gaim, ssh, Mozilla/Firefox, gimp, how to compile a kernel, etc.
> >  
> >
> I can give a brief intro to smartmontools, too.  (Allows you to check 
> the health of a hard drive.)

ObAOL on thinking that's a good topic.

I've been adding smartmontools to my systems, but am having sort of
mixed experiences.  Nothing _wrong_, per se, but:

  - For those of us with ancient hardware (say, pre-1999 or
    thereabouts), a lot of drives support only a small subset of SMART
    features.  I'm not sure if they're SMA or ART....

  - I've had a couple of drives fail, or seem to fail...but not really
    show anything under smartmontools tests.   One simply performed
    exceedingly slowly, enough so that the long test never completed.
    I ran it more-or-less overnight, decided that getting data off the
    drive was more important than proving it was bad.  Another started
    throwing CRC errors, so I swapped it out, but it never really showed
    any SMART errors.  I guess pointers on borderline cases might be
    useful.

  - There are some good industry whitepapers on the topic.  Seagate's in
    particular IIRC were clear, concise, and readable (rare
    characteristics), and also addressed the related issues of drive
    mortality over time, and drive life impacts of ambient temperature.
    Short story:  the likelihood that a given drive will fail (if it
    hasn't already) actually decreases over time -- this supports the
    "sudden infant death" theory of electronic components.  And lifetime
    degradation with increasing temperature is surprisingly drastic.
    Keep 'em boxen cool!

The basics of SMART *are* very simple:  a drive that fails the short
test is almost *always* truly bad (low false positive rate), but the
short test may incorrectly pass a bad drive (false negative).  It's a
fast test, though (a few minutes, tops).  The long test is far more
thorough, but takes a good 30 minutes or so to run.  Both can be run
without taking the system down or grossly affecting other system
performance, and most GNU/Linux packages will schedule regular runs of
both tests with results sent to root.
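
For anyone who wants to poke at this by hand, here's a rough sketch of
driving the tests from Python.  The smartctl options used (-t short,
-t long, -H, -l selftest) are standard smartmontools flags; the helper
names and the /dev/hda device path are assumptions for illustration
only, and actually running the tests generally needs root.

```python
import subprocess

# Map a friendly action name to the smartctl options that perform it.
SMARTCTL_ACTIONS = {
    "short": ["-t", "short"],    # start the short self-test
    "long": ["-t", "long"],      # start the long (extended) self-test
    "health": ["-H"],            # print the overall health assessment
    "log": ["-l", "selftest"],   # print the self-test log
}


def smartctl_cmd(action, device):
    """Build the smartctl argument list for one action on one device."""
    if action not in SMARTCTL_ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    return ["smartctl"] + SMARTCTL_ACTIONS[action] + [device]


def run_smartctl(action, device):
    """Run smartctl and return whatever it printed (hypothetical helper)."""
    result = subprocess.run(
        smartctl_cmd(action, device),
        capture_output=True,
        text=True,
        check=False,  # smartctl's exit status is a bitmask, not just 0/1
    )
    return result.stdout
```

Typical use would be run_smartctl("short", "/dev/hda"), waiting a few
minutes, then run_smartctl("log", "/dev/hda") to read the result.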

A related discussion might cover other signs of a sick disk, usually
"drive seek complete error..." type syslog messages:  which ones to
look for, which ones indicate configuration issues, which ones are
*really* bad, and which suggest additional investigation is needed.
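
As a starting point for that kind of triage, a minimal sketch that
scans syslog-style lines for a few strings of the kind the old IDE
driver emits.  The pattern list and the verdict attached to each entry
are assumptions for illustration, not an authoritative table; BadCRC
is filed separately because it is commonly associated with cabling
trouble rather than a dying drive.

```python
import re

# Hypothetical triage table: (pattern, verdict).  First match wins, so
# the more alarming patterns are listed first.
PATTERNS = [
    (re.compile(r"UncorrectableError"), "really-bad"),
    (re.compile(r"BadCRC"), "check-cabling"),
    (re.compile(r"SeekComplete Error"), "investigate"),
]


def triage(line):
    """Return a verdict string for one log line, or None if it looks clean."""
    for pattern, verdict in PATTERNS:
        if pattern.search(line):
            return verdict
    return None


def scan(lines):
    """Return (verdict, line) pairs for every line that matches a pattern."""
    hits = []
    for line in lines:
        verdict = triage(line)
        if verdict is not None:
            hits.append((verdict, line.rstrip()))
    return hits
```

Feed scan() any iterable of lines, for example an open log file; where
the kernel messages end up varies by distribution.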

Oh, and check your cabling ;-)


Karsten M. Self <kmself at ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    The support contract said RHEL 3.0 or better, so I installed Debian
    - Peter Samuelson