[svlug] System Panic Makes My Life Easier
Ivan Sergio Borgonovo
mail at webthatworks.it
Fri Jul 29 10:07:00 PDT 2016
On 07/29/2016 02:44 AM, Rick Moen wrote:
> Quoting Ivan Sergio Borgonovo (mail at webthatworks.it):
>> Hitting *hard* hardware in the proper way to test it is *hard*.
>> You could get an idea about it looking at what memtest does, and I'm not
>> even sure memtest covers all memory technology.
> Memtest will not always catch bad RAM.
I just chatted with an ex-Amazon engineer and he said that not too long
ago (he only recently changed jobs) they were still using memtest.
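For completeness, there is also a userspace cousin of memtest86: memtester. A minimal sketch (the size and pass count are illustrative, not recommendations):

```shell
# Sketch: userspace RAM check with memtester. It mlock()s a region and
# runs bit-pattern tests on it, so it can only reach memory the kernel
# and other processes are not holding; it complements memtest86, which
# runs on bare metal, rather than replacing it. Run as root so mlock()
# succeeds. 1024M and 3 passes are illustrative values.
memtester 1024M 3 && echo "region passed" || echo "errors found"
```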
> The method whose links I posted upthread, which is running iterative
> kernel compiles in a loop with 'make -j N' for sufficiently high values
> of N to exercise _all_ RAM, does. Details in the links.
That's like asking an elephant to drive your car and then checking
whether the air conditioning worked. You aren't even testing the
suspension.
CTCS and its successor were most probably better engineered than
something that does some computation on the CPU and sends it to the
disk, but... supposing your computer hangs during a kernel compilation,
which part of your PC failed?
And what if the kernel sometimes gets compiled, but incorrectly? Are
you going to compare the images?
Indeed, CTCS has a specific memory test.
It even has specific CPU tests for different CPUs, and an MMX test...
so its authors didn't think compiling the kernel was exhaustive.
gcc may actually use MMX instructions... but the implementation may
differ across CPUs. I wonder whether MMX instructions are actually
executed by dedicated hardware or just decoded and dispatched in
parallel to different ALUs. Let alone other SIMD instructions that may
use the FPU or whatever else they put in modern CPUs.
>> If I had to test hardware I would, as Rick suggested boot from a live
>> distro, possibly one specialized in testing hardware... and well that's
>> exactly what you find if you google it ;)
> Please note that VA-CTCS was what VA Linux Systems, Inc. used to
> torture-test hardware. It was used for multiple days of burn-in per
> unit at the factory, and it was used for multiple days of burn-in on all
> returned units received under RMA.
Testing hardware is *hard*, and CTCS and its successors definitely
help. Just *choosing* the tools you need to test different parts is a
*lot* of work. Putting them together in the same place in a way that
they can easily be used, and building a test suite around them, is
more work still.
CTCS2 is definitely a good place to start, but its last update dates
back to 2008, and possibly there is something better that glues
together more modern, specific tools.
Most probably Inquisitor shares some of its tools with CTCS.
> That's why it's my go-to for general hardware stress-testing to this
> day, although I'd go straight to the 'make -j $BIGNUM' iterative kernel
> compiling for RAM-testing and skip memtest.
Compiling the kernel as a diagnostic procedure addresses two use cases:
- is my hardware or my software fucked up?
- am I shipping something that works?
Compiling the kernel is not going to tell you whether it's your RAM or
your CPU that is broken.
But even answering that broader question comes with caveats.
If I haven't been able to relate crashes to the use of any particular
program, the "software" problem may be related to the kernel I'm
running, or to other running daemons and the libraries they depend on,
including the stdlib, on which much of a kernel compilation depends.
So a live image is crucial to rule out that your software is fucked up.
Compiling the kernel is a shortcut if you mainly want to exclude a
hardware problem because you suspect it is actually a software problem.
If you suspect it is a hardware problem and you plan to fix it... at
the end of the day you'll have to know whether it is related to any
replaceable component: whether it is a RAM, a CPU, or a motherboard
issue. And compiling the kernel is not going to help that much.
You'll need specific tools.
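As one example of what "specific tools" might look like today, stress-ng (an actively maintained stressor suite, not mentioned in the original thread, so take this as a suggestion rather than an endorsement) lets you target one component at a time, and its `--verify` mode makes many stressors check their own results, which is exactly the point: a tool that tells you *which* part failed:

```shell
# Sketch: per-component stress tests with stress-ng. Durations, worker
# counts, and sizes are illustrative.
stress-ng --cpu 0 --cpu-method matrixprod --verify --timeout 10m  # all CPUs
stress-ng --vm 4 --vm-bytes 80% --verify --timeout 10m            # RAM
stress-ng --hdd 2 --hdd-bytes 1G --verify --timeout 10m           # disk I/O
```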
And... it may not even be a [CPU, RAM, disk] problem (field-tested!).
Maybe it is a video board problem, and you may try to ssh into the box.
I've no idea how resilient the kernel is to video board failures.
Most of the time it is not even economical to find out.
In fact my friend told me that if something fails and it costs less
than $x or is older than $y, they simply throw it away; otherwise they
have testing facilities where they run specific tools for disk
controllers, memory (memtest), disks, and a few other things.
And his guesstimate of $x was around $1500.
If you're using a high-end CPU, a lot of RAM, and a decent disk
controller, that threshold is pretty low. Furthermore, decent modern
servers have ECC and specific hardware that makes it easier to
diagnose what is broken.
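On Linux, one concrete form this takes is the EDAC subsystem: on a server with ECC RAM and a supported chipset, the kernel keeps per-memory-controller error counters, so you can often spot a failing DIMM without any stress testing at all. A sketch (the sysfs paths exist only when an EDAC driver is loaded for your hardware):

```shell
# Sketch: read corrected (CE) and uncorrected (UE) ECC error counts
# from the kernel's EDAC sysfs interface. A steadily climbing CE count
# on one controller usually points at a specific DIMM.
for mc in /sys/devices/system/edac/mc/mc*; do
    [ -d "$mc" ] || continue    # silently skip if EDAC is not available
    echo "$mc: CE=$(cat "$mc"/ce_count) UE=$(cat "$mc"/ue_count)"
done
```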
Let's think about workstations in a workplace with some kind of
on-site warranty service... I bet they are going to take the whole
mobo out and replace it.
If you're not using expensive hardware and you're not on minimum wage,
then if you can't quickly and reliably find the problem, you'd
probably better throw everything away. If you're working with
expensive hardware, you're making money on it, and downtime has a
cost.
What's left are gamers screwing up their hardware by over-clocking,
and geeks who fall in love with their 10-year-old boxes ;)
Diagnosing hardware is something best left to the vendor.
Ivan Sergio Borgonovo