[svlug] System Panic Makes My Life Easier

Ivan Sergio Borgonovo mail at webthatworks.it
Fri Jul 29 10:07:00 PDT 2016


On 07/29/2016 02:44 AM, Rick Moen wrote:
> Quoting Ivan Sergio Borgonovo (mail at webthatworks.it):
>
>> Hitting *hard* hardware in the proper way to test it is *hard*.
>> You could get an idea about it looking at what memtest does, and I'm not
>> even sure memtest covers all memory technology.

> Memtest will not always catch bad RAM.

I just chatted with an ex-Amazon engineer, and he said that not too long 
ago (he only recently changed jobs) they were still using memtest.

> The method whose links I posted upthread, which is running iterative
> kernel compiles in a loop with 'make -j N' for sufficiently high values
> of N to exercise _all_ RAM, does.  Details in the links.

That's like asking an elephant to drive your car and then checking 
whether the air conditioning worked. You're not even testing the suspension.

CTCS and its successor were most probably better engineered than 
something that just does some computation on the CPU and sends the 
result to disk, but... supposing your computer hangs during a kernel 
compilation, which part of your PC failed?
And what if the kernel sometimes gets compiled, but in a wrong way? Are 
you going to compare the images?
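
To make concrete what I mean by comparing images, here is an untested 
sketch, assuming a kernel tree in ~/linux and a high -j value as Rick 
suggests. Pinning KBUILD_BUILD_TIMESTAMP is needed because otherwise the 
embedded build date alone makes every image differ:

  # Pin the build timestamp so successive images are comparable.
  export KBUILD_BUILD_TIMESTAMP="burn-in"
  cd ~/linux
  prev=""
  for i in $(seq 1 20); do
      make clean >/dev/null
      # A gcc segfault or internal compiler error here usually means bad RAM.
      make -j 32 vmlinux || { echo "build $i failed"; break; }
      sum=$(md5sum vmlinux | awk '{print $1}')
      # A differing checksum means the kernel got compiled "in a wrong way".
      [ -n "$prev" ] && [ "$sum" != "$prev" ] && echo "image differs on run $i"
      prev="$sum"
  done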

Indeed, CTCS has a specific memory test.
They even have specific CPU tests for different CPUs, an MMX test... so 
they didn't think compiling a kernel was exhaustive.

gcc may actually use MMX instructions... but the implementation may 
differ from one CPU to another. I wonder whether MMX instructions are 
actually executed by dedicated hardware or just decoded and dispatched 
in parallel to different ALUs, let alone the other SIMD instructions 
that may use the FPU or whatever else they put in modern CPUs.

>> If I had to test hardware I would, as Rick suggested boot from a live
>> distro, possibly one specialized in testing hardware... and well that's
>> exactly what you find if you google it ;)
>>
>> http://www.inquisitor.ru/about/

> Please note that VA-CTCS was what VA Linux Systems, Inc. used to
> torture-test hardware.  It was used for multiple days of burn-in per
> unit at the factory, and it was used for multiple days of burn-in on all
> returned units received under RMA.

Testing hardware is *hard*, and CTCS and its successors definitely 
help. Just *choosing* the tools you need to test the different parts is 
a *lot* of work. Putting them together in one place so they can easily 
be used, and building a test suite around them, is more work still.
CTCS2 is definitely a good place to start, but its last update dates 
back to 2008, and there may be something better that glues together 
more modern specialized tools.

Most probably Inquisitor shares some of its tools with CTCS.

> That's why it's my go-to for general hardware stress-testing to this
> day, although I'd go straight to the 'make -j $BIGNUM' iterative kernel
> compiling for RAM-testing and skip memtest.

Compiling the kernel as a diagnostic procedure answers two use cases:
- is my hardware or my software fucked up?
- am I shipping something that works?

Compiling the kernel is not going to tell you whether your RAM or your 
CPU is broken.

But even answering that broader question comes with caveats.
If I haven't been able to relate crashes to the use of any particular 
program, the "software" problem may lie in the kernel I'm running, or 
in other running daemons and the libraries they depend on, including 
the stdlib, on which a lot of the kernel compilation itself depends.
So a live image is crucial to rule out that your software is fucked up.

Compiling the kernel is a shortcut if you mainly want to exclude a 
hardware problem because you suspect it is actually a software problem.
If you suspect it is a hardware problem and you plan to fix it... at 
the end of the day you'll have to know whether it comes down to a 
replaceable component: whether it is a RAM, a CPU, or a motherboard issue.
Compiling the kernel is not going to help much with that.
You'll need specific tools, for example the ones below.
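
A couple of examples of what I mean by specific tools (assuming 
memtester and smartmontools are installed; /dev/sda is just a 
placeholder):

  # Userspace RAM test: lock 4GB and pattern-test it for 3 passes.
  memtester 4G 3
  # Kick off a long SMART self-test on a disk...
  smartctl -t long /dev/sda
  # ...and read back the results once it's done.
  smartctl -a /dev/sda

Each of these points at one replaceable component instead of at the 
machine as a whole.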

And... it may not even be a [CPU, RAM, disk] problem. Field-tested!

Maybe it is a video board problem, and you could try to ssh into the 
box. I have no idea how resilient the kernel is to video board hardware 
problems.

Most of the time it is not even economical to find out.
In fact, my friend told me that if something fails and it costs less 
than $x or is older than $y, they simply throw it away; otherwise they 
have testing facilities where they run specific tools for disk 
controllers, memory (memtest), disks, and a few other things.
His guesstimate of $x was around $1500.
If you're using a high-end CPU, a lot of RAM, and a decent disk 
controller, that threshold is pretty low. Furthermore, decent modern 
servers have ECC and dedicated hardware that makes it easier to 
diagnose what is broken.
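
(For instance, on Linux the EDAC subsystem exposes the ECC error 
counters; a sketch, assuming your kernel has an EDAC driver for your 
memory controller and that mc0 is the controller in question:

  # Corrected and uncorrected ECC error counts for memory controller 0.
  cat /sys/devices/system/edac/mc/mc0/ce_count
  cat /sys/devices/system/edac/mc/mc0/ue_count
  # Or, with edac-utils installed:
  edac-util -v

That way corrected errors show up in the counters before anything 
actually crashes.)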

Think about workstations in a workplace with some kind of on-site 
warranty service... I bet they are just going to pull out the whole 
mobo and replace it.

If you're not using expensive hardware and you're not on minimum wage, 
then if you can't find the problem quickly and reliably, you'd probably 
better throw the whole thing away. If you're working with expensive 
hardware, you're making money with it and downtime has a cost.

What's left are gamers wrecking their hardware by overclocking, and 
geeks who fall in love with their 10-year-old box ;)

Diagnosing hardware is something better left to the vendor.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it http://www.borgonovo.net



