[svlug] System Panic Makes My Life Easier

Sarah Newman newmans at sonic.net
Fri Jul 29 11:33:36 PDT 2016


On 07/29/2016 11:09 AM, Rick Moen wrote:
> Quoting Ivan Sergio Borgonovo (mail at webthatworks.it):
> 
>> That's like to ask an elephant to drive your car and then check if the 
>> air conditioning worked. You ain't even testing the suspensions.
> 
> I actually need to walk back and apologise for part of what I said,
> because without entirely meaning to, I claimed that iterative kernel
> compiles is a _general_ tester for RAM.  As you point out, it's
> probably not.  And memtest is.  But what I did _is_ a truly excellent
> tool for (at least) many specific situations to diagnose a possible RAM
> problem where memtest cannot.
> 
> It's been so many years (ten) since the situation arose when I wrote
> those posts that I forgot the specifics, but I've refreshed my memory
> and can clarify:
> 
> I had set up in my dining room a spare VA Linux Systems model 2230 2U
> rackmount unit and was preparing to migrate my Internet server to it.
> But something kept worrying me:  occasional spontaneous reboots -- and
> also at one point a 'NMI: Dazed and confused but struggling to continue' 
> kernel message that correlates strongly with either bad RAM or a bad RAM
> socket.  This was four ECC sticks, _and_ ECC was enabled in the BIOS,
> _and_ all four sticks had passed at least a day of memtest86.  
> 
> Yet, it turned out half the RAM _was_ the cause of the reboots, and
> memtest did not find it -- as I shall explain.
> 

Another possible reason the kernel compile found problems and memtest didn't: compiling the kernel probably generated a lot more heat than
memtest, and as already discussed, hardware can behave differently at different temperatures.

Incidentally, a hair dryer and an infrared thermometer make a useful hardware test tool if you don't have a thermal chamber handy. :)
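
If you want to see how warm the box gets under a compile versus under memtest-style idle load, something like the following might help. This is only a rough sketch: it assumes a Linux system that exposes sensors under /sys/class/thermal (not every board does, and the zones may not correspond to the DIMMs at all), and the function name read_zone_temps is made up for this example.

```python
import time
from pathlib import Path


def read_zone_temps(root="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for each readable thermal zone.

    The kernel reports each zone's temperature in millidegrees Celsius
    in a file named 'temp'. Zones that cannot be read are skipped.
    """
    temps = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # sensor missing, unreadable, or returned garbage
        temps[zone.name] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    # Print a reading every 10 seconds; run this in one terminal while
    # the kernel compile (or other load) runs in another.
    while True:
        stamp = time.strftime("%H:%M:%S")
        print(stamp, read_zone_temps())
        time.sleep(10)
```

Comparing the numbers from an idle run against a compile run gives at least a crude idea of whether temperature is a plausible variable.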

A few other tidbits about memtest86+:

To the best of my knowledge, older versions of memtest86+ have a 64G limit. I think version 5 is required to test beyond that.

Also, memtest86+ has never told us about single-bit ECC errors, and the BIOS/IPMI logs might not either. The EDAC module in Linux warns about
correctable ECC errors, but you need to have a new enough kernel. Patrol scrubbing should be enabled in the BIOS so that such errors are found and reported earlier:
https://en.wikipedia.org/wiki/Memory_scrubbing#Variants
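
For checking the EDAC counters by hand, here is a minimal sketch. It assumes the EDAC driver for your memory controller is loaded and exposes the usual per-controller ce_count (correctable) and ue_count (uncorrectable) files under /sys/devices/system/edac/mc; the function name read_edac_counts is just illustrative.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} error counts.

    Each memory controller directory (mc0, mc1, ...) is expected to hold
    'ce_count' and 'ue_count' files. Controllers whose counters cannot
    be read are skipped.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter file missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
```

A correctable count that keeps climbing between runs is the kind of early warning that memtest86+ would not have given us.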

--Sarah


