[svlug] System Panic Makes My Life Easier
newmans at sonic.net
Fri Jul 29 11:33:36 PDT 2016
On 07/29/2016 11:09 AM, Rick Moen wrote:
> Quoting Ivan Sergio Borgonovo (mail at webthatworks.it):
>> That's like to ask an elephant to drive your car and then check if the
>> air conditioning worked. You ain't even testing the suspensions.
> I actually need to walk back and apologise for part of what I said,
> because without entirely meaning to, I claimed that iterative kernel
> compiles is a _general_ tester for RAM. As you point out, it's
> probably not. And memtest is. But what I did _is_ a truly excellent
> tool for (at least) many specific situations to diagnose a possible RAM
> problem where memtest cannot.
> It's been so many years (ten) since the situation arose when I wrote
> those posts that I forgot the specifics, but I've refreshed my memory
> and can clarify:
> I had set up in my dining room a spare VA Linux Systems model 2230 2U
> rackmount unit and was preparing to migrate my Internet server to it.
> But something kept worrying me: occasional spontaneous reboots -- and
> also at one point a 'NMI: Dazed and confused but struggling to continue'
> kernel message that correlates strongly with either bad RAM or a bad RAM
> socket. This was four ECC sticks, _and_ ECC was enabled in the BIOS,
> _and_ all four sticks had passed at least a day of memtest86.
> Yet, it turned out half the RAM _was_ the cause of the reboots, and
> memtest did not find it -- as I shall explain.
Another reason the kernel compile may have found problems that memtest did not: compiling the kernel probably generated a lot more heat than
memtest, and as already discussed, hardware can behave differently at different temperatures.
Incidentally, a hair dryer and an infrared thermometer make a useful hardware test kit if you don't have a thermal chamber handy. :)
A few other tidbits about memtest86+:
To the best of my knowledge, older versions of memtest86+ have a 64 GB limit. I think version 5 is required to test beyond that.
Also, memtest86+ has never told us about single-bit ECC errors, and the BIOS/IPMI logs might not either. The EDAC module in Linux warns about
correctable ECC errors, but you need a new enough kernel. Patrol scrubbing should be enabled in the BIOS so that you are alerted to such errors earlier.
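
For anyone who wants to poll the EDAC counters rather than wait for a kernel log message, here is a minimal Python sketch. It assumes the usual EDAC sysfs layout (/sys/devices/system/edac/mc/mc*/ce_count and ue_count) and that the EDAC driver for your memory controller is loaded. The function name read_edac_counts and its edac_root parameter are made up for illustration, not part of any existing tool.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} error counts.

    Assumes the standard EDAC sysfs layout; returns an empty dict when
    the directory is absent (driver not loaded or kernel too old).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
```

A nonzero and growing correctable count on one controller is the kind of early warning that memtest86+ would not report. Running this from cron and comparing against the previous values would be one way to catch a stick that is starting to go.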