[svlug] System Panic Makes My Life Easier

Karen Shaeffer shaeffer at neuralscape.com
Fri Jul 29 08:11:29 PDT 2016


On Thu, Jul 28, 2016 at 11:32:03PM -0700, Rick Moen wrote:
> Quoting Karen Shaeffer (shaeffer at neuralscape.com):
> 
> > I understand GPGPUs are key hardware components for consumer platforms. And in
> > the past 5 years GPGPUs have become the hottest technology in datacenters as
> > well. For example:
> > 
> > https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
> 
> What can I say?  When I find something better than Cerberus (VA-CTCS),
> I'll be glad to recommend it.

Understood.

A good library for stress testing GPGPUs on mainstream platforms would be one
of the open source SIMD libraries. There are others, such as Google's
streamexecutor library, which is in the process of being open sourced in the
parallel-libs subproject of LLVM.

Companies using GPGPUs in their data centers no doubt already have GPGPU stress
testing software they use internally. Eventually, if not true at the moment,
one or more of them will open source their solution. It's just a matter of time.

Thanks for your comments. Always interesting stuff.
Karen

> 
> Meantime, in my experience it's been rather rare for hardware problems
> to reside in GPUs _or_ CPUs -- and when that's the case, it becomes
> rather obvious from the specific symptoms.
> 
> Anyway, I've been meaning to circle back to what Joseph said about past
> cooling problems:
> 
>   The cpu fan has been in bad shape for some time... it's got some 
>   cooling now, but I can easily imagine its lifespan was shortened 
>   by overheating in the past.
> 
> This is astute, sir.  I would, in your shoes, take careful note of the
> fact that heat damage to computing hardware has pernicious long-term
> effects.  I would be very skeptical of the reliability of -any-
> computing hardware that has undergone significant overheating episodes.
> 
> At bare minimum, any component of mine that went through a major
> overheating episode would be classed as suspect from that point forward.
> It would be unlucky for that unreliability to manifest as freeze-ups
> rather than component failure, but that _does happen_.
> 
> Man, though, ten-year-old Opteron systems!  Those were awesome!  That's
> exactly what my employer California Digital Corporation had as its
> bread-and-butter around 2005:  dual-Opteron 2.2GHz 1U rackmount servers,
> in its case.  Those were fabulous for their day -- and also generated a
> prodigious amount of waste heat from the CPUs:  In some cases where the
> CPU fans seized up, I remember that the motherboard actually got
> charred.
> 
> One way to simplify the diagnostic matrix, Joseph:  Look at a calendar.
> ;->   Ten years is an excellent run, and it just might be time to have
> the awesome experience of ringing out the old and ringing in the new
> (hardware).  
> 
> Nostalgia Ain't What It Used to Be.[tm]
> 
> -- 
> Cheers,                    "A man is his own easiest dupe, for what he wishes
> Rick Moen                  to be true he generally believes to be true."
> rick at linuxmafia.com        -- Demosthenes, Third Olynthiac, sct. 19 (349 BCE)
> McQ! (4x80)
> 
> _______________________________________________
> svlug mailing list
> svlug at lists.svlug.org
> http://lists.svlug.org/lists/listinfo/svlug
--- end quoted text ---

-- 
Karen Shaeffer                 Be aware: If you see an obstacle in your path,
Neuralscape Services           that obstacle is your path.        Zen proverb


