[svlug] More links about ssds

Karen Shaeffer shaeffer at neuralscape.com
Thu Jun 2 15:50:11 PDT 2016


On Thu, Jun 02, 2016 at 11:38:28AM -0700, John Conover wrote:
> 
> Since SSD memory elements are semiconductor based, it is probably a
> reasonable assumption that failures are ergodic, (including
> write/erase cycles.)

Hi John,
Agreed, to the extent it is relevant. Most semiconductor failures of the chip
core are tied to the fabrication process. Extremely small levels of
contaminants and extremely small variances in the fabrication process can have
a dramatic effect on MTBF. Of course, as you mention, temperature is a first-
order parameter in such failure modes: temperature appears as a third-power
exponential term in semiconductor reliability equations.
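
The rule of thumb mentioned elsewhere in this thread -- that MTBF roughly
halves for every ~18 C rise in device temperature -- can be sketched as a
simple scaling function. This is an illustrative sketch only, not a vendor
reliability model; the function name and example numbers are made up, and real
models (e.g. Arrhenius) need activation energies:

```python
# Illustrative sketch (not a vendor model): MTBF roughly halves for
# every ~18 degC rise in junction temperature.

def mtbf_at_temp(mtbf_ref_hours, temp_ref_c, temp_c, halving_step_c=18.0):
    """Scale a reference MTBF to a new junction temperature using the
    'MTBF halves per halving_step_c degrees' rule of thumb."""
    return mtbf_ref_hours * 2.0 ** ((temp_ref_c - temp_c) / halving_step_c)

# Example: a part rated 1,000,000 hours at 70 degC junction temperature.
print(mtbf_at_temp(1_000_000, 70, 88))   # 18 degC hotter -> 500000.0 hours
print(mtbf_at_temp(1_000_000, 70, 52))   # 18 degC cooler -> 2000000.0 hours
```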

But what is rather interesting is that most chip failures are due to the
external pin connections to the chip. The board-level manufacturing process
can produce defective pin connections and can also impart excessive heat to the
pins that is absorbed right into the chip core. It is a well-known fact that
the most important factor in the failure rate of a chip is its pin count.

I claim that consumers who buy laptops with SSD storage run a far higher risk
of suffering a generic integrated-circuit failure than of actually wearing out
the write capacity of the SSD. Those generic failure modes include less-than-
optimal chip fabrication, operating temperature, physical stress on the circuit
board, thermal chip damage during board-level manufacturing, and defective chip
pin connections from board-level manufacturing.

Sarah quoted one of those research papers that claimed the SSD died with no
write-capacity-related warning. It sounds very likely that a generic
integrated-circuit failure mode was the root cause.

Even so, all the discussion about SSD write capacity is very interesting in
its own right. Thanks to everyone for taking the time to express their point of
view. And thanks to Rick for taking the time to do the presentation!

enjoy,
Karen.

> 
> MTBF does not mean how long a device would be expected to last.
> 
> MTBF means that given sufficiently many devices, half would have
> failed by the MTBF, and half would still be running, (this is NOT
> true if failures are non-ergodic.)
> 
> If a system depends on multiple devices to function, (NOT a striped
> array; SCSI, for example, which can tolerate at least a single
> failure,) doubling the number of devices reduces the system MTBF by a
> factor of two, (if failures are ergodic.)
> 
>     John
> 
> BTW, further, the MTBF is a function of operating temperature for
> semiconductor devices, (usually specified at 105F ambient for consumer
> products, which means about 70C junction temperature, for a
> "reasonable," design, at 105F ambient, i.e., a commodity PC.)
> 
> Increasing the device temperature by 18C decreases the MTBF by a factor
> of two, meaning whatever is going to fail, will fail in half the time
> for a specific device.
> 
> Sarah Newman writes:
> >
> > I get the impression it's more likely for SSDs with the same age and
> > workload to fail concurrently than with hard drives, though I don't
> > know how serious that risk is in practice if the SSDs you're using
> > aren't designed to self-destruct after a certain number of wear
> > cycles.
> >
> > It doesn't hurt to mix manufacturers and/or hours of use within a
> > single array. Unevenly distributed parity (if using parity-based
> > RAID) is another approach to reduce the possibility of having
> > multiple drives fail concurrently:
> > http://www.cs.yale.edu/homes/mahesh/papers/eurosys10-diffraid.pdf
> 
> -- 
> 
> John Conover, conover at rahul.net, http://www.johncon.com/
> 
--- end quoted text ---
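
John's series-system arithmetic can be sketched as follows, under the
assumption of independent, memoryless (exponential) failures: failure rates
add, so a system that needs all its devices working has an MTBF equal to the
reciprocal of the summed rates, and doubling the device count halves the
system MTBF. The function and numbers below are illustrative only:

```python
# Sketch of series-system MTBF under independent exponential failures
# (the memoryless case): rates add, so system MTBF is the reciprocal
# of the summed per-device failure rates. Numbers are made up.

def system_mtbf(device_mtbfs):
    """MTBF of a system that fails when ANY one device fails."""
    return 1.0 / sum(1.0 / m for m in device_mtbfs)

# Two identical 1,000,000-hour devices give a ~500,000-hour system MTBF.
print(system_mtbf([1_000_000, 1_000_000]))
# Four devices -> ~250,000 hours: doubling the count halves the MTBF.
print(system_mtbf([1_000_000] * 4))
```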

-- 
Karen Shaeffer                 Be aware: If you see an obstacle in your path,
Neuralscape Services           that obstacle is your path.        Zen proverb


