Flash lifetime can’t be ignored. Late last year, Tesla had problems with the flash storage in its connected cars. The company was in the news again this month, when the Office of Defects Investigation released its report summarizing the MCU failures that affect approximately 159,000 vehicles. This is interesting as much for the report as for the reaction among embedded developers, some of whom still don’t understand that flash media has a limited lifetime.
Examining the metrics, visualizing the costs
Let’s take a closer look at the report itself. The Office of Defects Investigation report noted that there were 2,936 complaints, with thankfully no injuries, fatalities, crashes, or fires. Another 12,523 warranty and non-warranty claims for MCU replacements are also factored into this report. It is good that none of these MCU failures are directly related to safety. The closest safety-related consequences of a total failure seem to be the loss of HVAC controls (for defogging windows) and of the Advanced Driver Assistance Systems (ADAS).
What is interesting about the report is Tesla’s internal metrics for measuring the flash media wear in the vehicle. Each erase block on the Hynix media is rated for 3000 Program/Erase (P/E) cycles in total. Tesla described nominal daily P/E cycle use as a rate of 0.7 per block, and estimated that at that rate 11-12 years would be required to accumulate a total of 3000 P/E cycles per block. For the 8 GB media, that would work out to 5.6 GB written to the media per day. The file system writes considerably less than that, of course, because write amplification turns each file system write into a larger amount written to the media.
Also highlighted were higher-rate users, the 95th percentile of daily use. For them, Tesla expected a daily P/E cycle rate as high as 1.5 per block, at which it would take only 5-6 years to accumulate the maximum P/E cycles.
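As a rough check, the lifetime figures above follow from dividing the rated cycle count by the daily rate. The Python sketch below reproduces that arithmetic. The function names are illustrative and do not come from the report, and the calculation assumes wear is spread evenly, so that every erase block sees the average daily rate.

```python
RATED_PE_CYCLES = 3000   # per erase block, as quoted from the report
CAPACITY_GB = 8          # size of the original eMMC

def years_to_wear_out(daily_pe_rate, rated_cycles=RATED_PE_CYCLES):
    """Years until a block reaches its rated P/E cycles at a constant daily rate."""
    return rated_cycles / daily_pe_rate / 365.25

def media_gb_per_day(daily_pe_rate, capacity_gb=CAPACITY_GB):
    """Data physically written per day if every block cycles at the given rate."""
    return daily_pe_rate * capacity_gb

for label, rate in (("nominal", 0.7), ("95th percentile", 1.5)):
    print(f"{label}: {years_to_wear_out(rate):.1f} years, "
          f"{media_gb_per_day(rate):.1f} GB/day to the media")
```

Both results land inside the ranges Tesla gave: between 11 and 12 years at a rate of 0.7, and between 5 and 6 years at a rate of 1.5.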
The rates of 0.7 and 1.5 are dependent on the chosen media and available space, of course. As of May 2020, Tesla remanufacturing began producing spare parts incorporating a Micron eMMC with 64 GB of storage. This should also bring those rates down by a factor of 8 – assuming the Micron part has a similar P/E cycle lifetime.
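The factor of 8 comes from spreading the same daily write volume over eight times as many blocks. A minimal sketch of that scaling, assuming perfectly even wear leveling and an unchanged workload; the function name is hypothetical:

```python
def scaled_pe_rate(old_rate, old_capacity_gb, new_capacity_gb):
    """Daily P/E rate per block when the same write volume lands on larger media."""
    return old_rate * old_capacity_gb / new_capacity_gb

# Rates from the report for the 8 GB part, rescaled to a 64 GB part
for old_rate in (0.7, 1.5):
    new_rate = scaled_pe_rate(old_rate, old_capacity_gb=8, new_capacity_gb=64)
    print(f"{old_rate} per day on 8 GB -> {new_rate:.4f} per day on 64 GB")
```

In practice the gain also depends on the new part's rated P/E cycles and on how much of the extra capacity is actually left free, neither of which is stated here.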
Importantly, all the complaints and claims for MCU replacements represent just a small percentage of the roughly 159,000 Model S and Model X vehicles. Tesla did indicate that MCU failures are likely to continue to occur in subject vehicles until 100% of units have failed. An expensive replacement of either the media or the entire MCU board is the only remedy. Tesla has admitted as much, recently informing its customers by email that the eMMC in the affected vehicles is warranted for 8 years or 160,000 km, and that it will upgrade the part from 8 GB to 64 GB. Tesla has also agreed to reimburse past repairs. All in all, a costly outcome.
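To put "a small percentage" into numbers, the counts quoted earlier can be set against the fleet size. This is back-of-the-envelope arithmetic only. It assumes complaints and claims do not overlap, which the figures quoted here do not establish, so the result is best read as an upper bound.

```python
COMPLAINTS = 2_936   # complaints noted in the report
CLAIMS = 12_523      # warranty and non-warranty MCU replacement claims
FLEET = 159_000      # approximate number of subject vehicles

def share_of_fleet(reports, fleet=FLEET):
    """Reports as a percentage of the subject vehicle population."""
    return 100.0 * reports / fleet

print(f"{share_of_fleet(COMPLAINTS + CLAIMS):.1f}% of subject vehicles")
```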
Can patching up the damage be enough?
Tesla has not been idle. Through OTA updates, Tesla has already released six firmware patches to help deal with the problem. These patches tried first to alleviate the previously mentioned loss of HVAC controls and ADAS problems. Overall, the patches have ranged from removing high-frequency system log messages and suppressing informational log messages when no user was in the vehicle, to increasing the journal commit interval and reducing the settings database write frequency.
Unfortunately, these firmware patches are unlikely to be enough. Once P/E cycles are used, they cannot be regained without replacing the media. A patch late in the cycle will, at best, add only a year of life to the vehicle.
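The reasoning is simple arithmetic: a patch can only slow the consumption of the cycles that remain. The sketch below illustrates this with made-up numbers. Only the 3000-cycle rating and the 1.5 daily rate come from the report; the cycles already used and the post-patch rate are hypothetical values chosen for illustration.

```python
RATED_PE_CYCLES = 3000   # per erase block, as quoted from the report

def days_remaining(cycles_used, daily_pe_rate, rated_cycles=RATED_PE_CYCLES):
    """Days until the rated cycle count is reached at a constant daily rate."""
    remaining = max(rated_cycles - cycles_used, 0)
    return remaining / daily_pe_rate

cycles_used = 2600   # hypothetical: device already late in its life
old_rate = 1.5       # 95th percentile rate from the report
new_rate = 0.75      # hypothetical: the patch halves the write rate

gained = days_remaining(cycles_used, new_rate) - days_remaining(cycles_used, old_rate)
print(f"extra life from the patch: {gained:.0f} days")
```

Under these assumed numbers, halving the write rate buys well under a year. The same patch applied to a nearly new device would roughly double its remaining life, which is why the timing matters.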
It is also unlikely that future automotive designs will be able to solve these problems by reducing logging. If anything, data recording is expected to grow over the next decade. Hypervisors and domain controllers collect data from multiple sensors, storing it to common media devices. Another, larger source of growth will be autonomous vehicles, with multiple video streams and even more sensor data. These factors highlight the continuing importance of edge storage in the vehicle, as well as proper flash memory management.
Understand the storage stack – before things go wrong
So where should Tesla go from here to deal with all this? Tuxera have encountered issues like Tesla’s numerous times. Their recommendation remains the same as when they wrote about this topic a year ago: a complete and correct understanding of the memory devices (and their limitations) and of the other software components related to data management (the file system and flash management) is key to designing systems that are robust. This is the approach that guides Tuxera’s continued collaboration with customers and partners on activities such as workload analysis, lifetime estimation, write amplification measurement, and ultimately the selection of the data management software.
The Office of Defects Investigation report paints a picture of the potential damage that can result from an incomplete understanding of a vehicle’s storage stack. With proper flash memory testing methods tailored to the needs of a given use case, flash memory failures can be prevented more effectively.
In 2019, Datalight became a wholly owned subsidiary of Tuxera. Founded in 1983, Datalight have been at the forefront of developing reliable file systems and flash memory drivers for the embedded market.