Advanced hardware error handling for x86 Linux
Papers and presentations
Studies about memory errors
A study on memory errors from the University of Rochester:
"A Realistic Evaluation of Memory Hardware Errors and Software System
Susceptibility", Li, Huang, Shen, Chu: Usenix Annual Tech Conference 2010
The Google memory error study:
"DRAM Errors in the Wild: A Large-Scale Field Study", Schroeder,
Pinheiro, Weber, SIGMETRICS, 2009.
Note: there are various indications that the Google numbers
are significantly higher than those seen in typical servers. Using them for planning purposes is not recommended.
A classic study on the benefits of automatic bad page offlining:
"Assessment of the Effect of Memory Page Retirement on Systems RAS against
Hardware Faults", Tang, Carruthers, Totari, Shapiro:
Proceedings of the 2006 International Conference on Dependable Systems and Networks (DSN 2006).
A newer study that reaches the same conclusion, namely that automatic page offlining is a good idea:
"Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design", Hwang, Stefanovici, Schroeder, ASPLOS 2012 (non-paywalled PDF).
on RAS in recent server processors.
The Intel Software Developer's Manual describes the low-level register
interface of the x86 machine check architecture.
Machine checks are described in Volume 3A: System Programming Guide.
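
As a rough illustration of what that register interface looks like, the sketch below decodes the architectural flag bits of a machine check bank status value (IA32_MCi_STATUS). The function name `decode_mci_status` is hypothetical, and the bit positions are given as commonly documented for the Intel manual; they should be verified against the manual itself, and model-specific fields are not interpreted.

```python
def decode_mci_status(status):
    """Split a 64-bit machine check bank status value into its architectural fields."""

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),          # VAL: the register holds a logged error
        "overflow": bit(62),       # OVER: an earlier error was overwritten
        "uncorrected": bit(61),    # UC: the error was not corrected by hardware
        "enabled": bit(60),        # EN: reporting was enabled for this error
        "misc_valid": bit(59),     # MISCV: the bank's MISC register is valid
        "addr_valid": bit(58),     # ADDRV: the bank's ADDR register is valid
        "context_corrupt": bit(57),  # PCC: processor context may be corrupted
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Made-up example value: valid, address valid, MCA error code 0x009F.
    example = (1 << 63) | (1 << 58) | 0x009F
    for name, value in decode_mci_status(example).items():
        print(name, value)
```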
The mce-inject injector tool and the
mce-test test suite can be used to test machine check handling.
They complement the mcelog test suite included with the source.
The Linux EDAC project on SourceForge. EDAC is an alternative approach to reporting memory errors. See also the
EDAC discussion in the FAQ.