Advanced hardware error handling for x86 Linux

Bad page offlining

A common class of memory errors is a single "stuck bit" in a DIMM. The bit stays stuck in a specific state and cannot be rewritten anymore. Other bits in the same DIMM or on the same channel are not affected.

With ECC DIMMs this error can be corrected, so it is not immediately a fatal problem. But when another nearby bit gets corrupted for some reason, this could develop into an uncorrected 2-bit error. In addition, the stuck bit will generate a continuous stream of corrected error reports each time the memory scrubber scrubs it. Handling these reports takes some time and may swamp error thresholds meant for other purposes. The repeated reports do not convey any new information.

The best strategy is to simply stop using the bit. The only entity which has reasonably fine-grained control over that is the operating system. It manages memory by pages (typically 4K in size) and it's possible to offline the page containing the stuck bit.

When running in daemon mode, mcelog keeps track of corrected memory errors per 4K page and maintains an error counter for each page. This is controlled using the [page] section in mcelog.conf. By default, mcelog enables page tracking (if the CPU supports it) and offlines a specific page when a threshold of 10 errors per 24 hours is crossed.
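The per-page accounting described above can be sketched as follows. This is only an illustration, not mcelog's code: the class name PageErrorTracker, the sliding-window counting, and the choice to treat reaching 10 errors as crossing the threshold are all assumptions, and mcelog's real accounting may differ.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12                 # 4K pages: low 12 bits are the offset in the page
WINDOW_SECONDS = 24 * 60 * 60   # 24 hours
THRESHOLD = 10                  # errors per window before offlining


class PageErrorTracker:
    """Hypothetical per-page corrected-error counter (not mcelog's implementation)."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        # page frame number -> timestamps of recent corrected errors
        self.events = defaultdict(deque)

    def record(self, phys_addr, now):
        """Record one corrected error at phys_addr (time in seconds).

        Returns True when the page has reached the threshold within the
        window and should be considered for offlining.
        """
        pfn = phys_addr >> PAGE_SHIFT
        stamps = self.events[pfn]
        stamps.append(now)
        # Forget errors that are older than the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        return len(stamps) >= self.threshold
```

For example, ten errors reported a minute apart at addresses inside the same 4K page make the tenth call to record() return True, while the same ten errors spread over more than a day do not.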

Linux kernels starting with 2.6.33 (and some 2.6.32 kernels with backports) have a page soft-offlining capability. That is, the contents of the page are copied somewhere else (or dropped if not needed) and the original page is removed from the normal operating system memory management and not used anymore.
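As a rough sketch of how a user-space tool might request soft-offlining of one page: kernels built with memory failure handling expose the sysfs file /sys/devices/system/memory/soft_offline_page, which accepts a physical address. The helper names below are hypothetical, the 4K page size is assumed, and the write needs root privileges. This is not necessarily how mcelog itself issues the request.

```python
PAGE_SIZE = 4096  # assumed 4K pages
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def page_base(phys_addr, page_size=PAGE_SIZE):
    """Return the physical address of the start of the page containing phys_addr."""
    return phys_addr & ~(page_size - 1)


def request_soft_offline(phys_addr, path=SOFT_OFFLINE_PATH):
    """Ask the kernel to soft-offline the page containing phys_addr.

    Writes the page-aligned address in hex to the sysfs file. Raises OSError
    if the file is missing, the caller lacks privileges, or the kernel
    refuses because the page is in a state that cannot be offlined.
    """
    base = page_base(phys_addr)
    with open(path, "w") as f:
        f.write(f"{base:#x}")
    return base
```

The path parameter exists only so the helper can be pointed at an ordinary file for a dry run; on a real system the default path is used.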

The capability is called soft-offlining because it never kills or otherwise affects any application, in contrast to the "hard-offlining" that is done when an uncorrected recoverable data error happens.

One caveat is that offlining doesn't work for all pages, only for pages in specific states. However, in common workloads the majority of memory can be soft-offlined.

Hardware

Bad page offlining works on CPUs that provide a physical address on corrected memory machine check errors. These are generally CPUs with an integrated memory controller and ECC memory support. On Intel Xeon 75xx, 65xx, and E7 (Westmere) series CPUs, a special driver has to be loaded for this and the BIOS has to enable the "firmware first" functionality.