From 011d8261117249eab97bc86a8e1ac7731e03e319 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Mon, 27 Mar 2017 11:33:02 +0200 Subject: RAS: Add a Corrected Errors Collector Introduce a simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it and then the rest of the chain doesn't hear about it - basically the main reason for the CE collector - or to continue running the notifiers. When the CEC hits the action threshold, it will try to soft-offine the page containing the ECC and then the whole decoding chain gets to see the error. Signed-off-by: Borislav Petkov Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-edac Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de Signed-off-by: Ingo Molnar --- arch/x86/ras/Kconfig | 14 ++++++++++++++ 1 file changed, 14 insertions(+) (limited to 'arch/x86/ras') diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig index 0bc60a308730..2a2d89d39af6 100644 --- a/arch/x86/ras/Kconfig +++ b/arch/x86/ras/Kconfig @@ -7,3 +7,17 @@ config MCE_AMD_INJ aspects of the MCE handling code. WARNING: Do not even assume this interface is staying stable! + +config RAS_CEC + bool "Correctable Errors Collector" + depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS + ---help--- + This is a small cache which collects correctable memory errors per 4K + page PFN and counts their repeated occurrence. Once the counter for a + PFN overflows, we try to soft-offline that page as we take it to mean + that it has reached a relatively high error count and would probably + be best if we don't use it anymore. + + Bear in mind that this is absolutely useless if your platform doesn't + have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS. + -- cgit v1.2.3-70-g09d2