Subject: Exclude defective CPUs from being used
From: Carsten Emde <C.Emde@osadl.org>
Date: Sun,  3 Feb 2013 14:22:41 +0100

An Intel i7-980X multi-core processor regularly crashed with the
following messages:

mce: [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406c01d 
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 5 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406bfc9
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 4 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
panic occurred, switching back to text console

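With the patch below applied and the kernel parameter
  defect_cpus=2,8
added to the kernel command line, the remaining 5 x 2 hardware threads
work properly. The patch adds the defect_cpus= early parameter and makes
cpu_up() return -ENODEV for every CPU in that list, so the kernel
refuses to bring the listed CPUs online.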
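
For illustration only, the list format accepted by defect_cpus= (single
CPU numbers, ascending ranges, or a mixture) can be sketched in user
space as follows. This is not the kernel's cpulist_parse(); the names
parse_cpulist and MAX_CPUS are hypothetical, and the 64-CPU limit is an
assumption made to keep the example short. Edge cases may be handled
differently by the kernel.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CPUS 64	/* assumption: one 64-bit word is enough here */

/*
 * Parse a CPU list such as "2,8" or "0-3,7" into a bitmask.
 * Returns 0 on success, -1 on malformed input.
 * Hypothetical user-space helper, not part of the patch.
 */
static int parse_cpulist(const char *s, unsigned long long *mask)
{
	assert(s && mask);
	*mask = 0;

	while (*s) {
		char *end;
		unsigned long lo, hi, cpu;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;

		/* Optional "-<cpu number>" turns the entry into a range. */
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}

		/* Ranges must be ascending and fit into the mask. */
		if (hi < lo || hi >= MAX_CPUS)
			return -1;
		for (cpu = lo; cpu <= hi; cpu++)
			*mask |= 1ULL << cpu;

		/* Entries are separated by a single comma. */
		if (*end == ',') {
			end++;
			if (*end == '\0')
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	return 0;
}
```

With the parameter value used above, parse_cpulist("2,8", &mask) sets
bits 2 and 8, i.e. mask becomes 0x104.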

Signed-off-by: Carsten Emde <C.Emde@osadl.org>

---
 Documentation/kernel-parameters.txt |    9 +++++++++
 kernel/cpu.c                        |   14 ++++++++++++++
 2 files changed, 23 insertions(+)

Index: linux-4.4.39-rt50/Documentation/kernel-parameters.txt
===================================================================
--- linux-4.4.39-rt50.orig/Documentation/kernel-parameters.txt
+++ linux-4.4.39-rt50/Documentation/kernel-parameters.txt
@@ -876,6 +876,15 @@ bytes respectively. Such letter suffixes
                        Defaults to the default architecture's huge page size
                        if not specified.
 
+       defect_cpus=    [SMP] Exclude defective CPUs from being used
+                       Format:
+                       <cpu number>,...,<cpu number>
+                       or
+                       <cpu number>-<cpu number>
+                       (must be a positive range in ascending order)
+                       or a mixture
+                       <cpu number>,...,<cpu number>-<cpu number>
+
        dhash_entries=  [KNL]
                        Set number of hash buckets for dentry cache.
 
Index: linux-4.4.39-rt50/kernel/cpu.c
===================================================================
--- linux-4.4.39-rt50.orig/kernel/cpu.c
+++ linux-4.4.39-rt50/kernel/cpu.c
@@ -802,6 +802,15 @@ void smpboot_thread_init(void)
        register_cpu_notifier(&smpboot_thread_notifier);
 }
 
+static cpumask_var_t cpu_defect_map;
+static int __init setup_defect_cpus(char *str)
+{
+       alloc_bootmem_cpumask_var(&cpu_defect_map);
+       cpulist_parse(str, cpu_defect_map);
+       return 0;
+}
+early_param("defect_cpus", setup_defect_cpus);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_up(unsigned int cpu, int tasks_frozen)
 {
@@ -858,6 +867,11 @@ int cpu_up(unsigned int cpu)
 {
        int err = 0;
 
+       if (cpumask_test_cpu(cpu, cpu_defect_map)) {
+               pr_warn("Can't online cpu %u. It's marked defective.\n", cpu);
+               return -ENODEV;
+       }
+
        if (!cpu_possible(cpu)) {
                pr_err("can't online cpu %d because it is not configured as may-hotadd at boot time\n",
                       cpu);