Subject: Exclude defective CPUs from being used
From: Carsten Emde <C.Emde@osadl.org>
Date: Sun,  3 Feb 2013 14:22:41 +0100

An Intel i7-980X multi-core processor regularly crashed with the messages:

mce: [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406c01d 
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 5 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406bfc9
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 4 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
panic occurred, switching back to text console

After the kernel patch below is applied and the kernel parameter
  defect_cpus=2,8
is added to the kernel command line, the remaining 5 x 2 logical CPUs
work properly.
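
The list format accepted by the parameter and the check added to cpu_up()
can be illustrated with a small user-space sketch. It is not the kernel
code: sketch_cpulist_parse(), sketch_cpu_up() and SKETCH_MAX_CPUS are
hypothetical names, and the 64-bit mask is an assumption that stands in for
a real cpumask. Rejecting an empty list is a choice made by this sketch only.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel cpumask, limited to 64 CPUs. */
#define SKETCH_MAX_CPUS 64

/*
 * Parse a list such as "2,8" or "0-3,5" into a bitmask.
 * Returns 0 on success or -EINVAL on malformed input, including an
 * empty string, a descending range, or a CPU number >= SKETCH_MAX_CPUS.
 */
static int sketch_cpulist_parse(const char *str, uint64_t *mask)
{
	uint64_t result = 0;
	const char *p = str;

	for (;;) {
		char *end;
		unsigned long first, last;

		/* Every list element must start with a digit. */
		if (*p < '0' || *p > '9')
			return -EINVAL;
		first = strtoul(p, &end, 10);
		last = first;
		p = end;

		/* Optional "-<cpu number>" turns the element into a range. */
		if (*p == '-') {
			p++;
			if (*p < '0' || *p > '9')
				return -EINVAL;
			last = strtoul(p, &end, 10);
			p = end;
		}

		/* Ranges must be ascending and fit into the mask. */
		if (first > last || last >= SKETCH_MAX_CPUS)
			return -EINVAL;

		for (unsigned long cpu = first; cpu <= last; cpu++)
			result |= UINT64_C(1) << cpu;

		if (*p == '\0')
			break;
		if (*p != ',')
			return -EINVAL;
		p++;
	}

	*mask = result;
	return 0;
}

/* Mirrors the check added to cpu_up(): refuse CPUs present in the mask. */
static int sketch_cpu_up(unsigned int cpu, uint64_t defect_mask)
{
	if (cpu < SKETCH_MAX_CPUS && (defect_mask & (UINT64_C(1) << cpu)))
		return -ENODEV;
	return 0;
}
```

With the list "2,8" parsed into a mask, sketch_cpu_up() returns -ENODEV for
CPUs 2 and 8 and 0 for every other CPU number.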

Signed-off-by: Carsten Emde <C.Emde@osadl.org>

---
 Documentation/kernel-parameters.txt |    9 +++++++++
 kernel/cpu.c                        |   14 ++++++++++++++
 2 files changed, 23 insertions(+)

Index: linux-3.12.0-rt1/Documentation/kernel-parameters.txt
===================================================================
--- linux-3.12.0-rt1.orig/Documentation/kernel-parameters.txt
+++ linux-3.12.0-rt1/Documentation/kernel-parameters.txt
@@ -761,6 +761,15 @@ bytes respectively. Such letter suffixes
                        Defaults to the default architecture's huge page size
                        if not specified.
 
+       defect_cpus=    [SMP] Exclude defective CPUs from being used
+                       Format:
+                       <cpu number>,...,<cpu number>
+                       or
+                       <cpu number>-<cpu number>
+                       (must be a positive range in ascending order)
+                       or a mixture
+                       <cpu number>,...,<cpu number>-<cpu number>
+
        dhash_entries=  [KNL]
                        Set number of hash buckets for dentry cache.
 
Index: linux-3.12.0-rt1/kernel/cpu.c
===================================================================
--- linux-3.12.0-rt1.orig/kernel/cpu.c
+++ linux-3.12.0-rt1/kernel/cpu.c
@@ -676,6 +676,15 @@ out:
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
+static cpumask_var_t __cpuinitdata cpu_defect_map;
+static int __init setup_defect_cpus(char *str)
+{
+       alloc_bootmem_cpumask_var(&cpu_defect_map);
+       /* A non-zero return makes the early-param code flag a bad list */
+       return cpulist_parse(str, cpu_defect_map);
+}
+early_param("defect_cpus", setup_defect_cpus);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_up(unsigned int cpu, int tasks_frozen)
 {
@@ -739,6 +748,11 @@ int cpu_up(unsigned int cpu)
        pg_data_t       *pgdat;
 #endif
 
+       if (cpumask_test_cpu(cpu, cpu_defect_map)) {
+               pr_warn("Can't online cpu %u. It's marked defective.\n", cpu);
+               return -ENODEV;
+       }
+
        if (!cpu_possible(cpu)) {
                printk(KERN_ERR "can't online cpu %d because it is not "
                        "configured as may-hotadd at boot time\n", cpu);