Subject: Exclude defective CPUs from being used
From: Carsten Emde <C.Emde@osadl.org>
Date: Sun,  3 Feb 2013 14:22:41 +0100

An Intel i7-980X multi-core processor regularly crashed with these messages:

mce: [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406c01d
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 5 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406bfc9
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 4 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
panic occurred, switching back to text console

After the kernel patch below was applied and the kernel parameter
  defect_cpus=2,8
was added to the kernel command line, the remaining 5 x 2 logical CPUs
(five cores with two hyperthreads each) have been working properly.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>

---
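Note on the list format: in the kernel, the value of defect_cpus= is
handed to cpulist_parse(). The fragment below is only a userspace sketch
of that kind of parsing, written to illustrate which inputs such as "2,8"
or "0-3,8" are meant to be accepted. The names sketch_parse_cpu_list(),
sketch_cpu_is_defect() and SKETCH_MAX_CPUS are hypothetical, the 64-CPU
limit is an assumption of the sketch, and the code is not part of this
patch.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_MAX_CPUS 64 /* hypothetical limit, chosen to fit one 64-bit mask */

/*
 * Userspace illustration only (not kernel code): parse a list such as
 * "2,8" or "0-3,8" into a bitmask. Returns 0 on success and -EINVAL on
 * malformed input, on a descending range, or on a CPU number that does
 * not fit into the mask. *mask is written only on success.
 */
int sketch_parse_cpu_list(const char *str, unsigned long long *mask)
{
	unsigned long long result = 0;
	const char *p = str;

	if (!p || !*p)
		return -EINVAL;

	for (;;) {
		char *end;
		unsigned long first, last, cpu;

		/* Every element must start with a digit. */
		if (*p < '0' || *p > '9')
			return -EINVAL;
		first = strtoul(p, &end, 10);
		last = first;
		p = end;

		/* Optional "-<cpu number>" turns the element into a range. */
		if (*p == '-') {
			p++;
			if (*p < '0' || *p > '9')
				return -EINVAL;
			last = strtoul(p, &end, 10);
			p = end;
		}

		/* Ranges must ascend and stay inside the mask. */
		if (first > last || last >= SKETCH_MAX_CPUS)
			return -EINVAL;

		for (cpu = first; cpu <= last; cpu++)
			result |= 1ULL << cpu;

		if (*p == '\0')
			break;
		if (*p != ',')
			return -EINVAL;
		p++;
	}

	*mask = result;
	return 0;
}

/* Non-zero if the given CPU number is set in the mask. */
int sketch_cpu_is_defect(unsigned int cpu, unsigned long long mask)
{
	return cpu < SKETCH_MAX_CPUS && ((mask >> cpu) & 1ULL);
}
```

With the value from the report above, "2,8", the sketch sets bits 2 and 8
only, so a lookup for CPU 2 or CPU 8 succeeds and any other CPU number is
left usable.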
 Documentation/kernel-parameters.txt |    9 +++++++++
 kernel/cpu.c                        |   15 +++++++++++++++
 2 files changed, 24 insertions(+)

Index: linux-3.12.31-rt45/Documentation/kernel-parameters.txt
===================================================================
--- linux-3.12.31-rt45.orig/Documentation/kernel-parameters.txt
+++ linux-3.12.31-rt45/Documentation/kernel-parameters.txt
@@ -764,6 +764,15 @@ bytes respectively. Such letter suffixes
 			Defaults to the default architecture's huge page size
 			if not specified.
 
+	defect_cpus=	[SMP] Exclude defective CPUs from being used
+			Format:
+			<cpu number>,...,<cpu number>
+			or
+			<cpu number>-<cpu number>
+			(must be a positive range in ascending order)
+			or a mixture
+			<cpu number>,...,<cpu number>-<cpu number>
+
 	dhash_entries=	[KNL]
 			Set number of hash buckets for dentry cache.
 
Index: linux-3.12.31-rt45/kernel/cpu.c
===================================================================
--- linux-3.12.31-rt45.orig/kernel/cpu.c
+++ linux-3.12.31-rt45/kernel/cpu.c
@@ -684,6 +684,15 @@ out:
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
+static cpumask_var_t __cpuinitdata cpu_defect_map;
+static int __init setup_defect_cpus(char *str)
+{
+	alloc_bootmem_cpumask_var(&cpu_defect_map);
+	/* A non-zero return makes the early-param code warn about bad input. */
+	return cpulist_parse(str, cpu_defect_map);
+}
+early_param("defect_cpus", setup_defect_cpus);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_up(unsigned int cpu, int tasks_frozen)
 {
@@ -747,6 +756,12 @@ int cpu_up(unsigned int cpu)
 	pg_data_t	*pgdat;
 #endif
 
+	if (cpumask_test_cpu(cpu, cpu_defect_map)) {
+		set_cpu_present(cpu, 0);
+		pr_warn("Can't online cpu %u. It's marked defective.\n", cpu);
+		return -ENODEV;
+	}
+
 	if (!cpu_possible(cpu)) {
 		printk(KERN_ERR "can't online cpu %d because it is not "
 			"configured as may-hotadd at boot time\n", cpu);