Subject: Exclude defect cpus from being used
From: Carsten Emde
Date: Sun, 3 Feb 2013 14:22:41 +0100

An Intel i7-980X multi-core processor regularly crashed with the message

mce: [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406c01d
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 5 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406bfc9
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 4 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
panic occurred, switching back to text console

After the patch below was applied and the kernel parameter
defect_cpus=2,8 was added to the kernel command line, the remaining
5 x 2 cores worked properly.

Signed-off-by: Carsten Emde

---
 Documentation/kernel-parameters.txt |    9 +++++++++
 kernel/cpu.c                        |   14 ++++++++++++++
 2 files changed, 23 insertions(+)

Index: linux-3.12.19-rt30/Documentation/kernel-parameters.txt
===================================================================
--- linux-3.12.19-rt30.orig/Documentation/kernel-parameters.txt
+++ linux-3.12.19-rt30/Documentation/kernel-parameters.txt
@@ -764,6 +764,15 @@ bytes respectively. Such letter suffixes
 			Defaults to the default architecture's huge page size
 			if not specified.
 
+	defect_cpus=	[SMP] Exclude defect cpus from being used
+			Format:
+			<cpu number>,...,<cpu number>
+			or
+			<cpu number>-<cpu number>
+			(must be a positive range in ascending order)
+			or a mixture
+			<cpu number>,...,<cpu number>-<cpu number>
+
 	dhash_entries=	[KNL]
 			Set number of hash buckets for dentry cache.
 
Index: linux-3.12.19-rt30/kernel/cpu.c
===================================================================
--- linux-3.12.19-rt30.orig/kernel/cpu.c
+++ linux-3.12.19-rt30/kernel/cpu.c
@@ -684,6 +684,15 @@ out:
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
+static cpumask_var_t __cpuinitdata cpu_defect_map;
+static int __init setup_defect_cpus(char *str)
+{
+	alloc_bootmem_cpumask_var(&cpu_defect_map);
+	cpulist_parse(str, cpu_defect_map);
+	return 0;
+}
+early_param("defect_cpus", setup_defect_cpus);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_up(unsigned int cpu, int tasks_frozen)
 {
@@ -747,6 +756,12 @@ int cpu_up(unsigned int cpu)
 	pg_data_t	*pgdat;
 #endif
 
+	if (cpumask_test_cpu(cpu, cpu_defect_map)) {
+		set_cpu_present(cpu, 0);
+		pr_warn("Can't online cpu %u. It's marked defect.\n", cpu);
+		return -ENODEV;
+	}
+
 	if (!cpu_possible(cpu)) {
 		printk(KERN_ERR "can't online cpu %d because it is not "
 			"configured as may-hotadd at boot time\n", cpu);
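
Note on the list format: the following is a minimal userspace sketch, not
part of the patch, of how a defect_cpus= value such as "2,8" or "0-3,8"
could be turned into a bitmask and tested. It only illustrates the format
documented above; it is not the kernel's cpulist_parse() implementation.
The names demo_cpulist_parse(), demo_cpu_is_defect() and DEMO_MAX_CPUS are
hypothetical, and the 64-CPU limit, the rejection of an empty string and
the chosen error codes are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_CPUS 64	/* hypothetical limit, one bit per CPU */

/*
 * Parse "<n>,...,<n>", "<n>-<n>" or a mixture into a 64-bit mask.
 * Returns 0 on success, -EINVAL for malformed input or a descending
 * range, -ERANGE for a CPU number >= DEMO_MAX_CPUS.
 * *mask is only written on success.
 */
static int demo_cpulist_parse(const char *str, uint64_t *mask)
{
	uint64_t result = 0;
	const char *p = str;

	assert(str != NULL && mask != NULL);

	for (;;) {
		char *end;
		unsigned long first, last;

		/* every element must start with a digit */
		if (*p < '0' || *p > '9')
			return -EINVAL;
		first = strtoul(p, &end, 10);
		last = first;
		p = end;

		/* optional "-<n>" turns the element into a range */
		if (*p == '-') {
			p++;
			if (*p < '0' || *p > '9')
				return -EINVAL;
			last = strtoul(p, &end, 10);
			p = end;
		}

		if (first > last)	/* ranges must ascend */
			return -EINVAL;
		if (last >= DEMO_MAX_CPUS)
			return -ERANGE;

		for (unsigned long cpu = first; cpu <= last; cpu++)
			result |= UINT64_C(1) << cpu;

		if (*p == '\0')
			break;
		if (*p != ',')
			return -EINVAL;
		p++;
	}

	*mask = result;
	return 0;
}

/* Non-zero if the given CPU is marked in the mask. */
static int demo_cpu_is_defect(uint64_t mask, unsigned int cpu)
{
	return cpu < DEMO_MAX_CPUS && ((mask >> cpu) & 1);
}
```

With the value from the report above, demo_cpulist_parse("2,8", &mask)
sets bits 2 and 8, so demo_cpu_is_defect(mask, 8) is non-zero while
demo_cpu_is_defect(mask, 3) is zero. This mirrors the check the patch adds
to cpu_up() via cpumask_test_cpu().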