Real Time Linux Workshops

15th Real Time Linux Workshop, October 28 to 31, 2013 at the Dipartimento Tecnologie Innovative, Scuola Universitaria Professionale della Svizzera Italiana in Lugano-Manno, Switzerland

"Embers in the ashes" or how to squeeze diagnostic information out of a crashed Linux system

The Linux kernel provides a number of very powerful diagnostic tools for tracing and debugging; in consequence, most kernel problems can be located and fixed in a couple of hours. In rare cases, however, the kernel may simply stop executing and provide no information about which core is executing which instruction and why it is blocked. Such problems may sometimes take months, if not years, to locate and fix. They are often related to a race condition resulting from an erroneous locking strategy; in consequence, the more often a system is preempted, the higher the probability of such crashes. In other words, a PREEMPT_RT-equipped Linux kernel will suffer from silent crashes more often than the standard, non-preemptive Linux kernel.

Given that Linux is intended for use in safety-critical environments such as railway control and driving assistance systems, it is mandatory to fix all kernel bugs that cause a Linux system to randomly stop executing. The prerequisite for such bug fixing is the availability of adequate tools.

A well-known method of investigating silent crashes is the SysRq break mechanism, which can be triggered via the keyboard or even via network ICMP signaling. Such triggers can be used to dump the state of all tasks, inspect timers, force a kernel panic, etc. But what can we do if the keyboard and the network have crashed as well? For this purpose, the non-maskable interrupt (NMI) was invented. Specific interrupt handlers can be supplied to let the NMI trigger a particular SysRq command if a related input bit is set. The standard parallel port or GPIO control pins can be used for this purpose. A simple parallel port plug is available at OSADL along with a related kernel driver. But what can we do if the industry decides to replace NMI-equipped architectures with chips, such as ARM processors, that do not have an NMI?
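The SysRq commands mentioned above (dumping task state, inspecting timers, forcing a panic) can also be issued from user space by writing a single command character to /proc/sysrq-trigger, as long as the system still responds. The following is a minimal sketch of that software path only; it does not cover the NMI-based trigger discussed in the paper. The helper name trigger_sysrq, the action names in the table and the trigger_path parameter are assumptions made for illustration; the command keys follow the kernel's SysRq documentation.

```python
# Hypothetical helper: the action names are illustrative labels, while the
# single-letter keys are the ones listed in the kernel's SysRq documentation.
SYSRQ_COMMANDS = {
    "dump_tasks": "t",      # dump the list of current tasks and their state
    "show_timers": "q",     # dump armed hrtimers and clockevent devices
    "show_blocked": "w",    # dump tasks in uninterruptible (blocked) state
    "backtrace_cpus": "l",  # stack backtrace of all active CPUs
    "crash": "c",           # force a system crash
}


def trigger_sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for `action` to the trigger file and return the key.

    Writing to the real /proc/sysrq-trigger requires root privileges; the
    kernel prints the resulting output to the kernel log and console.
    """
    key = SYSRQ_COMMANDS[action]
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
    return key
```

Run as root, trigger_sysrq("dump_tasks") would ask the kernel to log the state of all tasks. The trigger_path parameter exists only so the helper can be exercised against an ordinary file without touching a live system.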

This paper describes in detail the various options for investigating a silently crashed Linux system, based on examples of recently detected and fixed Linux kernel locking bugs. It also reports on discussions with ARM engineers about how the lack of an NMI in this architecture can be compensated for by other diagnostic tools and strategies. We need to make sure that, whenever a Linux system crashes, the embers in its ashes are used to understand the underlying mechanism of the crash and that a fix can be provided in a reasonable amount of time.