Real Time Linux Workshops
14th Real Time Linux Workshop, October 18 to 20, 2012 at the Department of Computer Science, University of North Carolina at Chapel Hill
Real-Time Linux, Multicore ARM and Fast Context Switch a couple of years later: performance analysis
Vanni Genua, Consorzio Roma Ricerche
Luca Recchia, MBDA
Nicola Baroncini, Selex Elsag
Mauro Olivieri, La Sapienza University of Rome
The following research presents a set of real-time performance analyses of hard real-time embedded systems based on Open Source Software and on the ARMv7 architecture, such as the Pandaboard. The analyses are intended as industrial tests for safety/mission critical systems, airborne systems and DO-178B compliant systems.
The first step was porting Linux kernel 3.0.27 with the real-time preemption patch rt46 to the Pandaboard Rev. A2. The Pandaboard Rev. A2 is an embedded board built around a Texas Instruments OMAP4430 chip, which contains a dual-core ARM Cortex-A9 CPU. The ARM Cortex-A9 is a 32-bit processor that implements the ARMv7 instruction set and includes a CoreSight unit, which provides a parallel, low-overhead, high-precision set of performance counters.
Linux kernel 3.0.27 was chosen because of its support for ARMv7 and for the Fast Context Switch Extension (FCSE registers). The FCSE reduces cache flushing and the associated performance penalty by means of virtual addresses shared among several processes [1].
Compared with single-core systems, multicore performance analysis is harder, because more features that affect the worst-case execution time (WCET) must be taken into account: shared caches, multiple pipelines, preemption, priority inversion and some scheduling phenomena can cause timing anomalies, which lead to missed hard real-time deadlines.
In the present research, several performance analysis tools and test codes have been patched, modified, extended and used to derive an overview and a general methodology for testing ARMv7 systems running the Linux 3.0.27-rt46 kernel. In order not to restrict the field of application to a few codes, and to allow multithreaded programming in a hard real-time (HRT) context, statistical, dynamic measurement-based, kernel-based and performance-counter-based approaches to HRT performance analysis have been adopted.
The analysis proceeded by applying the following toolchain:
- integrating Perf, kernelshark and trace-cmd, and enabling Ftrace, to trace wakeup and IRQ latencies, context switches, page faults, CPU migrations and process state events
- patching and integrating LTTng 2.0, Babeltrace and lttng-graph on ARMv7: 1 ns timestamp precision, detailed list of hardware events and kernel/userspace tracepoints, CoreSight support
- employing the Mälardalen benchmarks and real-time tests to compare the results produced by the above tools while varying the input parameters.
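As a rough illustration of the kind of events listed above (context switches and page faults), the sketch below reads per-process counters before and after a workload. It does not use Perf, Ftrace or LTTng: it swaps in the POSIX getrusage interface exposed by Python's standard resource module, which is a much coarser, process-level view. The helper names snapshot, measure and touch_memory, and the 8 MiB buffer size, are assumptions made for the example and do not come from the study.

```python
import resource


def snapshot():
    """Return context-switch and page-fault counters for the current process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "voluntary_ctx_switches": ru.ru_nvcsw,
        "involuntary_ctx_switches": ru.ru_nivcsw,
        "minor_page_faults": ru.ru_minflt,
        "major_page_faults": ru.ru_majflt,
    }


def measure(workload):
    """Run workload() and return how much each counter grew while it ran."""
    before = snapshot()
    workload()
    after = snapshot()
    return {name: after[name] - before[name] for name in before}


def touch_memory(size_bytes=8 * 1024 * 1024, page_size=4096):
    """Hypothetical workload: allocate a buffer and write one byte per page."""
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1
    return len(buf)


if __name__ == "__main__":
    for name, delta in measure(touch_memory).items():
        print(f"{name}: {delta}")
```

Counters of this kind only say how many events occurred in one process; the tracing tools named above additionally record when each event happened and on which core.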
The following testcases have been executed:
- 12 h cyclictest: varying the number of threads, thread priority, core affinity (it determines which thread is assigned to which core), number of loops, interval between threads, duration, break threshold, system load (e.g. executing hackbench in an SSH shell)
- hackbench: varying the message length, number of threads per group, number of groups; measuring the execution time
- rt-migrate-test: varying the priority of the threads that are going to be migrated, the number of loops, the number of child threads
- page-fault latency tests: simple/dynamic memory lock tests
- pmq test: POSIX message queues test; varying thread priority, the number of threads, core affinity; reporting latencies (Min, Avg, Max)
- Malardalen benchmarks: not standalone, but run under the tracing tools
- pi_stresstest to test priority inversion
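The measurement principle behind a cyclictest-style run can be sketched as follows: a thread sleeps until a periodic deadline and records how late it actually wakes up, then reports Min, Avg and Max. This is a minimal sketch, not the cyclictest implementation: it runs in the default scheduling class (no real-time priority is set), uses time.sleep instead of an absolute-deadline sleep, and the function names and default parameters are assumptions made for the example.

```python
import os
import time


def measure_wakeup_latency(loops=200, interval_us=1000, cpu=None):
    """Return wakeup latencies in microseconds for `loops` periodic wakeups."""
    # Optional core affinity (Linux only); cpu is an assumed parameter name.
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})

    interval_ns = interval_us * 1000
    latencies_us = []
    next_wakeup = time.monotonic_ns() + interval_ns
    for _ in range(loops):
        remaining_ns = next_wakeup - time.monotonic_ns()
        if remaining_ns > 0:
            time.sleep(remaining_ns / 1e9)
        now = time.monotonic_ns()
        # Latency = how far past the intended wakeup time we actually woke up.
        latencies_us.append((now - next_wakeup) / 1000.0)
        next_wakeup += interval_ns
    return latencies_us


def summarize(latencies_us):
    """Return the Min/Avg/Max summary of a non-empty list of latencies."""
    return {
        "min": min(latencies_us),
        "avg": sum(latencies_us) / len(latencies_us),
        "max": max(latencies_us),
    }


if __name__ == "__main__":
    stats = summarize(measure_wakeup_latency())
    print("Min: {min:.1f} us  Avg: {avg:.1f} us  Max: {max:.1f} us".format(**stats))
```

Running the same loop with and without background load, or pinned to different cores through the cpu argument, mirrors on a small scale the parameter variations listed for the test cases.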
The above-mentioned test cases have been executed comparatively: standalone, invoked by LTTng, invoked by trace-cmd, invoked by Perf, and plotted by kernelshark or by lttng-graph. Results depend on the input parameters passed in each run. In this way the main causes of latency become detectable.
Conclusions
The illustrated performance analysis is worthwhile because the performance of multithreaded/multicore systems cannot be expressed by a single number (e.g., the WCET). Many factors can affect execution time, such as cache sharing, page faults, context switches and priorities. Execution time analysis should therefore provide different points of view, both for detecting bottlenecks and for tuning the system to improve performance and reduce latencies.
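A small sketch of why one number is not enough: the same set of latency samples looks very different depending on whether the average, a percentile or the maximum is reported. The sample values, the 100 us threshold and the function names are made up for illustration and are not results of the study; the percentile uses the nearest-rank definition.

```python
def percentile(sorted_samples, pct):
    """Nearest-rank percentile (integer pct, 1..100) of an ascending-sorted list."""
    n = len(sorted_samples)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_samples[rank - 1]


def latency_profile(samples_us, threshold_us=100):
    """Summarize latency samples from several points of view."""
    ordered = sorted(samples_us)
    return {
        "min": ordered[0],
        "avg": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        # Number of samples above an assumed threshold.
        "over_threshold": sum(1 for s in ordered if s > threshold_us),
    }


if __name__ == "__main__":
    # Made-up data: 99 wakeups at 10 us and a single 500 us outlier.
    samples = [10] * 99 + [500]
    for name, value in latency_profile(samples).items():
        print(f"{name}: {value}")
```

With these made-up samples the average and even the 99th percentile stay close to the typical value, while the maximum and the threshold count reveal the single outlier that would matter for a hard real-time deadline.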
A complete performance analysis has to consider both kernel events (e.g., the wakeup latency spent by the kernel to launch/restore a process, or the IRQ latency spent by the kernel to execute an interrupt handler) and hardware events (e.g., timestamps read from a special-purpose time register or Program Flow Traces read from a CoreSight unit, which are a tighter and less intrusive way of measuring delays). Different kinds of tests, such as single-threaded/multi-threaded, interfering, synchronized or with specified affinity/priority, expose the context-dependent behavior of ARMv7 architectures.
Resolving these issues is exactly what is needed in an industrial context, especially because Open Source tools and codes make it possible to produce thorough, detailed system performance estimations which fulfil the demands of mission/safety critical analyses.
In addition, multi-tool analyses allow the results from one tool to be compared with those from other tools, thus increasing the reliability of measurements and statistics. Last but not least, graphical tools provide immediate and comprehensible scenarios, e.g. the IRQ latencies shown by kernelshark and the number of context switches per time unit shown by lttng-graph.
References
[1] “Using the real-time preemption patch on ARM CPUs”, Jan Altenberg, 2009. Linutronix GmbH
[2] “Dynamic memory allocation on real-time Linux”, Jianping Shen, Michael Hamal, 2011. Institut Dr. Foerster GmbH und Co. KG
[3] “Better Trace for Better Software”, Roberto Mijat, 2010. ARM Ltd
[4] “A survey of WCET analysis of real-time operating systems”, Mingsong Lv, Nan Guan, Yi Zhang, Qingxu Deng, Ge Yu, Jianming Zhang, 2009. Northeastern University Shenyang
[5] “When Do Real Time Systems Need Multiple CPUs?”, Paul McKenney, 2010. IBM