Open Source Automation Development Lab
2013-10-07 12:00

Bad things come to those who wait

By: Carsten Emde

Real-time Linux kernel for AM 335x long-time stable now

One of OSADL's most important tasks is the quality assessment and assurance of Open Source software projects, the most prominent of them being the Linux real-time kernel, also known as PREEMPT_RT. This work is mainly done in the OSADL test center called QA Farm, where "QA" stands for both Quality Assessment and Quality Assurance. Currently, the QA Farm hosts a wide variety of Linux-driven CPU boards:

  • ARM, MIPS, PowerPC and x86 base architectures
  • Year of CPU design from 1995 to 2013
  • Address size 32 bit and 64 bit
  • Number of cores from 1 to 32 (1 to 16 per socket)
  • With and without hyperthreads
  • CPU clock frequency from 133 MHz to 3.467 GHz
  • RAM size from 26 MByte to 65.756 GByte
  • Linux kernel versions from 2.6.33 to 3.10
  • Native and virtualized systems

Quality assessment is based on continuous monitoring of a large number of variables and a threshold-based alarm and escalation mechanism. Quality assurance means that whenever an anomaly is detected, its origin is documented and the problem is (read: should be) fixed depending on its relevance.
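A threshold-based alarm with escalation can be sketched roughly as follows. This is only an illustration, not the QA Farm's actual monitoring code: the names `Threshold`, `classify` and `escalate`, the limit values and the "three consecutive non-ok states" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical warning and critical limits for one monitored variable."""
    warning: float
    critical: float


def classify(value, threshold):
    """Return 'ok', 'warning' or 'critical' for a single measurement."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def escalate(history, required=3):
    """Escalate only if the last `required` states were all non-ok (assumed rule)."""
    recent = history[-required:]
    return len(recent) == required and all(state != "ok" for state in recent)


# Made-up limits and samples for a worst-case latency variable, in microseconds.
latency_limit = Threshold(warning=50.0, critical=100.0)
samples = [23.0, 61.0, 70.0, 120.0]
states = [classify(value, latency_limit) for value in samples]

print(states)            # state of every sample
print(escalate(states))  # whether the alarm would be escalated
```

Requiring several consecutive violations before escalating is one common way to keep a single outlier from paging anyone; a real system would probably tune this per variable.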

Quality assessment

Quality assessment is the easier part. The figure to the left displays, for example, the results of more than 90 repeated latency tests, each based on 100 million wake-up cycles. The worst-case latency never exceeded 23 µs. Such quality assessment is included free of extra charge in the OSADL flat-rate service fee (one board per share). The board that delivered these excellent data was evaluated on behalf of an OSADL member who is using it successfully in industrial devices that rely on such outstanding hard real-time capabilities. It is running an Intel G 850 CPU at 2,900 MHz clock frequency. Initially, some minor fine-tuning was necessary to get interrupts originating from 3D graphics out of the way. Thereafter, the board was perfect and did not require any extra work to make it real-time compliant. Only six off-tree patches were applied in addition to the PREEMPT_RT patch; three of them have already been merged into later kernel versions, and the remaining three add some monitoring functionality to the kernel. None of them is related to real-time.
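To make the notion of a worst-case latency over many test runs concrete, here is a minimal sketch. It assumes that each run is available as a histogram in which the list index is the latency in microseconds and the value is the number of wake-up cycles observed at that latency; this data layout, the function names and the tiny sample data are assumptions for illustration, not the QA Farm's actual format.

```python
def worst_case_latency(histogram):
    """Return the highest latency bin (in µs) that holds at least one sample."""
    for latency in range(len(histogram) - 1, -1, -1):
        if histogram[latency] > 0:
            return latency
    raise ValueError("histogram contains no samples")


def summarize(runs):
    """Combine several runs into run count, total cycles and overall worst case."""
    maxima = [worst_case_latency(histogram) for histogram in runs]
    return {
        "runs": len(runs),
        "cycles": sum(sum(histogram) for histogram in runs),
        "worst_case_us": max(maxima),
    }


# Two made-up runs of 100 cycles each (real runs use 100 million cycles).
runs = [
    [0, 5, 90, 5, 0],
    [0, 10, 80, 9, 1],
]
summary = summarize(runs)
print(summary)
```

The overall worst case is the maximum over all runs, so a single slow wake-up in any one run determines the reported value.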

Quality assurance

Quality assurance is, by far, the harder part. As mentioned earlier, silent kernel crashes are the hardest thing to deal with, and there is no better way to get truly frustrated with kernel development than trying to debug a kernel that silently stops execution at random. Such a situation arose some time ago when a PREEMPT_RT kernel was run on an OMAP4 chip made by OSADL member Texas Instruments, the first multi-core ARM processor in the OSADL QA Farm. The phyCORE-OMAP4460 board was provided by OSADL member Phytec and was placed in rack #7/slot #7.

Everything worked well during short measurement periods, and the worst-case latency as a marker of the real-time performance was as good as could be. But when standard continuous monitoring with various load cycles was started, bad things came to us who were waiting and observing the board: it apparently stopped every three to five days on average, and only a cold reset brought it back to work. Any attempt to get debug output was doomed to fail. Unfortunately, the JTAG debugger crashed more often than the board, so it was nearly impossible to bring the board into a state where the JTAG debugger survived a board crash. It took nearly five months until such a condition was finally met and the JTAG interface could be used to investigate the origin of the crash. This was done using repeated halt and go commands to determine the position of the program counter, which gave the following result:

Repetitions   Program counter
44            0xc0491e00
50            0xc0491e08
56            0xc0491e10
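The sampling result can be summarized with a few lines of Python. The sketch below assumes that each repetition count is the number of halts that landed on the given address, and it derives the start of `__raw_spin_lock` from the `<+124>` offset that the article's disassembly listing prints next to 0xc0491e00; the names `SAMPLES`, `FUNC_BASE` and `summarize` are made up for this illustration.

```python
# Program counter values and repetition counts as reported in the table.
SAMPLES = {0xC0491E00: 44, 0xC0491E08: 50, 0xC0491E10: 56}

# Assumption: start address of __raw_spin_lock, derived from the "<+124>"
# offset shown for 0xc0491e00 in the disassembly listing.
FUNC_BASE = 0xC0491E00 - 124


def summarize(samples, base):
    """Return total halts, address span in bytes, and per-address rows."""
    total = sum(samples.values())
    rows = []
    for pc in sorted(samples):
        share = round(100 * samples[pc] / total, 1)
        rows.append((hex(pc), pc - base, share))
    span = max(samples) - min(samples)
    return total, span, rows


total, span, rows = summarize(SAMPLES, FUNC_BASE)
print(f"{total} halts within a {span}-byte window")
for address, offset, share in rows:
    print(f"{address}  +{offset}  {share}%")
```

All samples fall within a 16-byte window of a single function, which is the typical signature of a CPU spinning in a tight loop rather than being stopped or executing arbitrary code.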

Disassembling the code at these positions revealed:

0xc0491dfc <+120>: ldr r3, [r4]
0xc0491e00 <+124>: cmp r3, #0
0xc0491e04 <+128>: beq 0xc0491da8 <__raw_spin_lock+36>
0xc0491e08 <+132>: ldr r3, [r4, #4]
0xc0491e0c <+136>: cmp r3, #0
0xc0491e10 <+140>: bne 0xc0491dfc <__raw_spin_lock+120>

This code is called from the address space identifier (ASID) rollover mechanism, which works well on uniprocessor ARM CPUs and on mainline Linux but fails on real-time multi-core systems: the cores may wait for each other to complete the related interprocessor interrupt action and thus wait forever in a livelock. The window of this race condition is very narrow, which explains why the crash took so long to occur and required continuous high load to trigger. After the problem was fixed in mid-May 2013, the board immediately became stable, as shown in the impressive uptime graph below. The related posting of the report and the applied patch is here.
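The mutual wait can be illustrated with a deliberately simplified toy model. It is not the kernel code and not the applied patch: it merely assumes two cores that each send an interprocessor interrupt to the other and spin until it is acknowledged, and it treats "a spinning core cannot service incoming interrupts" as the failure condition. The function name `simulate` and its parameter are hypothetical.

```python
def simulate(irqs_enabled_while_waiting, max_steps=1000):
    """Toy model of two cores waiting for each other's interrupt acknowledgement.

    Returns the number of steps until both requests are acknowledged,
    or None if no progress is made within max_steps.
    """
    pending = [True, True]   # pending[i]: core i has an unserviced request
    acked = [False, False]   # acked[i]: the request sent by core i was serviced
    for step in range(max_steps):
        for core in (0, 1):
            other = 1 - core
            waiting = not acked[core]
            # A waiting core services its pending request only if interrupts
            # stay enabled while it spins (assumption of this toy model).
            if pending[core] and (irqs_enabled_while_waiting or not waiting):
                pending[core] = False
                acked[other] = True
        if all(acked):
            return step + 1
    return None


print(simulate(irqs_enabled_while_waiting=True))   # both cores make progress
print(simulate(irqs_enabled_while_waiting=False))  # neither core ever proceeds
```

In the second case each core keeps executing its wait loop without ever reaching the point where it could help the other one, which matches the observation that the program counter only moved within a few instructions.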

OK, compared to the repeated and frequent crashes before mid-May, this board works very well now. But aren't there many other chip and board manufacturers of multi-core ARM boards who are waiting for a miracle instead of acting? Shouldn't they also become OSADL members and have their boards fixed? Well, bad things come to those who wait.