Note: The mercurial server is disabled at the moment while I investigate whether it can run with an acceptably low CPU load – Mike.
We're going on a bug hunt! (Digital Systems)
The program lab4-microbian/heart.c
works if compiled on the lecturer's machine (running Debian Stretch with compiler A – arm-none-eabi-gcc
at version 5.4.1) but not if compiled on the lab machines (running Fedora something with Compiler B – arm-none-eabi-gcc
version 7.1.0). The program is supposed both to display a beating heart pattern and to print a list of primes. When compiled on the lab machines, it prints
Timer got bad message
on the serial port and shows the Seven Stars of Death. How can we fix this?
Examining the patient
The first impulse is to blame the compiler: the program works when compiled with one version of the compiler, and stops working when we switch to a different one. Clearly, the new compiler version is generating bad code for the timer process. This hypothesis is attractive because it means the bug is not our fault.
Looking at the timer process shows no obvious faults. Adding to the panic()
call in the default arm of the switch
statement (Aren't we glad we included it?) shows that the values of m.m_type
and m.m_sender
are both nonsense, and not the values INTERRUPT = 1
and HARDWARE = -1
that we expected. What's more, a few experiments with disabling parts of it all seem to make the bug go away: if we delete the call to check_timers()
from the process body, the program no longer crashes (though no timers will ever go off). Also, glancing at the object code shows that the compiler has inlined the local function check_timers()
into the body of timer_task()
. Telling it not to do that, by adding the annotation
static void __attribute((noinline)) check_timers(void) { ...
to the heading of check_timer()
also makes the problem disappear (I had to check the GCC manual for that). Now we want to believe that it really is the compiler's fault: there's surely a bug in the part of the compiler that does inlining. Not so fast! The compiler may be generating correct code, but naturally enough, code that only works if it finds itself in the proper environment. For example, code that accesses the stack frame will only work if the stack pointer has a sensible value.
Isolating the problem
If the compiler is at fault, we should be able to identify the problem in the machine code it outputs. We can ask the compiler to produce its output in assembly language, so that we can study it, and so that we can try assembling it separately. This is helpful, because at the moment Compiler A runs on one machine and Compiler B on another. Compiling to assembly language on Machine B, copying the resulting file timer.s
to Machine A, and combining it with other part of the program compiled on Machine A results in a program that still fails. Using instead a file timer.s
prepared on Machine A gives a program that does not fail. Good! The problem will be revealed by comparing the two versions of timer.s
. If we cannot find the bug in B's version of timer.s
, then the compiler is not at fault. Generating assembly language and then assembling it separately should reliably enable us to reproduce the problem, because in fact that is what the compiler itself does. The program called gcc
(or arm-none-eabi-gcc
) is just a wrapper (or "shell") that invokes a compiler-to-assembly-language and the assembler and the linker as separate programs.
One worthwhile approach at this point involves two printouts, a box of sharp pencils, and a cup of tea. Also good is a side-by-side listing of the two files generated with diff -y
. Although the two compiler versions are numerically some distance apart, we are lucky to find that the object code is not drastically different in trivial ways. What differences can we see?
First, the boilerplate at the start of the file has been tidied up a bit. Since (as we know) both compilers are capable of generating code that works, so differences here should not be important.
Compiler A Compiler B ========== ========== .syntax unified < .cpu cortex-m0 .cpu cortex-m0 .fpu softvfp < .eabi_attribute 20, 1 .eabi_attribute 20, 1 ... ... .eabi_attribute 21, 1 .eabi_attribute 21, 1 .eabi_attribute 34, 0 .eabi_attribute 34, 0 .eabi_attribute 18, 4 .eabi_attribute 18, 4 .thumb < .syntax unified < .file "timer.c" .file "timer.c" .text .text .Ltext0: .Ltext0: .cfi_sections .debug_frame .cfi_sections .debug_frame .align 2 | .align 1 .global timer_task .global timer_task > .syntax unified .code 16 .code 16 .thumb_func .thumb_func > .fpu softvfp
Interestingly, the new compiler only aligns code on a 2-byte boundary, but that's not likely to be the problem.
Next, we see that Compiler B has learned a new trick for saving the high registers.
mov r7, r10 | mov lr, r10 mov r6, r9 | mov r7, r9 mov r5, r8 | mov r6, r8 push {r5, r6, r7} | push {r6, r7, lr}
If the trick works sometimes, it very likely works all of the time. Actually, we can verify that minor things like this are not part of the problem, by slowly editing version A of the program to be more like version B, and testing incrementally whether the bug has appeared. I like to work under a version control system, so that I can always see what I've changed and rewind to the start, take checkpoints from time to time by copying files, and work in an editor with plenty of Undo depth so I can recover from minor mistakes. And indeed, we find that this way of saving registers is not the problem.
What's next? We see that the two compilers have invented different labels for the literal table that comes at the end of the function. The lines saying .loc 1 65 0
, etc., are line number information for the debugger, but they help us find the source lines that are being implemented here (they are setting up the hardware registers of the timer), and they provide a useful synchronisation point for the side-by-side listing.
.loc 1 65 0 .loc 1 65 0 movs r3, #1 movs r3, #1 ldr r2, .L20 | ldr r2, .L22 str r3, [r2] str r3, [r2] .loc 1 66 0 .loc 1 66 0 movs r2, #0 movs r2, #0 ldr r1, .L20+4 | ldr r1, .L22+4 str r2, [r1] str r2, [r1] .loc 1 67 0 .loc 1 67 0 ldr r1, .L20+8 | ldr r1, .L22+8 str r2, [r1] str r2, [r1] .loc 1 68 0 .loc 1 68 0 > ldr r2, .L22+12 movs r1, #4 movs r1, #4 ldr r2, .L20+12 < str r1, [r2] str r1, [r2]
You can see that the compilers have generated equivalent code, with a minor difference in the order of operations. No crucial differences yet.
The next part of the comparison seems to show code generated by Compiler B that has no counterpart in the code from Compiler A.
.loc 1 27 0 .loc 1 27 0 movs r7, r3 | movs r5, r3 .L2: | b .L2 > .L4: .LBE7: .LBE7: .LBE6: .LBE6: > .loc 1 83 0 > ldr r3, .L22+44 > ldr r3, [r3] > cmp r3, #0 > bne .L20 > .L6: > .loc 1 88 0 > movs r0, #9 > bl clear_pending > .LVL4: > .loc 1 89 0 > movs r0, #9 > bl enable_irq > .LVL5: > .L2: .loc 1 79 0 .loc 1 79 0
Closer inspection reveals, however, that similar code does appear on both sides, but in a different order. Compiler B is just a bit more enthusiastic than Compiler A in re-ordering the program so that the tangle of jumps and labels is simplified a bit. What is easy to miss is that in the first couple of instructions, Compiler B is using register r7
when Compiler A has used r5
. Since registers are, from the hardware point of view, identical in function, and registers r4
--r7
all have the same usage rules from a software point of view, this difference ought not to matter.
Similar differences in register usage occur in the next part of the code.
.loc 1 79 0 .loc 1 79 0 add r5, sp, #16 | add r4, sp, #16 movs r1, r5 | movs r1, r4 ldr r0, .L20+44 | ldr r0, .L22+48 bl receive bl receive .LVL4: | .LVL6: .loc 1 81 0 .loc 1 81 0 ldrh r3, [r5] | ldrh r3, [r4] cmp r3, #1 cmp r3, #1 beq .L4 beq .L4 cmp r3, #3 cmp r3, #3 beq .L5 beq .L5
What we see here is the code that calls receive(ANY, &m)
, then enters a switch
statement based on the message type m.m_type
. The arguments of receive()
are the value ANY = 999
in r0
and the address of m
in r1
. Since m
lives in the stack frame, its address is computed by adding an offset to the stack pointer, and to save repeating that calculation in order to access m.m_type
later, the address is saved in a register across the call the receive()
– register r5
in one program, and register r4
in the other. You can see after the call, the code retrieves a two-byte value from the message (with ldrh
), then tests whether it is equal to 1 (INTERRUPT
) or 3 (REGISTER
).
Closing the gap
Apart from trivial points (which we can edit away if they are worrying), the main difference between the two programs is in the use of registers. What if some but not all registers were being trashed at some point, and program B saved a crucial value in a register that was trashed, whereas program A used a register that was not trashed at the same point? We can experiment with this in two ways.
First, we can change program A to use the same registers as program B. In fact, we can make a table that shows the correspondence between registers used by program A and those used by program B.
Program A Program B r5 r4 r6 r7 r4 r6 r7 r5
(the correspondence extends to registers r8
, r9
, ...) You could systematically edit the code with this substitution, but cyclic permutations of renaming are always a bore to do in a text editor. A better idea is to replace actual registers with symbolic names, so r4
becomes rr4
and r5
becomes rr5
, etc., throughout all the code (except for the boilerplate at the start of the function). Then we can write some macro definitions:
#define rr4 r6 #define rr5 r4 #define rr6 r7 #define rr7 r5
and adjust the register assignment at will, using the C preprocessor to make the substitutions for us. We quickly find that this reveals the problem: permuting registers r4
to r7
to match Program B causes the bug to appear; and in fact just swapping registers r4
and r5
is enough.
Let's look again at the code surrounding receive()
– I'm suspicious about that, because register values are preserved across operating system calls by our context switch code, not by the compiler's normal prologue and epilogue boilerplate. The process also contains a call to send
, but I don't think that's executed before the crash; there's also a call to connect
, but that seems (superficially at least) to complete successfully.
.loc 1 79 0 .loc 1 79 0 add r5, sp, #16 | add r4, sp, #16 movs r1, r5 | movs r1, r4 ldr r0, .L20+44 | ldr r0, .L22+48 bl receive bl receive .LVL4: | .LVL6: .loc 1 81 0 .loc 1 81 0 ldrh r3, [r5] | ldrh r3, [r4] cmp r3, #1 cmp r3, #1 beq .L4 beq .L4 cmp r3, #3 cmp r3, #3 beq .L5 beq .L5
It's vital to the correctness of this code that r5
(or r4
) is preserved across the call, but in fact the value preserved is easily recomputed after it. Let's try replicating the add
instruction after the call (and of course trying this both using r5
and using r4
).
.loc 1 79 0 add r5, sp, #16 movs r1, r5 ldr r0, .L20+44 bl receive .LVL4: .loc 1 81 0 add r5, sp, #16 @@ Replicated instruction ldrh r3, [r5] cmp r3, #1 beq .L4 cmp r3, #3 beq .L5
Bingo! The problem disappears, even if we are using r4
in place of r5
.
Delving into details
We haven't used a debugger yet. Why not? Because we have had no hypothesis to test! The debugger would let us look (in a high level way) at the values of variables in the timer_task()
function, but we would just see that the values are indeed nonsense. Debugging at the machine level would let us see great screeds of numbers, like the green-on-black screens of The Matrix. But that really wouldn't help. But now we have a hypothesis:
Register
r4
is not preserved across calls toreceive()
.
Or, refining that a bit,
Register
r4
is not preserved across calls toreceive()
, when the call is completed by the arrival of an interrupt.
Now is the time to get out the debugger.
- Edit the program so the primes and UART part is disabled.
- Set a breakpoint at
default_handler
inmpx-m0.s
. - Let her rip!
The breakpoint is soon reached. We can single-step instruction by instruction (stepi
), or we can step over subroutine calls (nexti
).
default_handler: push {lr} bl isave mrs r1, ipsr @ Get exception reason subs r1, #16 @ Convert to range [-16 .. 32) bl intr_handler bl irestore pop {pc}
There's a bit of code there that fetches a view of the psr
containing the interrupt number, then subtracts 16 so that internal interrupts occupy the range [-16..0) and external ones are in the range [0..32). The value we see after this is 9, and that's correct for timer 1. We can follow the code along into the operating system, which calls make_ready
first on device driver process for the timer, then on the the current process. Then there's a call to choose_proc
that picks the device driver to run next, and we're back in default_handler
again. We can watch as irestore
restores the context of timer_task
, including a value for r4
that is plausible as an address in its stack frame. Then default_handler
returns ... to something called wrap_default
. The disassembled code looks like this:
000011f4 <wrap_default>: 11f4: b510 push {r4, lr} 11f6: f7ff fe9e bl f36 <default_handler> 11fa: bd10 pop {r4, pc}
This is a trivial function from startup.c
, included so that interrupt vectors can be redirected to a default handler configured at link time. And look, it saves register r4
on entry and restores it on exit. Normally, this is benign, and the compiler has done it so that the number of registers saved is even, because of requirements of the calling convention that are not really relevant to the Cortex-M0. But for us it is fatal, because the r4
value being restored here is the one that was present when the interrupt happened, not the one from before the call to receive()
that was just painstakingly restored by irestore()
. There's the bug.
The fix
Now that the problem is properly understood, it is not hard to fix. The mechanism surrounding wrap_default
works well enough for programs that make ordinary use of interrupts, but it won't do for the bottom layer of our operating system, where an interrupt can save context for one process, then in returning restore the context of another one. For them, it's necessary for the interrupt vector to lead directly to the micro:bian interrupt handler, without any intervening saving and restoring of registers. It's a simple matter to make a vector table where all the interrupts (or almost all) point to the same handler directly.
Some closing reflections: the temptation to suspect a compiler bug was a strong one, but unjustified. Yielding to the temptation might just have wasted some time, but it might also have led to a poor decision to disable the inlining and leave it at that. This program would have worked, but the latent bug would have remained and might re-emerge in the future. In a concurrent program, the flow of control can be unpredictable, depending on timing. In micro:bian, even if a process is not usually awakened by an interrupt, that can happen on rare occasions, and a bug like the one we've just fixed could then cause a fault that would be very hard to diagnose.