AKA A surprising thing that happened to me while porting Contiki to the STM32F1.
AKA Some steps to take when diagnosing an unexpected hard fault on ARM Cortex M3
I already have an STM32L1 port working (for the basic uses of Contiki), and the major difference with this port is that it should support pretty much any target that libopencm3 supports. So I made a new platform, tweaked the GPIO settings for the STM32F1, flashed it to my STM32VL Discovery board, and… it started, but then it crashed.
    Program received signal SIGINT, Interrupt.
    blocking_handler () at ../../cm3/vector.c:86
    86 {
    (gdb) bt
    #0 blocking_handler () at ../../cm3/vector.c:86
    #1 <signal handler called>
    #2 update_time () at contiki/core/sys/etimer.c:72
    #3
Now, I don’t see unhandled exceptions much these days. I consulted the Configurable Fault Status Register (CFSR) at 0xE000ED28 and compared that to the definitions in ARM’s “Cortex M3 Devices Generic User Guide” (link will Google search to the current location of that doc).
    (gdb) x /wx 0xE000ED28
    0xe000ed28: 0x01000000
    (gdb)
Ok, some bit in the top 16 bits. That’s the Usage Fault Status Register (UFSR). Let’s look at it a little closer because I can’t count hex digits in my head as well as some people.
    (gdb) x /hx 0xE000ED2a
    0xe000ed2a: 0x0100
    (gdb)
Ok. That bit (UNALIGNED, bit 8 of the UFSR) means “Unaligned access UsageFault”. Awesome. One of the big selling points of ARM Cortex-M is that it doesn’t care about alignment. It all “just works”. Well, except for this footnote: "Unaligned LDM, STM, LDRD, and STRD instructions always fault irrespective of the setting of UNALIGN_TRP"
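For anyone who also can’t count hex digits in their head, here is a minimal C sketch of the same decoding. The bit positions follow ARM’s Cortex-M3 Devices Generic User Guide; the helper names (`cfsr_mmfsr`, `cfsr_bfsr`, `cfsr_ufsr`) and the `UFSR_*` macros are made up for this illustration and are not from libopencm3 or CMSIS.

```c
#include <stdint.h>

/* CFSR layout on Cortex-M3:
 *   bits  0-7   MMFSR (MemManage Fault Status Register)
 *   bits  8-15  BFSR  (BusFault Status Register)
 *   bits 16-31  UFSR  (UsageFault Status Register)
 * Names below are illustrative, not taken from any vendor header. */
#define UFSR_UNDEFINSTR (1u << 0)
#define UFSR_INVSTATE   (1u << 1)
#define UFSR_INVPC      (1u << 2)
#define UFSR_NOCP       (1u << 3)
#define UFSR_UNALIGNED  (1u << 8)
#define UFSR_DIVBYZERO  (1u << 9)

/* Extract the three sub-registers from a raw CFSR value. */
static inline uint8_t cfsr_mmfsr(uint32_t cfsr)
{
    return (uint8_t)(cfsr & 0xffu);
}

static inline uint8_t cfsr_bfsr(uint32_t cfsr)
{
    return (uint8_t)((cfsr >> 8) & 0xffu);
}

static inline uint16_t cfsr_ufsr(uint32_t cfsr)
{
    return (uint16_t)(cfsr >> 16);
}
```

Feeding in the value read above, `cfsr_ufsr(0x01000000)` gives `0x0100`, which has only `UFSR_UNALIGNED` set, and both `cfsr_mmfsr` and `cfsr_bfsr` return 0.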
Ok, so let’s see what caused that. GDB “up” two times to get to the stack frame before the signal handler. x /i $pc is some magic to decode the memory at the address pointed to by $pc.
    (gdb) up
    #1 <signal handler called>
    (gdb) up
    #2 update_time () at contiki/core/sys/etimer.c:72
    72 if(t->timer.start + t->timer.interval - now < tdist) {
    (gdb) x /i $pc
    => 0x80005c6 : ldmia.w r3, {r1, r4}
    (gdb) info reg
    r0    0x7d2       2002
    r1    0x393821d9  959979993
    r2    0x39381a07  959977991
    r3    0x29d0fb29  701561641
    r4    0x20000dc4  536874436
    r5    0x2000004c  536870988
    r6    0x0         0
    r7    0x14        20
    r8    0x20001f74  536878964
    r9    0x20000270  536871536
    r10   0x800c004   134266884
    r11   0xced318f5  -825026315
    r12   0x0         0
    sp    0x20001fb8  0x20001fb8
    lr    0x80005b9   134219193
    pc    0x80005c6   0x80005c6
    xpsr  0x21000000  553648128
    (gdb)
Check it out. There’s an ldm instruction. And r3 is clearly not aligned. (It doesn’t even look like a valid pointer to SRAM, but we’ll ignore that for now.) Ok, so we got an unaligned access, and we know where. But what the hell?! Let’s look at the C code again. That t->timer is all struct stuff. Perhaps there’s some packed uint8_ts or something, maybe some “optimizations” for 8-bit micros. Following the chain, struct etimer contains a struct process, which contains a struct pt, which contains a lc_t. And only the lc_t. Which is an unsigned short. I guess there’s some delicious C rules here about promotion and types and packing. There’s always a rule.
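Going back to the register dump for a moment, the two remarks about r3 (not aligned, and not a plausible SRAM pointer) can be checked mechanically. This is a small sketch, not anything from Contiki or libopencm3: `is_aligned` and `in_sram` are hypothetical helpers, and the SRAM range is an assumption based on the STM32F100RB fitted to the STM32VL Discovery (8 KiB starting at 0x20000000).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: STM32F100RB on the STM32VL Discovery, 8 KiB of SRAM. */
#define SRAM_BASE 0x20000000u
#define SRAM_SIZE (8u * 1024u)

/* True if addr is a multiple of align. align must be a power of two. */
static inline bool is_aligned(uint32_t addr, uint32_t align)
{
    return (addr & (align - 1u)) == 0u;
}

/* True if addr falls inside the assumed SRAM range. */
static inline bool in_sram(uint32_t addr)
{
    return addr >= SRAM_BASE && (addr - SRAM_BASE) < SRAM_SIZE;
}
```

With the values from the dump, `is_aligned(0x29d0fb29, 4)` is false and `in_sram(0x29d0fb29)` is false, while r4 (`0x20000dc4`) and sp (`0x20001fb8`) pass both checks.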
Changing the type of lc_t to an unsigned int instead of a short and rebuilding stops it from crashing. Excellent. Not. It does make the image a little bigger though: text is unchanged, but data grows by 80 bytes.
    karlp@tera:~/src/kcontiki (master *+)$ cat karl-size-short
       text    data     bss     dec     hex filename
      51196    2836    3952   57984    e280 foo.stm32vldiscovery
    karlp@tera:~/src/kcontiki (master *+)$ cat karl-size-uint
       text    data     bss     dec     hex filename
      51196    2916    3952   58064    e2d0 foo.stm32vldiscovery
    karlp@tera:~/src/kcontiki (master *+)$
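One plausible reason that only the data size moves is struct padding: widening a 2-byte member to 4 bytes can push the members after it into a new word. The sketch below uses two made-up structs (`demo_short` and `demo_int`); they are stand-ins to show the effect and are not Contiki’s actual `struct process` or `struct pt` definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real Contiki structs. */
struct demo_short {
    uint32_t       header;    /* 4 bytes */
    unsigned short lc;        /* 2 bytes, like lc_t as a short */
    uint8_t        state;     /* 1 byte */
    uint8_t        needspoll; /* 1 byte, fills the word exactly */
};

struct demo_int {
    uint32_t     header;      /* 4 bytes */
    unsigned int lc;          /* 4 bytes, like lc_t as an int */
    uint8_t      state;       /* 1 byte */
    uint8_t      needspoll;   /* 1 byte, then 2 bytes of tail padding */
};

/* How many bytes each instance grows by when lc is widened. */
static inline size_t demo_growth(void)
{
    return sizeof(struct demo_int) - sizeof(struct demo_short);
}
```

On common ABIs with a 2-byte short and a 4-byte int (including ARM EABI), `struct demo_short` is 8 bytes and `struct demo_int` is 12, so each statically allocated instance costs 4 more bytes. Whether that fully accounts for the 80 bytes here is a guess.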
I’m not the first to hit this, but it certainly doesn’t seem to be very common. Apparently you should be able to use -mno-unaligned-access with gcc to force it to do everything bytewise, but that’s a pretty crap option, and it doesn’t seem to work for me anyway. Some people feel this is a gcc bug, some people feel it’s “undefined behaviour”. I say it’s “unexpected behaviour” :) In this particular case, there’s no casting of pointers, and no use of any sort of “packed” attributes on any of the structs, so I’d lean towards saying this is a compiler problem, but, as they say, it’s almost never a compiler problem :)
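For contrast, this is the usual portable pattern for the pointer-casting case, where code does want to read a word from an arbitrary byte offset. It is a generic sketch and not a fix for the fault in this post, which involves no casts; `load_u32` and `store_u32` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte address without ever forming an
 * unaligned uint32_t pointer. The compiler is free to pick whatever
 * access sequence is legal for the target. */
static inline uint32_t load_u32(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Write a 32-bit value to any byte address, same idea in reverse. */
static inline void store_u32(void *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value);
}
```

The point of going through `memcpy` is that the code never promises the compiler an alignment the pointer does not have, so the compiler has no licence to merge the access into something like an `ldm`.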
Here are some links to other discussion about this. (complete with “MORON! COMPILERS ARE NEVER WRONG” type of helpful commentary :)
- Lots of hate, related to memcpy (Keil apparently patched their supplied memcpy to avoid this problem)
- This one uses casting of pointer types
- Discussion about initial gcc support for using the “feature” of unaligned accesses
I’m still not entirely sure of the best way of proceeding from here. I’m currently using GCC version arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 4.7.3 20121207 (release) [ARM/embedded-4_7-branch revision 194305], and I should probably try the 4.7-2013-q1-update release. But if this is deemed to be “user error”, then it’s a matter of working out other ways of modifying the code so that it stays small where possible, but still works for everyone.
Not entirely what I’d planned on doing this evening, but somewhat enlightening at least.
You might have been lucky that an “ldm” instruction triggered the fault and not a store-type instruction.
On Cortex M3/4/7 the write buffer is enabled by default, which can delay the triggering of the fault and make debugging next to impossible, at least in my understanding.
Here is a good blog post on this DISDEFWBUF issue:
http://chmorgan.blogspot.de/2013/06/debugging-imprecise-bus-access-fault-on.html
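As a rough sketch of what that setting looks like in code: on the Cortex-M3, DISDEFWBUF is bit 1 of the Auxiliary Control Register (ACTLR) at 0xE000E008, per ARM’s Cortex-M3 documentation. The function name below is made up, and the register pointer is passed in as a parameter only so the logic can be exercised off-target. Other cores may differ, so check the manual for yours.

```c
#include <stdint.h>

/* Cortex-M3 Auxiliary Control Register and its DISDEFWBUF bit. */
#define ACTLR_ADDR       0xE000E008u
#define ACTLR_DISDEFWBUF (1u << 1)

/* Hypothetical helper: set DISDEFWBUF so that buffered writes fault
 * precisely, leaving the other ACTLR bits untouched. */
static inline void disable_default_write_buffer(volatile uint32_t *actlr)
{
    *actlr |= ACTLR_DISDEFWBUF;
}
```

On the target this would be called early in startup as `disable_default_write_buffer((volatile uint32_t *)ACTLR_ADDR);`. It slows stores down, so it is probably best kept to debug builds.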