Unaligned memory access fault on Cortex-M3

AKA A surprising thing that happened to me while porting Contiki to the STM32F1.
AKA Some steps to take when diagnosing an unexpected hard fault on ARM Cortex M3

I already have an STM32L1 port working (for the basic uses of Contiki), and the major difference with this port is that it should support pretty much any target that libopencm3 supports. So I made a new platform, tweaked the GPIO settings for the STM32F1, flashed it to my STM32VL Discovery board, and… it started, but then it crashed.

Program received signal SIGINT, Interrupt.
blocking_handler () at ../../cm3/vector.c:86
86	{
(gdb) bt
#0  blocking_handler () at ../../cm3/vector.c:86
#1  <signal handler called>
#2  update_time () at contiki/core/sys/etimer.c:72
#3  

Now, I don’t see unhandled exceptions much these days. I consulted the Configurable Fault Status Register (CFSR) at 0xE000ED28 and compared it to the definitions in ARM’s “Cortex-M3 Devices Generic User Guide” (search for the title to find the current location of that doc).

(gdb) x /wx 0xE000ED28
0xe000ed28:	0x01000000
(gdb) 

Ok, some bit in the top 16 bits. That’s the Usage Fault Status Register (UFSR). Let’s look at it a little closer, as a halfword on its own, because I can’t count hex digits in my head as well as some people.

(gdb) x /hx 0xE000ED2a
0xe000ed2a:	0x0100
(gdb)
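If you’d rather not count hex digits at all, the same split can be written down in C. This is a minimal sketch, not code from Contiki or libopencm3: the helper names are made up for illustration, and the bit position (UNALIGNED is bit 8 of the UFSR) is taken from the ARM user guide mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* CFSR (0xE000ED28) packs three status fields into one word:
 *   bits  7:0   MMFSR (MemManage faults)
 *   bits 15:8   BFSR  (BusFault)
 *   bits 31:16  UFSR  (UsageFault)
 */
#define UFSR_UNALIGNED (1u << 8) /* UNALIGNED bit within the UFSR halfword */

/* Hypothetical helper: pull the UFSR halfword out of a raw CFSR value. */
static uint16_t cfsr_to_ufsr(uint32_t cfsr)
{
    return (uint16_t)(cfsr >> 16);
}

/* Hypothetical helper: true if the UFSR reports an unaligned access. */
static bool ufsr_is_unaligned(uint16_t ufsr)
{
    return (ufsr & UFSR_UNALIGNED) != 0;
}
```

Feeding it the value read above, cfsr_to_ufsr(0x01000000) gives 0x0100, which matches the halfword read at 0xE000ED2a, and ufsr_is_unaligned() on that value is true.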

Ok. That’s bit 8 of the UFSR, UNALIGNED: an unaligned access UsageFault. Awesome. One of the big selling points of the ARM Cortex-M3 is that it doesn’t care about alignment. It all “just works”. Well, except for this footnote: "Unaligned LDM, STM, LDRD, and STRD instructions always fault irrespective of the setting of UNALIGN_TRP"

Ok, so let’s see what caused that. GDB “up” two times to get to the stack frame before the signal handler. x /i $pc is some magic to disassemble the instruction at the address pointed to by $pc.

(gdb) up
#1  <signal handler called>
(gdb) up
#2  update_time () at contiki/core/sys/etimer.c:72
72	      if(t->timer.start + t->timer.interval - now < tdist) {
(gdb) x /i $pc
=> 0x80005c6 :	ldmia.w	r3, {r1, r4}
(gdb) info reg
r0             0x7d2	2002
r1             0x393821d9	959979993
r2             0x39381a07	959977991
r3             0x29d0fb29	701561641
r4             0x20000dc4	536874436
r5             0x2000004c	536870988
r6             0x0	0
r7             0x14	20
r8             0x20001f74	536878964
r9             0x20000270	536871536
r10            0x800c004	134266884
r11            0xced318f5	-825026315
r12            0x0	0
sp             0x20001fb8	0x20001fb8
lr             0x80005b9	134219193
pc             0x80005c6	0x80005c6 
xpsr           0x21000000	553648128
(gdb) 

Check it out. There’s an ldm instruction, and r3, its base address, is clearly not word aligned. (It doesn’t even look like a valid pointer to SRAM, but we’ll ignore that for now.) Ok, so we got an unaligned access, and we know where. But what the hell?!

Let’s look at the C code again. That t->timer is all struct stuff. Perhaps there are some packed uint8_t fields or something, maybe some “optimizations” for 8-bit micros. Following the chain, struct etimer contains a struct process, which contains a struct pt, which contains an lc_t. And only the lc_t. Which is an unsigned short. I guess there are some delicious C rules here about promotion and types and packing. There’s always a rule.
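The “clearly not aligned” call on r3 is nothing more than a look at the low two bits of the address: ldm wants a base address that is a multiple of 4. A minimal sketch of that check in C, with a helper name (is_word_aligned) that I’ve made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of 4, i.e. usable as
 * the base address of an LDM/STM without an unaligned UsageFault. */
static bool is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}
```

With the register dump above, is_word_aligned(0x29d0fb29) is false (the low two bits are 01), while the sp value 0x20001fb8 passes.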

Changing lc_t from an unsigned short to an unsigned int and rebuilding stops it from crashing. Excellent. Not. It does make the image a little bigger though: text is unchanged, but data grows by 80 bytes.

karlp@tera:~/src/kcontiki (master *+)$ cat karl-size-short 
   text	   data	    bss	    dec	    hex	filename
  51196	   2836	   3952	  57984	   e280	foo.stm32vldiscovery
karlp@tera:~/src/kcontiki (master *+)$ cat karl-size-uint 
   text	   data	    bss	    dec	    hex	filename
  51196	   2916	   3952	  58064	   e2d0	foo.stm32vldiscovery
karlp@tera:~/src/kcontiki (master *+)$

I’m not the first to hit this, but it certainly doesn’t seem to be very common. Apparently you should be able to use -mno-unaligned-access with gcc to force it to do potentially unaligned accesses bytewise, but that’s a pretty crap option, and it doesn’t seem to work for me anyway. Some people feel this is a gcc bug, some people feel it’s “undefined behaviour”. I say it’s “unexpected behaviour” :) In this particular case, there’s no casting of pointers, and no use of any sort of “packed” attributes on any of the structs, so I’d lean towards saying this is a compiler problem, but, as they say, it’s almost never a compiler problem :)

Here are some links to other discussion about this. (complete with “MORON! COMPILERS ARE NEVER WRONG” type of helpful commentary :)

I’m still not entirely sure of the best way of proceeding from here. I’m currently using GCC version arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 4.7.3 20121207 (release) [ARM/embedded-4_7-branch revision 194305], and I should probably try the 4.7-2013-q1-update release. But if this is deemed to be “user error”, then it’s a matter of working out other ways of modifying the code so that it stays small where possible, but still works for everyone.

Not entirely what I’d planned on doing this evening, but somewhat enlightening at least.

  1. You might have been lucky that an “ldm” instruction triggered the fault and not a store-type instruction.

    On Cortex-M3/4/7 the write buffer is enabled by default, which can delay the triggering of a fault on a store and make debugging next to impossible, at least in my understanding.

    Here is a good blog post on this DISDEFWBUF issue:
    http://chmorgan.blogspot.de/2013/06/debugging-imprecise-bus-access-fault-on.html
