This is a short post about stuff I learnt while programming the AVR processors by atmel. It is mostly a write-up so I won’t forget things 2 weeks down the line 🙂
I have built a MIDI controller based on the atmega8 processor by atmel. The processor itself has 1 kB of RAM and 8 kB of ROM. 1kB of that ROM is used by the bootloader, which is used to transmit new firmware images over sysex. As the sysex decoding and checksumming is quite complex, I could just barely fit the bootloader in those 1024 bytes.
I first wrote the bootloader in C, using the avr-gcc compiler (version 4.2.0). Sadly, the resulting code was just slightly over the 1024 bytes boundary I had to meet (you can only program the bootloader start adress on 512 bytes, 1024 bytes or 2048 bytes). That led to my first dabbling in optimizing gcc created code. I have to admit I don’t start writing assembly from scratch, even now I write the main framework in C, and then rewrite specific functions in assembler. You can see the C source code of the bootloader here (though it seems this version fits in 1024 bytes, I must have done something to it too) : bootloader.c . I won’t go into the bootloader details in this post (this will be the subject of a later post). Here is the assembler version: bootloader-asm.s .
First thing when trying to optimize for size is checking the size of the file. I do this using the tool avr-size:
stechapfel:midi-kontrol manuel$ avr-size bootloader.o text data bss dec hex filename 902 12 4 918 396 bootloader.o
This tells us that the .text section (program code) of the file is 902 bytes big, the .data section (initialized section) is 12 bytes, and the .bss (uninitialized data section) is 4 bytes big. The bss does not need to be stored in the ROM, so the final space taken up in ROM by our file is 924 bytes. While working on the main firmware, I wrote myself a small script to calculate how many ROM bytes I had left out of the 7 kB available:
stechapfel:midi-kontrol manuel$ cat left.sh #!/bin/sh avr-size $1 | tail -1 | cut -c1-20 | (read i j&& echo $((1024 * 7 - (i + j))) "bytes left")
This way, I can always check how much space is available, for example:
stechapfel:midi-kontrol manuel$ ./left.sh midi-kontrol.hex 36 bytes left
which shows that my firmware is pretty tight indeed.
Once we have the size information about the complete firmware, we need to check which functions are biggest, in order to know which should be optimized first. I do this using the avr-nm utility:
stechapfel:midi-kontrol manuel$ avr-nm --size-sort bootloader.o 00000001 B block_cnt 00000001 B in_sysex 00000001 C sysex_cnt 00000002 B jump_to_app 00000006 D ack_msg 00000006 D nak_msg 00000008 T bl_uart_putc 0000000a T bl_uart_getc 00000018 T midi_sysex_send_ack 00000018 T midi_sysex_send_nak 0000001c T jump_to_main_program 0000001e T write_firmware_checksum 00000036 T handle_sysex 0000003a T check_firmware_checksum 0000003e T write_checksum 00000040 C sysex_data 00000046 T make_word 00000064 C data 00000070 T main 0000007e T handle_midi 00000128 T write_block
The first column shows the size of a symbol in hexadecimal, the second column shows the section of the symbol (T is text, C is unitialized, B is bss, and D is data). This shows which function might lead itself to optimization. For the bootloader, I just had to scrape off a few bytes to make it fit, so I just skimmed over the assembly source and kind of “peephole-optimized” by hand. You can see in the bootloader-asm.s that it is not hand-written assembler, but the output from gcc which has been modified. To generate the assembler listing out a C file, use the -fverbose-asm -S flags for gcc, along with the other flags you have. I wrote a Makefile rule to make just that:
%.s: %.c $(CC) -S $(CFLAGS) -fverbose-asm $< -o $@
I also generate an assembler listing along with every .c file I compile to check things later on. Here is my normal compile rule:
%.o: %.c Makefile $(CC) $(CFLAGS) -Wa,-adhlns=$@.lst -c $< -o $@ %.o: %.s Makefile $(CC) $(CFLAGS) -Wa,-adhlns=$@.lst -c $< -o $@
Let's look at the code avr-gcc generates for the bootloader.c file: bootloader.o.lst. As you can see gcc includes the original sourcecode along with the generated assembler. Sometimes they don't line up quite well, so you have to read between the lines.
Let's take a bit of time to explain the register usage of avr-gcc. The atmega8 has 32 registers, from r0 to r31. Each register is 8 bit. C data types are: char 8 bit, int 16 bits, long 32 bits, long long 64 bits.
r0 and r1 are fixed registers: r0 is the temporary registers, which can be overwritten by any function (except interrupts). Use it to store temporary stuff. r1 is the zero register and should always be cleared (you can use it for something else temporarily, but must make sure it's clear when you're done). This has bitten me a few times, because the mul operand writes into r1, and I would forget to clear it, wondering what weird things my avr would do afterwards.
r18-r27 and r30-r31 are "call-used registers", you can use them for whatever you want in your own functions, but must be aware that they can be clobbered by any C function you call. These registers are very useful because your own functions can actually clobber them too without taking care of them. You don't have to push them on entering your function, and popping them on leaving. Also, when calling your own assembly routines, you know which are going to get clobbered and which don't. The main work around optimizing for space is working within the boundaries of the call-used registers. Each register that you have to push/pop will take up 4 bytes of ROM (actually a bit less if you use a jumpable epilogue, which I will get into later).
r2-r17 and r28-r29 are call-saved registers, which you have to push on entry and pop on exit of your function. If your function is only called by assembler code, you can forego these limitations and track register usage by hand. The advantage of these registers is that once you are using them, you can use them to store data and call C functions without having to worry about them.
Function calls use the register pairs r25:r24, r23:r22, r21:r20 down to r9:r8. Additional arguments are passed on the stack. Return values are passed in r25:r24. The register pairs r27:r26, r29:r28 and r31:r30 can be used to indirectly address memory. This is very important because clever array addressing loops are the main way to optimize C code for space.
Now that this is clear, let's go through the bootloader.s file, and I'll write down what I notice. It's been 1 month since I actually optimized that file, so I don't remember what I exactly did.
The first defined function is check_firmware_checksum. I'll go into a bit of detail to show how a function is defined in assembler so that the avr-* tools can work with it.
101 .section .text.check_firmware_checksum,"ax",@progbits 103 .global check_firmware_checksum 105 check_firmware_checksum:
The .section directive is important and is used to put each function into its own section. This is used later on by the linker with the --gc-sections option, which removes every non-referenced function from the final firmware file. .global declares the symbol as a global symbol so that other files can link against it. Finally check_firmware_checksum: defines the actual function entry point. THe size information is now stored in .stabd sections if I understand this correctly, however I use the .size check_firmware_checksum,.-check_firmware_checksum directive at the end of the function. It defines the size of the symbol check_firmware_checksum as the difference between the current address and the function start address. This is important so that you can check the size of your assembler functions. By the way, __eeprom_read_word_1C1D1E doesn't follow normal C calling semantics, in case you were wondering.
By going through the first functions, we see that avr-gcc does a quite good job at optimizing for space. In midi_sysex_send_ack we see how avr-gcc correctly replaces the loop index by a check on the incremented pointer. In make_word, we can use our first surgical code excision:
499 0024 E62F mov r30,r22 500 0026 FF27 clr r31 501 0028 E050 subi r30,lo8(-(data)) 502 002a F040 sbci r31,hi8(-(data)) 503 002c 2081 ld r18,Z 504 002e 3327 clr r19 505 0030 4427 clr r20 506 0032 5527 clr r21 507 0034 282B or r18,r24 508 0036 392B or r19,r25 509 0038 4A2B or r20,r26 510 003a 5B2B or r21,r27
r22 is the loop counter, incremented by one every iteration. We could remove the need for r22 altogether, charge up r30 as loop counter, and check that for the boundary at the end. Also, the clr and or magic is unnecessary, as we could just move r25 into r19, and let the data stay in r26. The way it looks in bootloader-asm.s, it seems that was not the point where I optimized. I do realize now that I must have used an older version of gcc at the time, because the new code is quite approaching perfect, the major optimization I can find in my assembler code is avoiding to clear both r24 and r25 when returning an 8 bit value. I guess real optimization things will have to wait for a second post, so this post is more of an introduction into assembler on the AVR than real optimization magic.