When I started this project, I knew that C wouldn’t be adequate for some performance-critical bits of code, and that I’d have to write those pieces in assembly. For those cases, I figured the C implementation could work as a prototype. It’s easier to figure out algorithms and iterate designs with C, after all.
DrawMap (and its utility functions, DrawMapSection and DrawBlankMapSection) is a good example. Previously, I worked on the C code, and got the map draw time from an initial 131 ms down to 80 ms – a pretty good start.
I’m not the greatest assembly language programmer, so it’s helpful for me to start with the compiler’s assembly version of the C algorithm. To do that, I just had to ask CC65 to write the assembly output to a file – there’s even an option to include the original C code as inline comments. Then I took the functions I was interested in (DrawMapSection and DrawBlankMapSection, to start) and put them in a .asm file.
Of course, it wouldn’t compile right away. I had to export the new implementation and figure out a bunch of “.import” directives so that the CA65 assembler would understand all the symbols that the initial implementation used:
.export _DrawMapSection .export _DrawBlankMapSection ;; used by DrawMapSection .import _gmr .import incsp6 .importzp sp .import decaxy .importzp sreg .import _tilemap .import _colormap .import incaxy .importzp ptr1 .importzp regsave .import pushax .import pusha .importzp regbank .import incax1
Symbols with a leading “_” are variables in my C code. _gmr is a struct containing all the former local variables for these functions, and _tilemap and _colormap are arrays of data for drawing the map tiles. Here’s gmr, for reference:
struct MapRenderInfo { // passed in from DrawMap uchar width; // section width in tiles uchar height; // section height in tiles uchar* maprow; // mapdata for top left tile uchar* screenrow; // screen position for top left uchar* colorrow; // color RAM for top left tile // Used by DrawMapSection & DrawBlankMapSection uchar pixmap_index; // index for character graphics uchar colormap_index; // index for color map uchar tile; // tile id uchar y; // row loop counter uchar* map; // pointer to current map tile }; struct MapRenderInfo gmr;
Most of the remaining symbols are for subroutine calls related to CC65’s software stack. I wanted to remove as many of those as possible. To understand why, it’s helpful to understand how CC65 treats different kinds of variables.
Variable storage in CC65
Basically, variables will be handled in one of three ways:
- Local variables, as I’ve discussed before, are stored on a special software stack. Creating them requires calling subroutines like the “pushax” and “pusha” above. Using local variables is really slow, so I’ve already removed them from the important parts of the draw routines.
- Global and static variables. These variables are located in the DATA or BSS segments of the program, somewhere between 0x0800 and 0xD000 (where the C64’s I/O chips start). It generally takes 4 clock cycles to load or save the value of a global.
- Register variables. The C64 doesn’t really have enough registers that it can afford to dedicate any of them to a particular variable, but CC65 does try to take advantage of the C “register” keyword. “Register” is a hint that the compiler should store the variable in a reserved chunk of space in page zero of the C64’s memory map. That space starts at the “regsave” symbol and its size is… 6 bytes. Zero page is important because variables stored there can be accessed more quickly (3 cycles instead of 4) and because it allows the use of additional addressing modes. The indirect indexed mode, for example, is really useful for dereferencing pointers, especially pointers that point to an array of data (like a map or a display screen). So I wanted to make as much use of those register spaces as possible.
- Implementation variables. Finally, there are “variables” that are used internally by CC65 for the code it generates. They include a number of the imported symbols mentioned above like ptr1, sp, and regsave. It would probably be a terrible idea to use those memory locations myself, and anything I do could easily be broken in an update to the compiler. But that doesn’t mean I’m not gonna do it.
Knowing more than the compiler does
The reason a human programmer can outperform a compiler like CC65 is largely because the human knows more about what’s going on than the compiler does. For example, the compiler sees a line like:
gmr.pixmap_index = (gmr.tile & 0x7f) + ENTITY_TYPE_MAX;
I already talked about the “& 0x7f” bit last time – I changed the code there so that the compiler would do an 8-bit operation instead of a 16-bit subtraction. Even then, the CC65 assembly was:
lda _gmr+10 ; gmr.tile ldx #$00 and #$7F ldy #$14 ; ENTITY_TYPE_MAX jsr incaxy sta _gmr+8 ; gmr.pixmap_index
That “incaxy” subroutine is a 16-bit add. But I know that the final value will always be less than 256, so we only need to do an 8-bit add:
lda _gmr+10 ; gmr.tile ldx #$00 and #$7F clc adc #$14 ; ENTITY_TYPE_MAX sta _gmr+8 ; gmr.pixmap_index
This is in the inner loop of DrawMapSection, where even a small thing like getting rid of the overhead of a subroutine call is meaningful. It reduced the overall draw time to:
78 ms
Simplifying the array accesses
A lot of what the inner loop of the draw routine does is copy bytes from two arrays (colormap and tilemap) into two other arrays (s1, which points to the screen memory of the row being drawn; and c1, which points to that row’s color memory). The indices are doing all the hard work, pointing at the right tile characters, the right color data, and the right position along the row. Some of CC65’s code generation choices are, nevertheless, a little baffling.
Consider, for example, this line:
c1[x] = colormap[gmr.colormap_index];
For which the compiler generated this assembly:
lda regbank+0 ; low byte of c1 ldx regbank+0+1 ; high byte of c1 clc adc regbank+5 ; x bcc L2101 inx L2101: sta ptr1 stx ptr1+1 ldy _gmr+9 ; gmr.colormap_index lda _colormap,y ldy #$00 sta (ptr1),y
C1, s1, and x are heavily used in the inner loop, so I declared them as register variables – they take up the entire six bytes of “register” space that CC65 provides. But I wasn’t getting as much value out of them as I wanted.
What’s going on here? The first few lines of code are generating the address of c1[x]. It then copies that address to one of the private implementation variables in zero page that I mentioned so that it can dereference the pointer. The thing is that c1’s value was already in zero page, so there’s a much more direct way of performing this indirect access. I replaced this code with:
ldy regbank+5 ; x ldx _gmr+9 ; gmr.colormap_index lda _colormap,x ; lookup actual color value sta (regbank+0),y ; save to c1[x]
I hope you can forgive me for storing the variable named “x” in the Y register, but the indexed indirect addressing only works with Y. Anyway, that’s a much shorter routine, and got the time down to:
75 ms
What’s even better is that the inner loop contains a total of 8 lines like that (since the tile we’re drawing consists of 4 characters and 4 bytes of color data), and they can all be rewritten in about the same way:
56.8 ms
I actually didn’t notice until I was much farther along, but the code to look up map indices also had a similar layout. That leaves us at:
53.1 ms
Bumming instructions
After making those changes the total code was a lot shorter, making it easier to see some opportunities for removing individual instructions. For example, I was able to keep my index variables in registers longer, without worrying about keeping them up to date in their permanent storage in memory. Basically, that meant I could replace some 6 clock cycle inc statments with 2 cycle iny statements.
Sometimes the compiler gave me easy opportunities, too. For example, this:
lda _gmr asl a sta _gmr
Could just be changed to:
asl _gmr
which saves 4 clocks. The original code puts the value in the accumulator, but we don’t use it again for a while, so that doesn’t matter.
50.5 ms
Don’t forget DrawBlankMapSection!
I’d reached a point where I’d cut enough out of the inner loop of DrawMapSection that I figured it was time to copy those changes to DrawBlankMapSection. The benefit was immediate:
39.2 ms
Looking at the code in that context, I had another insight. Instead of incrementing the color_index and pixmap_index variables like I had been doing, I instead added constants to them where needed in the inner loop, and then just changed their values once at the end of the loop body. This was another example of a change meant to avoid the expensive read-modify-write instructions like doing an inc of a memory location.
37.6 ms
Pausing to take stock
In this entry and the previous one, I’ve optimized the map drawing code in this project, reducing the draw time for a “typical” view from 131 ms to 37.6 ms – a 3.6x speedup. It’s still not where I would like it, but the optimizations have been focused on the innermost loop – the hottest part of the code. There’s certainly room for additional optimization – I’ve hardly touched the DrawMap() function which sets up the map sections to be drawn, and that code has expensive integer multiplications in it.
On the other hand, it’s likely that those optimizations will come a little more slowly. At this point, I think I’m going to keep poking away at this code in the background, but also start to focus on some of the other goals I have for this run through the Retrochallenge.