assembly language

All posts tagged assembly language

Amateur Optimization Hour, part 2

Posted by coronax on September 13, 2018

Posted in: Programming, Retrocomputing. Tagged: 6502, assembly language, c64, Retrochallenge. 4 Comments

When I started this project, I knew that C wouldn’t be adequate for some performance-critical bits of code, and that I’d have to write those pieces in assembly. For those cases, I figured the C implementation could work as a prototype. It’s easier to figure out algorithms and iterate designs with C, after all.

DrawMap (and its utility functions, DrawMapSection and DrawBlankMapSection) is a good example. Previously, I worked on the C code, and got the map draw time from an initial 131 ms down to 80 ms – a pretty good start.

I’m not the greatest assembly language programmer, so it’s helpful for me to start with the compiler’s assembly version of the C algorithm. To do that, I just had to ask CC65 to write the assembly output to a file – there’s even an option to include the original C code as inline comments. Then I took the functions I was interested in (DrawMapSection and DrawBlankMapSection, to start) and put them in a .asm file.

Of course, it wouldn’t compile right away. I had to export the new implementation and figure out a bunch of “.import” directives so that the CA65 assembler would understand all the symbols that the initial implementation used:

.export _DrawMapSection
.export _DrawBlankMapSection

;; used by DrawMapSection
.import _gmr
.import incsp6
.importzp sp
.import decaxy
.importzp sreg
.import _tilemap
.import _colormap
.import incaxy
.importzp ptr1
.importzp regsave
.import pushax
.import pusha
.importzp regbank
.import incax1

Symbols with a leading “_” are variables in my C code. _gmr is a struct containing all the former local variables for these functions, and _tilemap and _colormap are arrays of data for drawing the map tiles. Here’s gmr, for reference:

struct MapRenderInfo
{
  // passed in from DrawMap
  uchar width; // section width in tiles
  uchar height; // section height in tiles
  uchar* maprow; // mapdata for top left tile
  uchar* screenrow; // screen position for top left
  uchar* colorrow; // color RAM for top left tile

  // Used by DrawMapSection & DrawBlankMapSection
  uchar pixmap_index; // index for character graphics
  uchar colormap_index; // index for color map
  uchar tile; // tile id
  uchar y; // row loop counter
  uchar* map; // pointer to current map tile
};

struct MapRenderInfo gmr;
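The assembly listings later in this post address fields as `_gmr+8`, `_gmr+9`, and `_gmr+10`. As a rough check of where those numbers come from, here is a host-side model of the struct. It assumes 2-byte pointers (as on the 6502) by swapping each `uchar*` for a `uint16_t`; the names `MapRenderInfoModel` and `ptr16` are made up for this sketch and are not part of the project.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  uchar;
typedef uint16_t ptr16;  /* assumption: stand-in for a 2-byte 6502 pointer */

/* Same field order as MapRenderInfo, with pointers modeled as 16-bit values. */
struct MapRenderInfoModel
{
  uchar width;
  uchar height;
  ptr16 maprow;
  ptr16 screenrow;
  ptr16 colorrow;
  uchar pixmap_index;
  uchar colormap_index;
  uchar tile;
  uchar y;
  ptr16 map;
};

/* Compile-time checks that the offsets match the ones used in the listings. */
_Static_assert(offsetof(struct MapRenderInfoModel, pixmap_index) == 8,
               "pixmap_index should sit at _gmr+8");
_Static_assert(offsetof(struct MapRenderInfoModel, colormap_index) == 9,
               "colormap_index should sit at _gmr+9");
_Static_assert(offsetof(struct MapRenderInfoModel, tile) == 10,
               "tile should sit at _gmr+10");
```

If the model compiles, the offsets line up: two 1-byte fields, three 2-byte pointers, and then the index bytes starting at offset 8.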

Most of the remaining symbols are for subroutine calls related to CC65’s software stack. I wanted to remove as many of those as possible. To understand why, it’s helpful to understand how CC65 treats different kinds of variables.

Variable storage in CC65

Basically, variables will be handled in one of four ways:

Local variables, as I’ve discussed before, are stored on a special software stack. Creating them requires calling subroutines like the “pushax” and “pusha” above. Using local variables is really slow, so I’ve already removed them from the important parts of the draw routines.
Global and static variables. These variables are located in the DATA or BSS segments of the program, somewhere between 0x0800 and 0xD000 (where the C64’s I/O chips start). It generally takes 4 clock cycles to load or save the value of a global.
Register variables. The C64 doesn’t really have enough registers that it can afford to dedicate any of them to a particular variable, but CC65 does try to take advantage of the C “register” keyword. “Register” is a hint that the compiler should store the variable in a reserved chunk of space in page zero of the C64’s memory map. That space starts at the “regbank” symbol and its size is… 6 bytes. Zero page is important because variables stored there can be accessed more quickly (3 cycles instead of 4) and because it allows the use of additional addressing modes. The indirect indexed mode, for example, is really useful for dereferencing pointers, especially pointers that point to an array of data (like a map or a display screen). So I wanted to make as much use of those register spaces as possible.
Implementation variables. Finally, there are “variables” that are used internally by CC65 for the code it generates. They include a number of the imported symbols mentioned above like ptr1, sp, and regsave. It would probably be a terrible idea to use those memory locations myself, and anything I do could easily be broken in an update to the compiler. But that doesn’t mean I’m not gonna do it.

Knowing more than the compiler does

The reason a human programmer can outperform a compiler like CC65 is largely because the human knows more about what’s going on than the compiler does. For example, the compiler sees a line like:

gmr.pixmap_index = (gmr.tile & 0x7f) + ENTITY_TYPE_MAX;

I already talked about the “& 0x7f” bit last time – I changed the code there so that the compiler would do an 8-bit operation instead of a 16-bit subtraction. Even then, the CC65 assembly was:

lda _gmr+10  ; gmr.tile
ldx #$00
and #$7F
ldy #$14     ; ENTITY_TYPE_MAX
jsr incaxy
sta _gmr+8   ; gmr.pixmap_index

That “incaxy” subroutine is a 16-bit add. But I know that the final value will always be less than 256, so we only need to do an 8-bit add:

lda _gmr+10  ; gmr.tile
ldx #$00
and #$7F
clc
adc #$14     ; ENTITY_TYPE_MAX
sta _gmr+8   ; gmr.pixmap_index

This is in the inner loop of DrawMapSection, where even a small thing like getting rid of the overhead of a subroutine call is meaningful. It reduced the overall draw time to:

78 ms
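The claim that an 8-bit add is enough can be checked exhaustively, since there are only 256 possible tile bytes. This is a small host-side sketch, not code from the project; the value 0x14 for ENTITY_TYPE_MAX is taken from the `ldy #$14` in the listing above, and the function names are made up.

```c
#include <stdint.h>

#define ENTITY_TYPE_MAX 0x14  /* from the "ldy #$14" in the compiler output */

/* What the compiler computes: a full 16-bit sum. */
uint16_t pixmap_index_16(uint8_t tile)
{
  return (uint16_t)((tile & 0x7f) + ENTITY_TYPE_MAX);
}

/* What clc/adc computes: the low 8 bits only, carry out discarded. */
uint8_t pixmap_index_8(uint8_t tile)
{
  return (uint8_t)((tile & 0x7f) + ENTITY_TYPE_MAX);
}

/* Returns 1 if the two versions agree for every possible tile byte. */
int eight_bit_add_is_safe(void)
{
  for (unsigned t = 0; t < 256; ++t) {
    if (pixmap_index_16((uint8_t)t) != pixmap_index_8((uint8_t)t))
      return 0;
  }
  return 1;
}
```

The largest possible result is 0x7f + 0x14 = 0x93, so the carry out of the 8-bit add is never set and the high byte can be ignored.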

Simplifying the array accesses

A lot of what the inner loop of the draw routine does is copy bytes from two arrays (colormap and tilemap) into two other arrays (s1, which points to the screen memory of the row being drawn; and c1, which points to that row’s color memory). The indices are doing all the hard work, pointing at the right tile characters, the right color data, and the right position along the row. Some of CC65’s code generation choices are, nevertheless, a little baffling.

Consider, for example, this line:

c1[x] = colormap[gmr.colormap_index];

For which the compiler generated this assembly:

lda regbank+0     ; low byte of c1
ldx regbank+0+1   ; high byte of c1
clc
adc regbank+5     ; x
bcc L2101
inx
L2101: sta ptr1
stx ptr1+1
ldy _gmr+9        ; gmr.colormap_index
lda _colormap,y
ldy #$00
sta (ptr1),y

C1, s1, and x are heavily used in the inner loop, so I declared them as register variables – they take up the entire six bytes of “register” space that CC65 provides. But I wasn’t getting as much value out of them as I wanted.

What’s going on here? The first few lines of code are generating the address of c1[x]. It then copies that address to one of the private implementation variables in zero page that I mentioned so that it can dereference the pointer. The thing is that c1’s value was already in zero page, so there’s a much more direct way of performing this indirect access. I replaced this code with:

ldy regbank+5     ; x
ldx _gmr+9        ; gmr.colormap_index
lda _colormap,x   ; lookup actual color value
sta (regbank+0),y ; save to c1[x]

I hope you can forgive me for storing the variable named “x” in the Y register, but the indirect indexed addressing mode only works with Y. Anyway, that’s a much shorter routine, and got the time down to:

75 ms
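To put a rough number on the difference between the two listings, the sketch below adds up the standard 6502 base cycle counts for each instruction. It assumes regbank and ptr1 are in zero page, _gmr and _colormap are at absolute addresses, and no indexed access or branch crosses a page boundary (each crossing would add a cycle). The enum and function names are made up for the example.

```c
/* Base 6502 cycle counts for the addressing modes used in the two listings. */
enum {
  CYC_ZP_LOAD_STORE = 3,  /* lda/ldx/ldy/sta/stx/adc zero page */
  CYC_ABS_LOAD      = 4,  /* ldx/ldy absolute, lda absolute,x / absolute,y */
  CYC_IMPLIED       = 2,  /* clc, inx */
  CYC_IMMEDIATE     = 2,  /* ldy #$00 */
  CYC_BRANCH_TAKEN  = 3,
  CYC_BRANCH_NOT    = 2,
  CYC_STA_IND_Y     = 6   /* sta (zp),y */
};

/* Compiler-generated version; carry = 1 if low byte of c1 plus x overflows. */
int compiler_cycles(int carry)
{
  int n = 0;
  n += CYC_ZP_LOAD_STORE;      /* lda regbank+0   */
  n += CYC_ZP_LOAD_STORE;      /* ldx regbank+0+1 */
  n += CYC_IMPLIED;            /* clc             */
  n += CYC_ZP_LOAD_STORE;      /* adc regbank+5   */
  if (carry)
    n += CYC_BRANCH_NOT + CYC_IMPLIED;  /* bcc falls through, then inx */
  else
    n += CYC_BRANCH_TAKEN;              /* bcc skips the inx */
  n += CYC_ZP_LOAD_STORE;      /* sta ptr1        */
  n += CYC_ZP_LOAD_STORE;      /* stx ptr1+1      */
  n += CYC_ABS_LOAD;           /* ldy _gmr+9      */
  n += CYC_ABS_LOAD;           /* lda _colormap,y */
  n += CYC_IMMEDIATE;          /* ldy #$00        */
  n += CYC_STA_IND_Y;          /* sta (ptr1),y    */
  return n;
}

/* Hand-written version. */
int handwritten_cycles(void)
{
  return CYC_ZP_LOAD_STORE     /* ldy regbank+5     */
       + CYC_ABS_LOAD          /* ldx _gmr+9        */
       + CYC_ABS_LOAD          /* lda _colormap,x   */
       + CYC_STA_IND_Y;        /* sta (regbank+0),y */
}
```

Under those assumptions the compiler's sequence costs 36 or 37 cycles and the rewrite costs 17, so each array store is roughly twice as fast.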

What’s even better is that the inner loop contains a total of 8 lines like that (since the tile we’re drawing consists of 4 characters and 4 bytes of color data), and they can all be rewritten in about the same way:

56.8 ms

I actually didn’t notice until I was much farther along, but the code to look up map indices also had a similar layout. That leaves us at:

53.1 ms

Bumming instructions

After making those changes the total code was a lot shorter, making it easier to see some opportunities for removing individual instructions. For example, I was able to keep my index variables in registers longer, without worrying about keeping them up to date in their permanent storage in memory. Basically, that meant I could replace some 6 clock cycle inc instructions with 2 cycle iny instructions.

Sometimes the compiler gave me easy opportunities, too. For example, this:

lda _gmr
asl a
sta _gmr

Could just be changed to:

asl _gmr

which saves 4 clocks. The original code puts the value in the accumulator, but we don’t use it again for a while, so that doesn’t matter.

50.5 ms

Don’t forget DrawBlankMapSection!

I’d reached a point where I’d cut enough out of the inner loop of DrawMapSection that I figured it was time to copy those changes to DrawBlankMapSection. The benefit was immediate:

39.2 ms

Looking at the code in that context, I had another insight. Instead of incrementing the colormap_index and pixmap_index variables like I had been doing, I added constants to them where needed in the inner loop, and then just changed their values once at the end of the loop body. This was another example of a change meant to avoid expensive read-modify-write instructions, such as an inc of a memory location.

37.6 ms
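The idea behind that last change can be sketched in C. The two functions below copy the four characters of one tile: the first bumps the index in memory after every use, the second uses constant offsets and updates the index once. The function names and the test table are invented for this illustration, and the sketch assumes the starting index plus 3 stays below 256, just as a `lda table+3,x` would.

```c
#include <stdint.h>
#include <string.h>

/* Before: the index is incremented in memory after each lookup. */
void draw_tile_incrementing(uint8_t *out, const uint8_t *table, uint8_t *index)
{
  out[0] = table[*index]; ++*index;
  out[1] = table[*index]; ++*index;
  out[2] = table[*index]; ++*index;
  out[3] = table[*index]; ++*index;
}

/* After: constant offsets in the loop body, one index update at the end. */
void draw_tile_offsets(uint8_t *out, const uint8_t *table, uint8_t *index)
{
  uint8_t i = *index;
  out[0] = table[i];
  out[1] = table[i + 1];
  out[2] = table[i + 2];
  out[3] = table[i + 3];
  *index = (uint8_t)(i + 4);
}

/* Returns 1 if both versions copy the same bytes and leave the same index.
   start must be 252 or less so that start + 3 stays inside the table. */
int tile_copies_match(uint8_t start)
{
  uint8_t table[256];
  uint8_t a[4], b[4];
  uint8_t ia = start, ib = start;

  for (int i = 0; i < 256; ++i)
    table[i] = (uint8_t)(i ^ 0x5a);  /* arbitrary test pattern */

  draw_tile_incrementing(a, table, &ia);
  draw_tile_offsets(b, table, &ib);
  return ia == ib && memcmp(a, b, sizeof a) == 0;
}
```

In assembly the constant offsets cost nothing at run time because the assembler folds them into the operand address, which is where the saving comes from.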

Pausing to take stock

In this entry and the previous one, I’ve optimized the map drawing code in this project, reducing the draw time for a “typical” view from 131 ms to 37.6 ms – roughly a 3.5x speedup. It’s still not where I would like it, but the optimizations have been focused on the innermost loop – the hottest part of the code. There’s certainly room for additional optimization – I’ve hardly touched the DrawMap() function which sets up the map sections to be drawn, and that code has expensive integer multiplications in it.

On the other hand, it’s likely that those optimizations will come a little more slowly. At this point, I think I’m going to keep poking away at this code in the background, but also start to focus on some of the other goals I have for this run through the Retrochallenge.

A Better Software SPI Routine

Posted by coronax on June 18, 2017

Posted in: Electronics, Programming, Retrocomputing. Tagged: 6522, assembly language, Project:65, SPI.

When I first built the Project:65 computer, I was pretty much a novice at 6502 assembly language. So when it came to the hard parts, I usually opted for “easy to understand” over “fast”. Since then I’ve had more practice, I have better tools, and I’ve read a lot of material that’s given me ideas for better bit manipulation.

Bit manipulation is what this project is all about. I’m using a 65c22 VIA to communicate over SPI with a MAX3100 UART. The 6522 has a couple of parallel ports that are user-controlled. Arduino or Raspberry Pi aficionados might think of them as collections of bidirectional GPIO pins. I’m using four of these to implement the SPI communication between the two chips.

    SPI CONNECTION: 6522 VIA PORT B to MAX3100
   ________                             ______
           |                           |
        PB7|<----- RECEIVE (NEW) --+--<|DOUT
        PB6|                       |   |
  6522  PB5|                       |   | MAX3100
  VIA   PB4|                       |   |  UART
        PB3|<----- RECEIVE (OLD) --+   |
        PB2|>------- TRANSMIT -------->|DIN
        PB1|>------ CHIP SELECT ------>|/CS
        PB0|>--------- CLOCK --------->|SCLK
           |                           |
        CB2|<------- INTERRUPT --------|/IRQ
        CB1|                           |
   ________|                           |______

THE 6522 PARALLEL PORT CONTAINS 8 BIDIRECTIONAL DATA
PINS (PB0 TO PB7) AND 2 "CONTROL" PINS (CB1 AND CB2). 
ARROWS INDICATE THE DIRECTIONS WE ARE USING TO SEND
AND RECEIVE DATA.

Every command to the MAX3100 is a two-byte sequence and is accompanied by a two-byte response. Rather than worry about the high-level parts of the protocol, though, I started out by looking at a routine I wrote called “spibyte”. This is a very generic routine that simply sends a single data byte out the SPI port while simultaneously reading back the response from the 3100, but it doesn’t do anything to interpret those responses. Still, this is where the actual bits are being banged, so it’s where most of the time is being used.

The starting version of this code was pretty simple and had a lot of branching and subroutine calls in it – one for each bit:


; sends the value in the accumulator thru SPI.
; returns the value it reads back.
spibyte:
            sta writebuffer
            jsr rwbit
            jsr rwbit
            jsr rwbit
            jsr rwbit
            jsr rwbit
            jsr rwbit
            jsr rwbit
            jsr rwbit
            lda readbuffer
            rts


; rwbit writes one bit of writebuffer and reads
; one bit into readbuffer.  Both buffers are
; left shifted by one place.
rwbit:
            rol writebuffer
            bcs nonzero
            lda VIA_DATAB
            and #%11111011   ; set output bit low
nzret:
            sta VIA_DATAB    ; set output bit
            inc VIA_DATAB    ; set clock high    
            ; receive bit
            asl readbuffer
            lda VIA_DATAB    ; read input bit
            and #$8
            beq reczero
            inc readbuffer   ; add a 1 bit 
reczero:
            dec VIA_DATAB    ; set clock low
            rts
nonzero: 
            lda VIA_DATAB
            ora #4           ; set output high
            jmp nzret

So as you can see, that’s a lot of instructions that have to be executed every time I want to send out a single byte. If I was using a UART directly on the CPU bus, this could all be done in a handful of instructions instead. Still, I could make it better.

Those subroutine calls definitely had to go. Since I wrote the original code, I’ve switched to using the ca65 assembler, which has a directive to simply incorporate a block of code n number of times. That lets me remove the subroutine call without having eight copies of the “rwbit” routine muddying up my source code.

My original implementation used the first four bits of the 6522’s parallel port (VIA_DATAB). This allowed me to make some smart moves – for example, I can toggle the status of the clock line with a single INC (increment) or DEC (decrement) operation, without even loading a value into the CPU’s accumulator.

I was also originally very careful not to change the values of the rest of the parallel port. I’ve since decided that this is unnecessary – if I do use those bits in the future, it’ll probably be to support other SPI devices using the same interface.

However, this still wasn’t an optimal choice. Something I read that went over my head originally was that if the RECEIVE line (which carries the 3100’s response back to the 6522) is in bit 7 instead of bit 3, I can get rid of some logic operations and branches – most of the stuff after the “receive bit” comment. Instead, I can load the port value into the accumulator, rotate it to the left (so that bit 7 gets copied to the CPU’s carry flag), and then rotate the read buffer (so the carry flag value gets copied into its least significant bit). This way I can copy the incoming bit to the destination with three instructions, and without ever caring what the value of that bit actually is. Of course, this required moving a wire on the breadboard, but it’s neat to think that you’re modifying the hardware to support a software optimization.

Here’s the improved “spibyte” routine:


;; spibyte
;; Sends the byte in accumulator and receives a
;; new byte into accumulator.
.proc spibyte
	sta spi_writebuffer
.repeat 8			; copy the next section 8 times
.scope
	lda #%01111000		; base DATAB value with chip select for
				; MAX3100 and a zero bit in the output
				; line.
	rol spi_writebuffer
	bcc writing_zero_bit
	ora #%00000100		; write a 1 bit to the output line.
writing_zero_bit:
		
 	sta VIA_DATAB    	; write data back to the port
	inc VIA_DATAB    	; set clock high

	lda VIA_DATAB		; Read input bit
	rol			; Shift input bit to carry flag
	rol spi_readbuffer	; Shift carry into readbuffer

	dec VIA_DATAB    	; set clock low
.endscope
.endrepeat
	lda spi_readbuffer	; result goes in A
	rts
.endproc

The core of this routine is down from 17 instructions (including the “jsr rwbit”) to 10, which is pretty satisfying. I’m pretty sure there’s still an opportunity to bum a few instructions off the routine by using some of the newer features of the 65c02, but this is a good baseline to work off of.
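For readers who don't speak 6502, the bit shuffling in the improved routine can be modeled in C. This is a host-side sketch only: there is no real VIA here, and the "device" is a hypothetical stand-in that samples the TRANSMIT bit (PB2) while the clock is high and drives its reply, most significant bit first, on PB7. The function name and parameters are invented for the example.

```c
#include <stdint.h>

/* Models one call to spibyte. "out" is the byte to send, "device_reply" is
   what the simulated device shifts back, and *seen_by_device receives the
   byte the device saw on the TRANSMIT line. Returns the byte read back. */
uint8_t spi_exchange_model(uint8_t out, uint8_t device_reply,
                           uint8_t *seen_by_device)
{
  uint8_t writebuffer = out;
  uint8_t readbuffer = 0;
  uint8_t received = 0;

  for (int bit = 0; bit < 8; ++bit) {
    uint8_t port = 0x78;                        /* lda #%01111000 */

    /* rol spi_writebuffer: top bit moves into carry. (A plain shift is
       used here; the bit rotated in at the bottom is never sent.) */
    uint8_t carry = (uint8_t)(writebuffer >> 7);
    writebuffer = (uint8_t)(writebuffer << 1);

    if (carry)
      port |= 0x04;                             /* ora #%00000100 */
    port |= 0x01;                               /* inc VIA_DATAB: clock high */

    /* Simulated device: sample PB2, present next reply bit on PB7. */
    received = (uint8_t)((received << 1) | ((port >> 2) & 1));
    uint8_t pb7 = (uint8_t)((device_reply >> (7 - bit)) & 1);
    uint8_t in = (uint8_t)((port & 0x7f) | (pb7 << 7));  /* lda VIA_DATAB */

    carry = (uint8_t)(in >> 7);                          /* rol a */
    readbuffer = (uint8_t)((readbuffer << 1) | carry);   /* rol spi_readbuffer */

    /* dec VIA_DATAB would drop the clock again; nothing to model here. */
  }

  *seen_by_device = received;
  return readbuffer;
}
```

Note that the receive path never tests the incoming bit: it goes from bit 7 to carry to the bottom of the read buffer, which is exactly why moving the wire to PB7 removes the `and`/`beq`/`inc` sequence.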

Retrochallenge: Success!

Posted by coronax on July 31, 2013

Posted in: Programming, Retrocomputing. Tagged: 6502, Arduino, assembly language, EEPROM, Project:65, Retrochallenge.

In the final stages of this project, the Arduino has been doing double duty as an SD card controller (left) and, when things go wrong, as an emergency EEPROM burner (top).

It’s just about the end of July, and while I haven’t accomplished everything I wanted for my Retrochallenge project, I have reached a good place to conclude. It’s been fun and challenging and occasionally frustrating – but definitely worth it. Most of all, I got stuff to work, and that makes me very happy.

Once I got my 6502 to talk to the Arduino and read data back from the SD card, I had to integrate all that code into the P:65’s primitive operating system, which lives in an 8 KB EEPROM. Shortly before the Retrochallenge started, I wrote a program that let the P:65 download and rewrite the EEPROM itself. It’s a lot easier than pulling the EEPROM out and sticking it in my homemade programmer every time I make a change.

The first time I tried that, I wrote a ROM image that was so bad the poor thing couldn’t even boot up again afterwards. It turned out my linker configuration was screwed up, and it put the “.rodata” section in front of where the code was supposed to start, and the EEPROM burner code proceeded to put everything in the wrong place.

Even when the code was right, sometimes I’ve been wrong. Several times now I’ve accidentally flashed the EEPROM with the wrong data file. In particular, at least once I’ve overwritten my operating system with the code for the EEPROM burning utility itself. This… did not work.

As for the actual software (all written in 6502 assembly), I’m happy to say that most of my bugs have turned out to be simple ones. That doesn’t mean they were easy to track down, by any means, but once they were identified they were easy to fix. That’s lucky, since it’s been really tricky to debug this software. Since I’m changing the way the low-level input/output system works, it’s hard to generate any debug output without further messing up the code I’m trying to fix.

Eventually, I got the basic I/O routines built into the OS and running correctly. I now have a generic device interface that can be used to talk to four virtual devices:

0     Serial port / console
1     SD Card command channel (list files, make directories, etc.)
2     SD Card file #1
3     SD Card file #2

For example, when I try to write a character to a file, I call a generic putc() function. The generic putc() looks up the specific device driver putc() function for the current device and passes the character to be written to it. Devices 2 and 3 can be used to open files on the SD card by name and read or write data to them.
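The real implementation is in 6502 assembly, but the dispatch idea is easy to show in C with a table of function pointers. Everything below is a hypothetical sketch: the driver functions just append to per-device buffers so the routing can be observed, and none of the names come from the P:65 source.

```c
#include <stddef.h>

#define NUM_DEVICES 4
#define BUF_SIZE    32

typedef int (*putc_fn)(char c);

/* Stand-in "hardware": each device records what was written to it. */
char   device_buf[NUM_DEVICES][BUF_SIZE];
size_t device_len[NUM_DEVICES];

static int append(int dev, char c)
{
  if (device_len[dev] >= BUF_SIZE)
    return -1;                       /* buffer full */
  device_buf[dev][device_len[dev]++] = c;
  return 0;
}

/* One driver entry point per virtual device. */
static int console_putc(char c)    { return append(0, c); }
static int sd_command_putc(char c) { return append(1, c); }
static int sd_file1_putc(char c)   { return append(2, c); }
static int sd_file2_putc(char c)   { return append(3, c); }

static const putc_fn putc_table[NUM_DEVICES] = {
  console_putc, sd_command_putc, sd_file1_putc, sd_file2_putc
};

static int current_device = 0;

/* Select which device the generic calls talk to; -1 on a bad number. */
int set_device(int dev)
{
  if (dev < 0 || dev >= NUM_DEVICES)
    return -1;
  current_device = dev;
  return 0;
}

/* Generic putc: look up the current device's driver and forward to it. */
int generic_putc(char c)
{
  return putc_table[current_device](c);
}
```

Calling `set_device(2)` and then `generic_putc('A')` routes the character to the first SD card file and leaves the console untouched. On the 6502 the same pattern is typically a table of addresses and an indirect jump.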

With the basic functions all working, it was time to write some “shell commands” that I could use interactively from the P:65’s command prompt. Running out of time, I just implemented the basics:

ls directory     Get a list of files in a directory.
mkdir name       Create a new directory.
rmdir name       Delete a directory.
rm name          Delete a file.
more name        Print the contents of the file name to the console.
load name        Load and run the named file.
save name        Save the most recently downloaded file to the SD card.

Out of all of these, the more command probably took the longest to get right – that’s because it was the first to use the full range of file open and read commands, and it required a lot of time tracking down a handful of really trivial (but hard to spot) bugs, on both the 6502 and Arduino sides of the communications.

The load command also had me stymied for a while because of a bug in my pointer math when putting the new data into memory. Luckily, once I figured that out I was able to avoid a similar problem in the save command, which came together very quickly earlier this evening.

Up to now, I’ve had to download programs to the P:65 using XMODEM over its serial port. Finally being able to load programs into memory from the SD card is a big step up in convenience – big enough that I can say I’ve completed the first phase of this project.

Being able to load programs from a “disk” makes the P:65 feel more like a real computer!

There’s certainly more to do with this project. For example, I can load my BASIC interpreter from the SD card, but the interpreter doesn’t know how to load or save BASIC files. I’m also interested in adding support for this IO code to the CC65 standard library, so that I can write (very short) C programs for the Project:65 computer that use functions like fprintf and fscanf. I’d hoped to get into that this month, but it’s going to be a significant project all on its own.

Overall, I’m happy with where I ended up. Having an explicit goal and a deadline made things go faster than they would have otherwise, and I mostly avoided distractions. Plus, watching the progress on some of the other Retrochallenge projects kept me motivated – if you haven’t yet, you should definitely check them out.

Retrochallenge: 6522 Parallel Communications

Posted by coronax on July 3, 2013

Posted in: Electronics, Programming, Retrocomputing. Tagged: 6502, 6522, assembly language, Retrochallenge. 1 Comment

I’ve only just started this project and I’ve already had to bust out the logic analyzer. Note the different lengths of the 0 pulses in the top row – a cheap debugging trick I describe below.

So here I am, two full days into my RetroChallenge project – and not a lot to show for it yet. But half an hour ago I was at “nothing to show for it”, so that’s an improvement.

I decided to start on my “disk controller” by figuring out the communication between the 6502 computer and an Arduino (which will hold the SD card, and which I’ll replace with a bare ATmega later on). I’ve done this before with an SPI protocol, but this time I wanted to upgrade and use a parallel port for communications. I figured I’d be able to read or write an entire byte at a time, and that would make my life easier further down the road.

Yeah, what’s that old Donald Knuth saying? “Premature optimization is the root of all evil?” Yeah. He ain’t kidding.

The 6522 VIA that I use for the P:65 computer’s IO actually has two parallel ports. I’m using one of them for my bit-banged SPI circuit, but the other one, Port A, was still available. The 6522 also supports handshaking using two additional lines, CA1 and CA2. The computer signals the peripheral (the Arduino) with CA2, and the peripheral signals back using CA1. It seemed like it should be possible to manage two-way communication using these signals.

The VIA’s datasheet gives a reasonably good description of how this works, but there are a lot of settings, and what I really could’ve used was an example or two. Unfortunately, Googling mostly just gave me links back to my own blog. (Hey, Google: personalized search results are nice, but that’s not helping!)

I went through quite a bit of trial and error getting this to work. I experimented with the 6522’s output latches before deciding to just leave them off, and tried several different modes for the handshake signals.

When the VIA wants to write to the Arduino, it puts a data byte on the parallel port and then pulses CA2 (the VIA is set to do this automatically whenever the port is read from or written to). Then, the P:65 waits for a pulse on CA1 which indicates that the Arduino has read the byte.

When the Arduino writes to the VIA, the Arduino puts data on the port and pulses CA1. Then it waits for the P:65 computer to pulse CA2. Of course, both sides need to be on the same page about who’s writing and who’s reading, or the whole thing breaks down.

I’m not doing any significant interrupt processing on the P:65 yet, so when it’s the P:65’s turn to wait I’m just busy-waiting on the 6522 Interrupt Flag Register. That’s actually where my biggest problem came from: When I do a write followed by a read, the P:65 is waiting on the IFR twice in a row without doing anything else. In this particular case, I need to manually clear the IFR between the two waits. If I’d read or written to the data port, that would’ve automatically cleared the interrupt flag. That sounds simple, but it took me a couple hours with the logic analyzer to puzzle out what was actually happening.
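The stale-flag problem is easier to see in a tiny model. The sketch below treats the CA1 bit of the IFR as a latch that is set by a pulse and cleared either by touching port A or by writing a 1 to that bit of the IFR, which matches the behavior described above. The struct and function names are invented for the illustration; this is not the P:65 code.

```c
#include <stdint.h>

#define IFR_CA1 0x02  /* IFR bit 1: CA1 active edge seen */

struct via_model { uint8_t ifr; };

/* The peripheral pulses CA1: the flag latches and stays set. */
void pulse_ca1(struct via_model *v)
{
  v->ifr |= IFR_CA1;
}

/* Reading or writing port A clears the CA1 flag as a side effect. */
void access_port_a(struct via_model *v)
{
  v->ifr = (uint8_t)(v->ifr & ~IFR_CA1);
}

/* Writing 1 bits to the IFR clears the corresponding flags. */
void write_ifr(struct via_model *v, uint8_t mask)
{
  v->ifr = (uint8_t)(v->ifr & ~mask);
}

/* Would a busy-wait on the CA1 flag fall through right now? */
int ca1_pending(const struct via_model *v)
{
  return (v->ifr & IFR_CA1) != 0;
}

/* Models a write followed by a read. Returns 1 if the second wait would
   fall through on the flag left over from the first handshake. */
int second_wait_sees_stale_flag(int clear_between)
{
  struct via_model v = { 0 };

  access_port_a(&v);   /* P:65 writes the command byte to port A      */
  pulse_ca1(&v);       /* Arduino acknowledges; the first wait ends   */

  if (clear_between)
    write_ifr(&v, 0xFF);   /* the manual clear between the two waits  */

  /* The Arduino has not sent its status byte yet at this point. */
  return ca1_pending(&v);
}
```

Without the manual clear, the second wait sees the flag from the acknowledge pulse and returns immediately, before the Arduino has put anything on the port.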

I’ll mention one debugging trick I discovered while working on this. On the Arduino side of this circuit, there are three different places where the code sends a pulse on CA1, and I was having a hard time telling them apart when I was using the logic analyzer. I added some delay commands so that each pulse had a different duration, and suddenly I was able to see exactly which one was which.

Here’s a bit of code to read a character from the Arduino. We actually send one character (a read command) and then read back two more (a status byte and the actual data). I want to emphasize that this code has worked for me exactly once; I’m sure there’ll be changes once I start really using it.

; Get character.  Returns result in A, and Carry is true if 
; a character was read.            
SD_GETC:
          lda     #%00001010
          sta     VIA_PCR    ; set CA2 to pulse mode, CA1 negative edge trigger
          lda     #$ff       ; data port to write mode
          sta     VIA_DDRA
          lda     #$19       ; tell the Arduino I want to read a byte
          sta     VIA_DATAA  ; write to port A
          ; wait for arduino to read
@wait:    lda     VIA_IFR    ; busy wait for IFR bit 1 to go high
          and     #$2
          beq     @wait

          lda     #$FF
          sta     VIA_IFR    ; clear the interrupt register
          stz     VIA_DDRA   ; data port to read mode

          ; wait for arduino to write status byte
@wait3:   lda     VIA_IFR    ; busy wait for IFR bit 1 to go high
          and     #$2
          beq     @wait3

          lda     VIA_DATAA  ; read status byte (also clears IFR)
                             ; status 0 = good, 1 = not ready, 2 = eof
          bne     @nochar

          ; wait for arduino to write data byte
@wait2:   lda     VIA_IFR    ; busy wait for next char - real data
          and     #$2
          beq     @wait2
          lda     VIA_DATAA  ; retrieve actual character read
          sec                ; read a byte, so set carry
          rts

@nochar:  clc
          rts                ; no char read, so clear carry & return

Mixing some C into the Baking Pi course

Posted by coronax on March 10, 2013

Posted in: Programming. Tagged: assembly language, C, Raspberry Pi.

My post last week left me wondering how hard it would be to mix a few C functions into the assembly-language “operating system” I’ve been following in Alex Chadwick’s Baking Pi tutorial. The answer, it turns out, is “not very hard” – but it does highlight a few elements of the compilation process that many developers (maybe excluding those doing a lot of embedded stuff) tend to take for granted.

The most common way of mixing C and assembly is probably to include inline assembly in C or C++ code. The idea is that C is doing most of the work, and the assembly is added in for the occasional special purpose. Here, I’m doing the opposite. I’ve got about 1300 lines of ARM assembly that already builds and works together. I want to add a C function that a) I can call from assembly, and b) that can itself call other assembly language routines. To do this, I need four things:

1. A C compiler

Conveniently, I’ve already got that. The Baking Pi course recommends a project called YAGARTO (Yet Another GNU Arm Toolchain) – basically, it’s the GCC compiler built as an ARM cross-compiler running on Windows. I suppose the other option would be to just use GCC under Raspbian on the Pi itself, but I’m doing enough swapping of SD cards as it is.

Baking Pi itself is all assembly, so it uses the YAGARTO assembler (arm-none-eabi-as) and linker (arm-none-eabi-ld), but the C compiler is also there, and just waiting to be used. The one thing I’m missing is a big chunk of the standard C library. A lot of the library depends on the underlying OS (think about File I/O, or something like printf() ) – and I don’t really have one yet.

2. A Makefile

Next, we need to teach the project’s makefile how to compile the C parts of the program. It’s been a while since I manually edited a makefile, but in this case I was able to squeak by with some minor changes to the default Baking Pi makefile. First, by teaching it that “.c” files should also produce corresponding object files:

# The names of all object files that must be generated. Deduced from the 
# assembly code files (and C source files) in source.
ASM_OBJECTS := $(patsubst $(SOURCE)%.s,$(BUILD)%.o,$(wildcard $(SOURCE)*.s))
C_OBJECTS := $(patsubst $(SOURCE)%.c,$(BUILD)%.o,$(wildcard $(SOURCE)*.c))
OBJECTS := $(ASM_OBJECTS) $(C_OBJECTS)

And then by adding a rule for compiling “.c” into “.o”:

# Rule to make the object files.
$(BUILD)%.o: $(SOURCE)%.c
	$(ARMGNU)-gcc -I $(SOURCE) $< -c -o $@

That wasn’t so bad, was it?

3. A Linker script

It turns out that’s not quite enough to get my C code to build. Item three on the to-do list is to adjust the linker script, and this is the part I had the least experience with. Before this, I’ve had to manually edit linker scripts twice in my life. The first was actually for the Baking Pi course, to adjust the provided script to deal with changes in the Raspberry Pi bootloader. The second time was to teach CC65 to link code correctly for my Project:65 breadboard computer.

The linker script tells the linker how to put all the bits and pieces of the object files together into an executable. The object files (and the original assembly) split code into different sections. For example, “.text” contains the program code, and “.data” contains initial values for global variables. If you want to see the default linker script used by GCC under Linux, try to compile a simple program with the argument “-Wl,-verbose”. It’s… long. Happily, the Baking Pi script skips most of that complexity. There’s an “.init” section that’s hardcoded to 0x8000, “.text” starting at 0x8080 (my adjustment from last time), and a “.data” section following “.text”.

My C function required two other sections, “.rodata” and “COMMON”, which require a little explanation. “.rodata” stands for “read-only data”, which in this case was mostly string constants. When I was working on Project:65’s kernel, that was an important distinction, because anything in “.rodata” could stay in ROM, whereas “.data” needed to be allocated and initialized in RAM. In this case the entire program is running out of RAM, so it’s not important, and I just put the “.rodata” section after the regular “.data”.

Finally, there’s the “COMMON” section, about which I don’t know a lot. GCC apparently uses it for uninitialized global variables that can appear in more than one file. It’s a backward-compatibility thing, and if you turn it off (GCC’s -fno-common option does this), this data goes into the “.bss” section instead.
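As a rough guide to which C declarations end up in which section, here is a small example file. The section names in the comments describe typical GCC behavior rather than anything specific to this project, and the variable names are made up. Note that GCC 10 and later default to -fno-common, so on a newer toolchain the last variable lands in “.bss” unless -fcommon is passed.

```c
/* Read-only data: a const array normally goes into .rodata. */
const char banner[] = "hello";

/* Initialized global with a nonzero value: goes into .data. */
int initialized_counter = 5;

/* Uninitialized global (a "tentative definition"): goes into COMMON when
   compiled with -fcommon, and into .bss with -fno-common. */
int uninitialized_counter;

/* Ordinary code: goes into .text. */
int sum_counters(void)
{
  return initialized_counter + uninitialized_counter;
}
```

Running `arm-none-eabi-objdump -h` on the resulting object file lists the sections it contains, which is a quick way to see what the linker script needs to handle.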

Here’s my edited linker script:

/***************************************************************************
*    kernel.ld
*     by Alex Chadwick
*
*    A linker script for generation of raspberry pi kernel images.
***************************************************************************/

SECTIONS {
    /*
    * First and foremost we need the .init section, containing the IVT.
    * coronax - changed this from 0000 to 8000 to work with newer rpi boot
    * loader, which loads code to 8000 regardless
    */
    .init 0x8000 : {
        *(.init)
    }

    /* 
    * We allow room for the ATAGs and the stack and then start our code at
    * 0x8000.
    * coronax - changed to 8080 to make room for .init section
    */
    .text 0x8080 : {
        *(.text)
    }

    /* 
    * Next we put the data.
    */
    .data : {
        *(.data)
    }

    /* coronax - added .rodata and COMMON sections */
    .rodata : {
        *(.rodata)
    }

    COMMON : {
        *(COMMON)
    }

    /*
    * Finally comes everything else. A fun trick here is to put all other 
    * sections into this section, which will be discarded by default.
    */
    /DISCARD/ : {
        *(*)
    }
}

4. Some source code written in C

Finally, I needed some test code. I decided to write a little test function that did some string manipulation and printing using the assembly-language FormatString and DrawString methods from Baking Pi’s Screen03 and Screen04 tutorials. FormatString is a poor man’s sprintf(), while DrawString() actually draws a string into the screen buffer.

The only real trick here is that I needed to create C function declarations for my assembly language routines, so the C compiler would know how to pass data back and forth. The assembly-language routines in the Baking Pi course all use the same calling conventions as GCC, which makes this easier. The first four arguments for a function are in registers R0 through R3, and any subsequent arguments go on the stack. The return value of a function is placed in R0. My function declarations looked like:

struct Tag
{
    int LengthInWords;
    short int TagNum;    /* Half word containing the tag number */
    short int MagicNum;    /* All tags have the value 0x5441 here. */
    char TagContent[1];
};

int FormatString (char* format, int len, char* buffer, ...);
void DrawString (char* buffer, int len, int x, int y);
struct Tag* FindTag (int tag_num);

That Tag structure needs some explanation. FindTag is an assembly-language routine that searches through a set of tag objects containing startup information. The one we care about is the command-line arguments passed to our kernel. The function returns a pointer to a structure in memory, so I created an equivalent C struct, with one minor cheat. The real tags include their data inline, so they have variable length. In this case, I’m using a phony one-character array that I can pass to the string functions.
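The one-character array trick is easier to see with a fake tag. The sketch below is only a host-side illustration: the union, the MakeFakeCmdlineTag helper, and the way LengthInWords is computed are my own stand-ins, not anything from Baking Pi. It lays a string directly after the tag header, the way the real tags carry their data inline, and returns a pointer to it just as FindTag would.

```c
#include <stddef.h>
#include <string.h>

struct Tag
{
    int LengthInWords;
    short int TagNum;
    short int MagicNum;
    char TagContent[1];   /* really "however many bytes the tag holds" */
};

/* Hypothetical stand-in for the memory the bootloader would fill in. */
static union
{
    struct Tag tag;
    unsigned char raw[64];
} fake_atags;

/* Builds a fake command-line tag whose text starts where TagContent
 * starts, and returns a pointer to it. Overlong text is truncated. */
struct Tag* MakeFakeCmdlineTag (const char* text)
{
    size_t offset = offsetof (struct Tag, TagContent);
    size_t room = sizeof fake_atags.raw - offset - 1;
    size_t n = strlen (text);

    if (n > room)
        n = room;

    memset (fake_atags.raw, 0, sizeof fake_atags.raw);
    memcpy (fake_atags.raw + offset, text, n);

    fake_atags.tag.TagNum = 9;
    fake_atags.tag.MagicNum = 0x5441;
    /* Assumed here: header plus text, rounded up to whole words. */
    fake_atags.tag.LengthInWords = (int) ((offset + n + 1 + 3) / 4);
    return &fake_atags.tag;
}
```

Reading past TagContent[0] is the classic "struct hack". It is not strictly blessed by the C standard, but it is a long-standing idiom that GCC tolerates for trailing arrays, which is why passing cmdline->TagContent to the string routines works.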

Finally, here’s my actual C function. It’s got local variables, uses the stack, calls assembly language routines, and so on. All it’s really doing is printing a few lines of text, but it’s a good example of how a little C can make for a friendlier experience without sacrificing too much performance.

int myfunc ()
{
    int x = 105;
    int y = 169;
    int len, slen;
    struct Tag* cmdline;
    char *s1, *s2, *end;

    /* Tag 9 contains command-line parameters for kernel. */
    cmdline = FindTag (9);
    /* This is the magic formula for length of tag data: */
    slen = (cmdline->LengthInWords * 2) - 8; 
    len = FormatString ("Tag num is %d and the magic number is 0x%x.",
        43,buffer,(int)cmdline->TagNum,(int)cmdline->MagicNum);
    DrawString (buffer, len, x, y);
    y += 16;

    s1 = (cmdline->TagContent);
    end = s1 + slen;

    while (s1 < end)
    {
        for (s2 = s1; (s2 < end) && (*s2 != ' '); ++s2)
            ;
        DrawString (s1, s2-s1, x, y);
        y += 16;
        s1 = s2 + 1;
    }

    return 0;
}

Calling myfunc() from assembly is also simple; just use the instruction “bl myfunc”. One advantage of using plain old C instead of C++ is that there’s no name mangling to deal with: function and variable names in C can be used directly as label names in assembly, and vice versa.

It didn’t take very long to figure out how to get these two languages to play nice with each other, and I’m glad I did it. Using C for the “connective tissue” between assembly routines makes development a lot simpler, and it may encourage me to spend some more time expanding this pseudo-operating-system to see how much it can really do.

More Assembly for Pi

Posted by coronax on March 4, 2013

Posted in: Programming. Tagged: assembly language, Raspberry Pi.

Bare-metal programming the Raspberry Pi: Now with text and numerical formatting. You know what you don’t get with this kind of low-level programming? An easy way to take screenshots, that’s what!

Back in October, I mentioned that I’d been working my way through the Baking Pi course – an introduction to bare metal programming for the Raspberry Pi computer. I’d just started getting into the graphics routines, but around that time I started focusing on my Project:65 computer, and I set the Pi aside for a while.

Since the Raspberry Pi’s first birthday was this week, I figured it was high time I got back into it and actually did something with it. In fact, I’ve got a couple of ideas. For example, I’d like to use the Pi to control my EEPROM burner and give it a more convenient interface. But first, I figured it was time to get back into the Baking Pi tutorials.

It’s probably worth mentioning the big hullabaloo about the Raspberry Pi’s graphics drivers that happened a few months back, because it ties in directly to how the Baking Pi tutorials handle rendering. When the Pi was first released, there was very little documentation about the graphics system. The tutorials took the simplest possible approach: Get the graphics driver to give us a writable framebuffer (this in itself had to be partly reverse engineered) and then do all the graphics as software rendering into the buffer.

More information has been released since then, but in a way that hasn’t entirely made the community happy. A software driver for the Pi has been released, but it turns out the driver doesn’t actually do very much. Most of the real functionality is in the GPU’s firmware, which remains closed source. Now, every GPU has some functionality in firmware, but this is apparently an extreme case.

From an Open Source zealot’s point of view, it’s an unsatisfactory situation. From my point of view doing bare-iron development for fun, it means there are actually two options: Software rendering, like in the Baking Pi tutorials, or talking directly to the OpenGL ES library that’s been crammed into the firmware blob. The latter is certainly a possibility, but probably more complex than I want to try to do in ARM assembly.

Sure, it just looks like a bunch of random lines, but the important part of this test program is that they’re lines, and that they’re (pseudo) random. So… okay, yeah. Just random lines.

When I was working on this stuff back in October, I left off just as things were getting interesting. I’d just finished the line drawing routines, and so the stuff I’ve been doing this week is all about rendering and formatting text. And I promise that’s a lot more exciting than it sounds.

Just getting a character up on the screen is quite a bit of work. The tutorial takes the expedient approach of linking the bitmaps of the font directly into the kernel image. It’s the only option, since Baking Pi never gets around to discussing filesystem access (which, admittedly, would be a big topic). So rendering a character becomes a task of finding its image data in memory and copying it into the right location in the framebuffer, one pixel at a time. You’re basically looking at pointer arithmetic and a few nested loops. That’s a fair amount of code, but none of it is complicated.
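In C, the pointer arithmetic and nested loops come out something like the sketch below. It is not the tutorial's code: the DrawGlyph name, the 8x16 glyph size with one byte per row, the bit order (bit 0 as the leftmost pixel), and the 16-bit framebuffer are all assumptions chosen to keep the example short, and there is no clipping.

```c
#include <stdint.h>

enum { GLYPH_W = 8, GLYPH_H = 16 };

/* Draws one glyph into a 16-bit framebuffer that is fb_width pixels
 * wide. The caller must make sure the glyph fits on screen. */
void DrawGlyph (uint16_t* fb, int fb_width, const uint8_t* font,
                unsigned char ch, int x, int y, uint16_t colour)
{
    /* Each glyph occupies GLYPH_H consecutive bytes in the font. */
    const uint8_t* glyph = font + (ch * GLYPH_H);
    int row, col;

    for (row = 0; row < GLYPH_H; ++row)
    {
        uint8_t bits = glyph[row];
        uint16_t* dst = fb + ((y + row) * fb_width) + x;

        for (col = 0; col < GLYPH_W; ++col)
        {
            /* Only set pixels are written; the background is kept. */
            if (bits & (1u << col))
                dst[col] = colour;
        }
    }
}
```

The row pointer is computed once per row and then indexed per pixel, which is about as much structure as the assembly version needs too.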

Of course, I was coming to this project after a break of nearly four months, which adds its own complexity. I stepped away in the first place because of limited time, but also because switching between 6502 assembly and the much more complex ARM assembly for different projects was making my head hurt. After all, this is one of the reasons that high-level languages were invented in the first place.

Once I sat down and concentrated, I was able to “swap in” the ARM assembly I’d learned before, but it took a while before I really felt up to speed. 6502 stuff I can mostly write from memory, but for the ARM CPU I can’t get very far without a reference sheet. It’s important to remember that the ARM assembly language is much more complex: There are more opcodes, they take more arguments, and they have more variants and options.

A good example of how the complexity and options can trip you up happened to me when I started to work on routines to format numbers for printing. For example, I was trying to convert the number 456 to the character string “456” – but all I got was the “6”. I banged my head against this problem for more than an hour before I realized how trivial my mistake was. All the math was fine, and all the digits calculated correctly. But when I copied the character representation of each digit into my string buffer, I was using a 32-bit store (STR) instead of an 8-bit store (STRB). String-handling madness ensued.
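The mechanism is easy to imitate in C. The sketch below emulates STRB and a little-endian STR with two hypothetical helpers and writes the digits right to left, which is my own assumption about the order. The exact symptom depends on the order and addresses of the stores, so this does not reproduce the lone “6”; it only shows how each four-byte store wipes out the neighbouring characters.

```c
#include <stdint.h>

/* Emulates STRB: writes a single byte. */
static void store_byte (char* p, uint32_t v)
{
    p[0] = (char) (v & 0xFF);
}

/* Emulates a little-endian STR: writes four bytes, so the three
 * bytes after p are overwritten with the upper (zero) bytes of v. */
static void store_word (char* p, uint32_t v)
{
    p[0] = (char) (v & 0xFF);
    p[1] = (char) ((v >> 8) & 0xFF);
    p[2] = (char) ((v >> 16) & 0xFF);
    p[3] = (char) ((v >> 24) & 0xFF);
}

/* Writes the low ndigits decimal digits of n into buf, right to left.
 * buf needs room for ndigits + 4 bytes because of the word stores. */
void FormatDigits (char* buf, int ndigits, unsigned int n, int use_word_store)
{
    int i;

    buf[ndigits] = '\0';
    for (i = ndigits - 1; i >= 0; --i)
    {
        uint32_t ch = (uint32_t) ('0' + (n % 10));

        if (use_word_store)
            store_word (buf + i, ch);
        else
            store_byte (buf + i, ch);
        n /= 10;
    }
}
```

With byte stores, FormatDigits on 456 leaves “456” in the buffer. With word stores, every digit is followed by three zero bytes that land on top of the digits written earlier, so the string ends after a single character.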

Of course, it didn’t help that I’d been spending so much time lately doing 6502 assembly, where all operations are 8-bit operations. A 32-bit CPU is a different kind of beast altogether.

A close-up picture of the number-formatting test cases. Hopefully semi-legible.

Speaking of calculations, one of the really fun things about this part of the tutorial was learning a method for division of binary integers. One of the first really complicated bits of 6502 assembly I learned was a software multiply for 16-bit integers. Well, the Pi’s ARM CPU may have a built-in multiply, but it still doesn’t have a division instruction.

In the tutorial, the division operator is used to convert numbers to different numeric bases for display. You divide the number by the base, and the remainder is the rightmost digit of the result. Do that a bunch of times and you’re done.
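Written out in C, the divide-and-take-the-remainder loop looks roughly like this. FormatUnsigned is a hypothetical name, and the temporary buffer and reversal step are just one way to deal with the digits coming out rightmost first.

```c
/* Writes n in the given base (2 to 16) into out and returns the
 * number of characters written. out needs room for 33 bytes. */
int FormatUnsigned (unsigned int n, unsigned int base, char* out)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[32];
    int count = 0;
    int i;

    if (base < 2 || base > 16)
    {
        out[0] = '\0';
        return 0;
    }

    /* Each remainder is the next digit, starting from the right. */
    do
    {
        tmp[count++] = digits[n % base];
        n /= base;
    } while (n != 0);

    /* The digits were produced in reverse, so copy them back to front. */
    for (i = 0; i < count; ++i)
        out[i] = tmp[count - 1 - i];
    out[count] = '\0';
    return count;
}
```

The do/while form makes sure that zero still produces a single “0” rather than an empty string.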

The actual division operation is done in binary, and is conceptually pretty similar to conventional long division. One advantage of doing it in binary is that you never have to do any multiplication – it’s pretty much all handled by a series of shifts and subtractions. As with multiplication, it’s an expensive operation, but pretty understandable once you’ve worked through a few examples in your head.
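Here is a minimal C sketch of shift-and-subtract division, one bit of the dividend at a time. It is the textbook “restoring” form and not necessarily the exact variant the tutorial implements; DivideU32 is a name I made up. The running remainder is kept in a 64-bit variable so that the shift cannot overflow for very large divisors.

```c
#include <stdint.h>

/* Returns dividend / divisor and stores dividend % divisor in
 * *remainder. The divisor must be nonzero. */
uint32_t DivideU32 (uint32_t dividend, uint32_t divisor, uint32_t* remainder)
{
    uint32_t quotient = 0;
    uint64_t rem = 0;
    int bit;

    for (bit = 31; bit >= 0; --bit)
    {
        /* Bring down the next bit of the dividend, as in long division. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);

        /* If the divisor fits, subtract it and record a 1 in the quotient. */
        if (rem >= divisor)
        {
            rem -= divisor;
            quotient |= (1u << bit);
        }
    }

    *remainder = (uint32_t) rem;
    return quotient;
}
```

Because each quotient digit can only be 0 or 1, the “how many times does it go in” step of decimal long division collapses into a single comparison, which is why no multiplication is needed.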

In order to keep my fun, colorful background while working on the string formatting and drawing routines, I also added a couple routines for drawing filled or outlined rectangles. Quickly putting together those two routines made me think about some of the tradeoffs in the Baking Pi tutorials – specifically, performance versus robustness.

For example, the tutorial’s implementation of the Bresenham line-drawing algorithm is done entirely in screen coordinates. Each pixel is sent to a DrawPixel routine that checks if the pixel is on-screen and then calculates its position in the framebuffer’s memory. There are ways to combine these operations for a significant speed boost, and in assembly written for a C64 demo, those optimizations are done as a matter of course. The approach here is more like traditional application development: small, well-encapsulated functions with error-checking and a lot of abstraction. This is a great way to write software, but it does kind of defeat the purpose of writing it in assembly language instead of C.
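The tradeoff is easier to see with a horizontal line, like the rows of a filled rectangle, than with the full Bresenham routine. In the sketch below the screen size, the 16-bit pixel format, and the DrawHLineSafe and DrawHLineFast names are all hypothetical. The first version goes through a checked DrawPixel for every pixel; the second clips once, computes the address once, and then just walks a pointer.

```c
#include <stdint.h>

enum { SCREEN_W = 64, SCREEN_H = 32 };   /* hypothetical screen size */

uint16_t framebuffer[SCREEN_W * SCREEN_H];

/* Robust version: every pixel is bounds-checked and its address is
 * recomputed from scratch. */
void DrawPixel (int x, int y, uint16_t colour)
{
    if (x < 0 || x >= SCREEN_W || y < 0 || y >= SCREEN_H)
        return;
    framebuffer[(y * SCREEN_W) + x] = colour;
}

void DrawHLineSafe (int x0, int x1, int y, uint16_t colour)
{
    int x;

    for (x = x0; x <= x1; ++x)
        DrawPixel (x, y, colour);
}

/* Faster version: clip the span once, compute the start address
 * once, then step a pointer across the row. */
void DrawHLineFast (int x0, int x1, int y, uint16_t colour)
{
    uint16_t* p;
    uint16_t* end;

    if (y < 0 || y >= SCREEN_H)
        return;
    if (x0 < 0)
        x0 = 0;
    if (x1 >= SCREEN_W)
        x1 = SCREEN_W - 1;
    if (x0 > x1)
        return;

    p = framebuffer + (y * SCREEN_W) + x0;
    end = p + (x1 - x0) + 1;
    while (p < end)
        *p++ = colour;
}
```

Both versions draw the same pixels. The fast one trades the small, reusable DrawPixel abstraction for an inner loop with no multiply and no comparisons against the screen bounds.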

This is something I’ve been thinking about for future projects. I’ve been doing assembly language coding as a hobby because I appreciate the thrill of communicating with the machine on that low level. On the other hand, it does take a long time to write anything really substantial. It might be time to start combining the high-level languages with the languages of the raw iron. One way to do that would be to add some functions written in C to the Baking Pi’s pseudo-OS. I’d probably learn a lot about ABIs and the practicalities of linker configuration by doing that. Alternatively, I would like to play with the CC65 compiler for the C64 (or my own Project:65). Just ideas for now, but anything’s fair game.

CJ's Project Blog

software, hardware, and occasional geekery