Original routine
For Reevengi, I started with a texture mapping routine that came from Amiga Doom, I think. I needed variable texture sizes and paletted texture support, so I began with this routine:
;D0: U and V texture coordinates, packed in a 32bits value:
;    UUuuVVvv
;D1: temp value
;D2: number of bits for V coordinate
;D3: number of bits for U coordinate
;D4: number of pixels to draw
;D5: DU and DV texture increments, like D0: DUduDVdv
;D6: texture pixel
;A0: texture data (1 BYTE per pixel)
;A1: palette data (1 LONG per color index, because it stores
;    up to 32bits color value)
;A2: screen

loop:
    move.l  D0,D1            ; D1 = UUuuVVvv
    lsr.w   D2,D1            ; D1 = UUuu00VV
    rol.l   D3,D1            ; D1 = uu00VVUU
    move.b  (A0,D1.W),D6     ; read pixel from texture
    move.b  3(A1,D6.W*4),D1  ; read color from palette
    move.b  D1,(A2)+         ; write pixel to screen
    add.l   D5,D0            ; increment texture coordinate
    subq.l  #1,D4
    bpl.s   loop
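To make the index computation easier to follow, here is a minimal C sketch of what the routine above computes. It is only a reference model, not code from Reevengi: the names `rol32`, `texel_index` and `draw_span` are hypothetical, the parameters `d2` and `d3` simply mirror the shift counts held in D2 and D3, and the index is treated as unsigned (the real `(A0,D1.W)` addressing mode sign-extends it).

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left, like rol.l (count taken modulo 32). */
static uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Model of move.l / lsr.w / rol.l: turn packed UUuuVVvv into a texel index.
 * d2 and d3 are the shift counts held in D2 and D3 (names are assumptions). */
static uint16_t texel_index(uint32_t uv, unsigned d2, unsigned d3)
{
    assert(d2 <= 16 && d3 < 32);
    uint32_t low = (uv & 0xFFFFu) >> d2;        /* lsr.w D2,D1: low word only */
    uint32_t d1 = (uv & 0xFFFF0000u) | low;     /* D1 = UUuu00VV */
    return (uint16_t)rol32(d1, d3);             /* D1 = uu00VVUU, low word = index */
}

/* Model of the whole loop. subq/bpl runs the body count + 1 times.
 * Assumption: indices are used as unsigned values in this sketch. */
static void draw_span(uint8_t *screen, const uint8_t *texture,
                      const uint32_t *palette, uint32_t uv, uint32_t duv,
                      unsigned d2, unsigned d3, int count)
{
    for (int i = 0; i <= count; i++) {
        uint8_t texel = texture[texel_index(uv, d2, d3)];
        *screen++ = (uint8_t)palette[texel];    /* byte 3 of a big-endian long */
        uv += duv;                              /* add.l D5,D0 */
    }
}
```

With both shift counts set to 8, `texel_index(0x12345678, 8, 8)` yields `0x5612`, that is VV followed by UU, matching the register comments in the listing.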
Superscalar, thread-like
Amiga coders have known how to code for the 060 for a long time, so I stole some info here and there from their forums and blog posts. :)
As it's not easy to reorder instructions when processing a single pixel, I chose to duplicate the code and process a second pixel in the same loop. I reordered the instructions so that the memory accesses of both 'threads' do not happen at the same time. The address registers (texture, palette) do not change, and we just need to write to the screen in 2 consecutive instructions. With enough data registers, we would have something like this:
loop:
    ; Pixel 1                    Pixel 2
    move.b (A0,D11.W),D16        move.l D20,D21
    move.b 3(A1,D16.W*4),D11     lsr.w D22,D21
    move.b D11,(A2)+             rol.l D23,D21
    add.l D15,D10                move.b (A0,D21.W),D26
    move.l D10,D11               move.b 3(A1,D26.W*4),D21
    lsr.w D12,D11                move.b D21,(A2)+
    rol.l D13,D11                add.l D25,D20
    subq.l #1,D4
    bpl.s loop
Or, if you prefer a single instruction per line, here it is annotated with the help of the 060 user manual, chapter 10, which tells you which instructions are pairable:
loop:
    move.b (A0,D11.W),D16       ; pOEP, allows sOEP
    move.l D20,D21              ; pOEP|sOEP
    move.b 3(A1,D16.W*4),D11    ; pOEP, allows sOEP
    lsr.w D22,D21               ; pOEP|sOEP
    move.b D11,(A2)+            ; pOEP|sOEP
    rol.l D23,D21               ; pOEP|sOEP
    add.l D15,D10               ; pOEP|sOEP
    move.b (A0,D21.W),D26       ; pOEP, allows sOEP
    move.l D10,D11              ; pOEP|sOEP
    move.b 3(A1,D26.W*4),D21    ; pOEP, allows sOEP
    lsr.w D12,D11               ; pOEP|sOEP
    move.b D21,(A2)+            ; pOEP|sOEP
    rol.l D13,D11               ; pOEP|sOEP
    add.l D25,D20               ; pOEP|sOEP
    subq.l #1,D4                ; pOEP|sOEP
    bpl.s loop                  ; pOEP-only
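The two-'thread' idea can also be checked in C. The sketch below is my reading of the pseudo-registers above, not something stated in the listing: each lane keeps its own packed coordinate, lane 2 starts one step ahead, and both lanes advance by twice the increment. The palette lookup is left out for brevity, and the names `span_single` and `span_paired` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same index computation as the single-pixel routine (lsr.w, then rol.l). */
static uint16_t texel_index(uint32_t uv, unsigned d2, unsigned d3)
{
    assert(d2 <= 16 && d3 >= 1 && d3 <= 31);
    uint32_t d1 = (uv & 0xFFFF0000u) | ((uv & 0xFFFFu) >> d2);
    return (uint16_t)((d1 << d3) | (d1 >> (32u - d3)));
}

/* One pixel per iteration, as in the original loop (palette omitted). */
static void span_single(uint8_t *dst, const uint8_t *tex, uint32_t uv,
                        uint32_t duv, unsigned d2, unsigned d3, int pixels)
{
    for (int i = 0; i < pixels; i++, uv += duv)
        dst[i] = tex[texel_index(uv, d2, d3)];
}

/* Two pixels per iteration. Assumption: each lane owns a coordinate and
 * steps by 2 * duv, and the pixel count is even. */
static void span_paired(uint8_t *dst, const uint8_t *tex, uint32_t uv,
                        uint32_t duv, unsigned d2, unsigned d3, int pixels)
{
    assert(pixels % 2 == 0);
    uint32_t uv_a = uv;          /* lane 1: pixels 0, 2, 4, ... */
    uint32_t uv_b = uv + duv;    /* lane 2: pixels 1, 3, 5, ... */
    uint32_t step = 2u * duv;
    for (int i = 0; i < pixels; i += 2) {
        uint16_t idx_a = texel_index(uv_a, d2, d3);
        uint16_t idx_b = texel_index(uv_b, d2, d3);
        dst[i] = tex[idx_a];
        dst[i + 1] = tex[idx_b];
        uv_a += step;
        uv_b += step;
    }
}
```

Because the packed add wraps modulo 2^32 in both versions, the two functions should fill the destination with the same bytes. Only the instruction order inside the loop, which C does not express, differs on the real CPU.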
Final version
The final subq leaves room for one more instruction in its pair, but that instruction must not change the CCR, or the bpl instruction will not stop at the right time. Using the NOP instruction is not an option, since NOP forces a pipeline synchronization on the 060. Address registers can perform simple calculations, and the ADDA instruction does not change the CCR. So I use address registers for the UV and DUV values and for incrementing the texture coordinates.
;A0: U and V texture coordinates, packed in a 32bits value:
;    UUuuVVvv
;A1: screen
;A3: texture data (1 BYTE per pixel)
;A4: DU and DV texture increments, like A0: DUduDVdv
;A5: palette data (1 LONG per color index, because it stores
;    up to 32bits color value)
;D0,D1,D4,D5: temp values
;D3: number of bits for U coordinate
;D6: number of pixels to draw/2
;D7: number of bits for V coordinate

    move.l  a0,d4
    lsr.w   d7,d4
    rol.l   d3,d4
    moveq.l #0,d5
    moveq.l #0,d1

loop:
    move.b  (a3,d4:w),d5
    move.l  a0,d0
    move.b  3(a5,d5:w*4),d4
    lsr.w   d7,d0
    move.b  d4,(a1)+
    rol.l   d3,d0
    or.w    d5,d5    ; nop
    adda.l  a4,a0
    move.b  (a3,d0:w),d1
    move.l  a0,d4
    move.b  3(a5,d1:w*4),d0
    lsr.w   d7,d4
    move.b  d0,(a1)+
    rol.l   d3,d4
    subq.w  #1,d6
    adda.l  a4,a0
    bpl.s   loop
In theory, this loop should process 2 pixels in 8 cycles. The next logical step would be to store both pixels in a single word and write them to the screen at once, instead of using 2 move.b instructions. And don't forget the FPU: it is a third unit that can process its instructions in parallel with the 2 integer units.
Update: version without palette, and a single write to screen
;A0: U and V texture coordinates, packed in a 32bits value:
;    UUuuVVvv
;A1: screen
;A3: texture data (1 BYTE per pixel)
;A4: DU and DV texture increments, like A0: DUduDVdv
;D0,D1,D2,D4,D5: temp values
;D3: number of bits for U coordinate
;D6: number of pixels to draw/2
;D7: number of bits for V coordinate

    move.l  a0,d4
    lsr.w   d7,d4
    rol.l   d3,d4
    moveq.l #0,d5
    moveq.l #0,d1

loop:
    move.b  (a3,d4:w),d2
    move.l  a0,d0
    lsl.w   #8,d2
    lsr.w   d7,d0
    adda.l  a4,a0
    rol.l   d3,d0
    move.l  a0,d4
    or.w    d5,d5    ; nop
    move.b  (a3,d0:w),d2
    lsr.w   d7,d4
    move    d2,(a1)+
    rol.l   d3,d4
    subq.w  #1,d6
    adda.l  a4,a0
    bpl.s   loop
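The single write works because the 68k is big-endian: the texel shifted into the high byte by `lsl.w #8,d2` lands at the lower screen address. Here is a small C sketch of that packing, with hypothetical helper names (`pack_texels`, `store_word_be`, `write_pairs`) and with the texel fetch replaced by a plain input array:

```c
#include <assert.h>
#include <stdint.h>

/* lsl.w #8,d2 followed by move.b ...,d2: first texel high, second low. */
static uint16_t pack_texels(uint8_t first, uint8_t second)
{
    return (uint16_t)(((unsigned)first << 8) | second);
}

/* Big-endian 16-bit store, modelling the word move to (a1)+ on a 68k. */
static uint8_t *store_word_be(uint8_t *screen, uint16_t word)
{
    screen[0] = (uint8_t)(word >> 8);     /* high byte at the lower address */
    screen[1] = (uint8_t)(word & 0xFFu);  /* low byte at the next address */
    return screen + 2;
}

/* Write `pairs` packed words; texels holds 2 * pairs already-fetched bytes. */
static void write_pairs(uint8_t *screen, const uint8_t *texels, int pairs)
{
    assert(pairs >= 0);
    for (int i = 0; i < pairs; i++)
        screen = store_word_be(screen,
                               pack_texels(texels[2 * i], texels[2 * i + 1]));
}
```

The screen ends up with the texels in the same order as two separate byte writes would produce, so only the number of stores changes.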
Also, using dbra instead of subq + bpl.s should work: they both take 1 cycle if correctly predicted. Using dbra allows removal of the OR.W d5,d5 that I used to pair instructions. Whether you need such a filler depends on whether your loop has an odd or even number of instructions to pair.
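As a quick sanity check that both loop endings run the body the same number of times, here is a small C model of the two counters. The function names are hypothetical, and the sketch only claims agreement for the counts tried, which sit well inside the positive 16-bit range.

```c
#include <stdint.h>

/* subq.w #1,d6 / bpl.s loop: repeat while the decremented word is not negative. */
static unsigned iterations_subq_bpl(uint16_t d6)
{
    unsigned n = 0;
    do {
        n++;                          /* loop body */
        d6 = (uint16_t)(d6 - 1u);     /* subq.w #1,d6 */
    } while ((d6 & 0x8000u) == 0);    /* bpl: N flag clear */
    return n;
}

/* dbra d6,loop: decrement the word, repeat until it becomes -1 (0xFFFF). */
static unsigned iterations_dbra(uint16_t d6)
{
    unsigned n = 0;
    do {
        n++;                          /* loop body */
        d6 = (uint16_t)(d6 - 1u);     /* dbra decrements the low word */
    } while (d6 != 0xFFFFu);
    return n;
}
```

For a counter of 3, both models run the body 4 times, so the same "number of pixels to draw/2" value in D6 can be kept when switching between the two endings.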