Original routine

For Reevengi, I started from a texture-mapping routine that came from Amiga Doom, I think. I needed variable texture sizes and paletted-texture support, so I began with this routine:

;D0:	U and V texture coordinates, packed in a 32-bit value:
;	UUuuVVvv
;D1:	temp value
;D2:	number of bits for V coordinate
;D3:	number of bits for U coordinate
;D4:	number of pixels to draw, minus 1 (the loop runs until D4 goes negative)
;D5:	DU and DV texture increments, like D0: DUduDVdv
;D6:	texture pixel
;A0:	texture data (1 BYTE per pixel)
;A1:	palette data (1 LONG per color index, because it stores
;	a color value of up to 32 bits)
;A2:	screen

loop:
	move.l	D0,D1		; D1 = UUuuVVvv
	lsr.w	D2,D1		; D1 = UUuu00VV
	rol.l	D3,D1		; D1 = uu00VVUU
	move.b	(A0,D1.W),D6	; read pixel from texture
	move.b	3(A1,D6.W*4),D1	; read color from palette
	move.b	D1,(A2)+	; write pixel to screen
	add.l	D5,D0		; increment texture coordinate
	
	subq.l	#1,D4
	bpl.s	loop
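
The shift-and-rotate addressing in the routine above can be checked with a small C model. This is only a sketch, not code from Reevengi: the function name `texel_offset` and its parameters are made up for illustration, `v_shift` stands for whatever the routine holds in D2, and `u_bits` for the value in D3.

```c
#include <stdint.h>

/* Model of:  move.l D0,D1 / lsr.w D2,D1 / rol.l D3,D1, then D1.W used as index.
 * uv      : packed coordinates, UUuuVVvv
 * v_shift : shift count held in D2 (assumed 0..16 here)
 * u_bits  : rotate count held in D3 (assumed 1..31 here)
 * Note: the 68k sign-extends a .W index register; this model ignores that
 * and returns the low word as an unsigned value. */
uint16_t texel_offset(uint32_t uv, unsigned v_shift, unsigned u_bits)
{
    uint32_t hi = uv & 0xFFFF0000u;            /* lsr.w leaves the high word alone */
    uint32_t lo = (uv & 0x0000FFFFu) >> v_shift; /* lsr.w shifts only the low word  */
    uint32_t d1 = hi | lo;                     /* D1 = UUuu00VV when v_shift == 8  */

    /* rol.l: rotate the whole long left, bringing the top U bits to the bottom */
    d1 = (d1 << u_bits) | (d1 >> (32u - u_bits));

    return (uint16_t)d1;                       /* low word = VVUU when u_bits == 8 */
}
```

With `uv = 0x12345678` and both counts set to 8, the low word comes out as `0x5612`, that is V in the high byte and U in the low byte, which matches the `uu00VVUU` comment in the listing.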

Superscalar, thread-like

Amiga coders have known how to code for the 060 for a long time, so I stole some info here and there from their forums and blog posts. :)

As it's not easy to reorder the instructions that process a single pixel, I chose to duplicate them and process a second pixel in the same loop. I reordered them so that the memory accesses of the two 'threads' do not happen at the same time. The address registers (texture, palette) do not change, and we just need to write to the screen in 2 consecutive instructions. With enough data registers, we would have something like this:

loop:
;	Pixel 1				Pixel 2
	move.b	(A0,D11.W),D16		move.l	D20,D21
	move.b	3(A1,D16.W*4),D11	lsr.w	D22,D21
	move.b	D11,(A2)+		rol.l	D23,D21
	add.l	D15,D10			move.b	(A0,D21.W),D26
	move.l	D10,D11			move.b	3(A1,D26.W*4),D21
	lsr.w	D12,D11			move.b	D21,(A2)+
	rol.l	D13,D11			add.l	D25,D20
	
	subq.l	#1,D4
	bpl.s	loop
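
To see that two interleaved 'threads' draw the same span as the one-pixel loop, here is a rough C model. It is a sketch under assumptions the listing does not spell out: the second thread is taken to start one step ahead of the first, each thread advances by twice the per-pixel increment, and the texture is fixed at 256x256 so the offset is simply VVUU. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset for a 256x256 texture: V in the high byte, U in the low byte. */
uint16_t offset_8x8(uint32_t uv)
{
    return (uint16_t)((uv & 0xFF00u) | (uv >> 24));
}

/* One pixel per iteration, like the original routine. */
void draw_span_single(uint8_t *dst, size_t n, const uint8_t *tex,
                      const uint32_t *pal, uint32_t uv, uint32_t duv)
{
    for (size_t i = 0; i < n; i++) {
        /* move.b 3(A1,D6.W*4) reads the low byte of the palette long */
        dst[i] = (uint8_t)pal[tex[offset_8x8(uv)]];
        uv += duv;                      /* add.l on the packed UUuuVVvv value */
    }
}

/* Two pixels per iteration; n is assumed to be even. */
void draw_span_dual(uint8_t *dst, size_t n, const uint8_t *tex,
                    const uint32_t *pal, uint32_t uv, uint32_t duv)
{
    uint32_t uv_a = uv;                 /* thread 1: pixels 0, 2, 4, ... */
    uint32_t uv_b = uv + duv;           /* thread 2: pixels 1, 3, 5, ... */
    uint32_t step = 2u * duv;           /* assumption: each thread skips a pixel */

    for (size_t i = 0; i + 1 < n; i += 2) {
        dst[i]     = (uint8_t)pal[tex[offset_8x8(uv_a)]];
        dst[i + 1] = (uint8_t)pal[tex[offset_8x8(uv_b)]];
        uv_a += step;
        uv_b += step;
    }
}
```

Both functions should fill `dst` with identical bytes for any even `n`. The assembly version only changes the order in which the two threads' instructions are issued.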

Or, if you prefer a single instruction per line, here it is annotated with the help of chapter 10 of the 060 user manual, which tells you which instructions are pairable:

loop:
	move.b	(A0,D11.W),D16		; pOEP, allows sOEP
	move.l	D20,D21			; pOEP|sOEP

	move.b	3(A1,D16.W*4),D11	; pOEP, allows sOEP
	lsr.w	D22,D21			; pOEP|sOEP

	move.b	D11,(A2)+		; pOEP|sOEP
	rol.l	D23,D21			; pOEP|sOEP

	add.l	D15,D10			; pOEP|sOEP

	move.b	(A0,D21.W),D26		; pOEP, allows sOEP
	move.l	D10,D11			; pOEP|sOEP

	move.b	3(A1,D26.W*4),D21	; pOEP, allows sOEP
	lsr.w	D12,D11			; pOEP|sOEP

	move.b	D21,(A2)+		; pOEP|sOEP
	rol.l	D13,D11			; pOEP|sOEP

	add.l	D25,D20			; pOEP|sOEP
	subq.l	#1,D4			; pOEP|sOEP

	bpl.s	loop			; pOEP-only

Final version

Next to the final subq, we have room for another instruction, but it must not change the CCR, or the bpl instruction will not stop at the right time. Using the NOP instruction is not an option, since NOP synchronizes the pipeline. Address registers can be used for simple calculations, and the ADDA instruction does not change the CCR. So I use address registers for the UV and DUV values, and ADDA for incrementing the texture coordinates.

;A0:	U and V texture coordinates, packed in a 32-bit value:
;	UUuuVVvv
;A1:	screen
;A3:	texture data (1 BYTE per pixel)
;A4:	DU and DV texture increments, like A0: DUduDVdv
;A5:	palette data (1 LONG per color index, because it stores
;	a color value of up to 32 bits)

;D0,D1,D4,D5:	temp values
;D3:	number of bits for U coordinate
;D6:	(number of pixels to draw / 2) - 1 (the loop runs until D6 goes negative)
;D7:	number of bits for V coordinate

	move.l	a0,d4
	adda.l	a4,a0	; step to the second pixel, otherwise the first texel is sampled twice
	lsr.w	d7,d4
	rol.l	d3,d4
	moveq.l	#0,d5
	moveq.l	#0,d1

loop:
	move.b	(a3,d4:w),d5
	move.l	a0,d0

	move.b	3(a5,d5:w*4),d4
	lsr.w	d7,d0

	move.b	d4,(a1)+
	rol.l	d3,d0

	or.w	d5,d5	; nop
	adda.l	a4,a0

	move.b	(a3,d0:w),d1
	move.l	a0,d4

	move.b	3(a5,d1:w*4),d0
	lsr.w	d7,d4

	move.b	d0,(a1)+
	rol.l	d3,d4

	subq.w	#1,d6
	adda.l	a4,a0

	bpl.s	loop

In theory, this loop should process 2 pixels in 8 cycles. The next logical step would be to pack both pixels into a single word and write them to the screen at once, instead of using 2 move.b. And don't forget the FPU: it is a third unit that can process its instructions in parallel with the 2 integer units.
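
The single-word write can be modelled in C as follows. This is a sketch with made-up helper names, and it assumes the 68k's big-endian byte order, where the high byte of a word lands at the lower address, so the first pixel has to go into the high byte.

```c
#include <stdint.h>

/* Pack two 8-bit pixels into one word: the first pixel is shifted into the
 * high byte, the second is placed in the low byte. */
uint16_t pack_pixel_pair(uint8_t first, uint8_t second)
{
    return (uint16_t)(((unsigned)first << 8) | second);
}

/* Store the word the way a big-endian CPU would, independent of the host. */
void store_word_be(uint8_t *dst, uint16_t w)
{
    dst[0] = (uint8_t)(w >> 8);     /* first pixel, lower address   */
    dst[1] = (uint8_t)(w & 0xFFu);  /* second pixel, higher address */
}
```

For example, packing `0xAB` then `0xCD` gives the word `0xABCD`, and storing it leaves `0xAB` at the first screen byte and `0xCD` at the next one.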

Update: version without palette, and a single write to screen

;A0:	U and V texture coordinates, packed in a 32-bit value:
;	UUuuVVvv
;A1:	screen
;A3:	texture data (1 BYTE per pixel)
;A4:	DU and DV texture increments, like A0: DUduDVdv

;D0,D1,D2,D4,D5:	temp values
;D3:	number of bits for U coordinate
;D6:	(number of pixels to draw / 2) - 1 (the loop runs until D6 goes negative)
;D7:	number of bits for V coordinate

	move.l	a0,d4
	adda.l	a4,a0	; step to the second pixel, otherwise the first texel is sampled twice
	lsr.w	d7,d4
	rol.l	d3,d4
	moveq.l	#0,d5
	moveq.l	#0,d1

loop:
	move.b	(a3,d4:w),d2
	move.l	a0,d0

	lsl.w	#8,d2
	lsr.w	d7,d0

	adda.l	a4,a0
	rol.l	d3,d0

	move.l	a0,d4
	or.w	d5,d5	; nop

	move.b	(a3,d0:w),d2
	lsr.w	d7,d4

	move.w	d2,(a1)+
	rol.l	d3,d4

	subq.w	#1,d6
	adda.l	a4,a0

	bpl.s	loop

Also, using dbra instead of subq + bpl.s should work: both take 1 cycle if correctly predicted. Using dbra allows removing the OR.W d5,d5 that I used as filler to pair instructions. Which one fits best depends on whether the loop has an odd or even number of instructions to pair.