# Assembly Z80, best way to divide by 16

Page 1/2
| 2

Hi guys, I need to improve this code. So far I have these results (using standard Z80 timings, not M1 timings).

The goal is to divide the value in the H register by 16 (ideally without destroying the HL or DE registers):

```
    ; ; divide by 16 (32 cycles)
    ; srl     h
    ; srl     h
    ; srl     h
    ; srl     h

    ; ; divide by 16 (27 cycles)
    ; ld      a, h
    ; and     11110000b
    ; rrca
    ; rrca
    ; rrca
    ; rrca

    ; divide by 16 (18 cycles)
    ; B must already hold the high byte of the table address
    ld      c, h
    ld      a, (bc)
```

The third one needs a 256-byte lookup table with all values pre-loaded (the table must also be aligned to a 256-byte boundary, i.e. the low byte of its address must be 0x00).
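As a rough sketch of what that table setup could look like, the routine below fills the table at run time and then does the lookup. `DIV16_TABLE`, its address `0C000h`, and the label names are hypothetical; the only real requirement is that the table starts on a 256-byte boundary in RAM.

```
DIV16_TABLE     equ     0C000h          ; hypothetical address, low byte must be 00h

; Fill the table once: entry i holds i / 16
InitDiv16Table:
        ld      hl, DIV16_TABLE         ; L doubles as the index i
InitDiv16Loop:
        ld      a, l
        rrca
        rrca
        rrca
        rrca                            ; swap the nibbles of i
        and     00001111b               ; keep the old high nibble: A = i / 16
        ld      (hl), a
        inc     l                       ; next entry, wraps to 0 after 255
        jr      nz, InitDiv16Loop
        ret

; Lookup fragment: A = H / 16, HL and DE untouched
        ld      b, DIV16_TABLE / 256    ; high byte of the table, can be loaded once
        ld      c, h
        ld      a, (bc)
```

The table could equally be emitted at assembly time with `db` lines; filling it at run time just avoids relying on assembler-specific alignment and repeat directives.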

Speed is more crucial here than size.

Is there a better solution?

If speed is more crucial here than size, go for the look up table. If you do more than one division, you can even "reuse" b.

Btw, as we are in an MSX forum, I think you are not taking into account M1 wait states in your measures. They should read: 40 cycles, 33 cycles, and 21 cycles.
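To see where those numbers come from: on MSX every M1 (opcode fetch) cycle gets one extra wait state, so single-opcode instructions gain 1 cycle and CB-prefixed ones gain 2. The tally below follows that rule. It assumes the 18-cycle figure also counts a `ld b, n` that loads the table's high byte, which the posted snippet does not show.

```
; instruction        Z80    MSX (Z80+M1)
; ---------------------------------------
; srl   h              8     10
;   x4                32     40
;
; ld    a, h           4      5
; and   n              7      8
; rrca (x4)           16     20
;   total             27     33
;
; ld    b, n           7      8      ; assumed, not in the posted snippet
; ld    c, h           4      5
; ld    a, (bc)        7      8
;   total             18     21
```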

theNestruo wrote:

Btw, as we are in an MSX forum, I think you are not taking into account M1 wait states in your measures. They should read: 40 cycles, 33 cycles, and 21 cycles.

The VS Code extension that I use shows standard Z80 timing, is there one that shows MSX times?

albs_br wrote:

The VS Code extension that I use shows standard Z80 timing, is there one that shows MSX times?

If you are using Z80 Assembly meter, there is a `z80-asm-meter.platform` setting. Set it to `msx`.

Quote:
• `z80-asm-meter.platform`: Controls the instruction set to use and the timing information to display:
  • `z80` (default): Uses the default Z80 instruction set and shows default timing information.
  • `msx`: For MSX developers. Uses the default Z80 instruction set and shows Z80+M1 timing information (MSX standard).
  • (...)

Another method:

```
	ld	hl,0D000h
	ld	(hl),Value	; Put the value to divide by 16 at 0D000h
	xor	a
	rld			; A = the value divided by 16
```

The division takes 22 cycles (+ 3 for M1).

gdx wrote:

Another method:

```
	ld	hl,0D000h
	ld	(hl),Value	; Put the value to divide by 16 at 0D000h
	xor	a
	rld			; A = the value divided by 16
```

The division takes 22 cycles (+ 3 for M1).

Interesting, but I count 44 cycles, and 47 if the value is not a register.

`rld` (and `rrd`) were my first idea, but they take 20 cycles (including M1), so any prior setup makes that solution slower than the LUT solution above, particularly if you need to preserve the HL pair.

I suggested the RLD method because it lets you do a series of divisions without reloading HL or executing XOR A each time; once is enough. The value to divide can then come from register B, C, D or E (e.g. LD (HL),B). It is even possible to use IX instead of HL. That is slower, but in some cases it avoids register juggling that would cost more time overall.

Also, the method below is almost the same as the one with rrca when the M1 cycles are taken into account.
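A minimal sketch of such a series, assuming the scratch byte at `0D000h` from the earlier post and that nothing between the divisions sets the high nibble of A. RLD only writes the low nibble of A, so the zero left there by the single `xor a` survives.

```
        ld      hl, 0D000h      ; scratch byte in RAM, set up once
        xor     a               ; high nibble of A = 0, RLD never changes it

        ld      (hl), b
        rld                     ; A = B / 16 (the scratch byte is clobbered)
        ; ... use A here, keeping its high nibble at 0 ...

        ld      (hl), c
        rld                     ; A = C / 16
```

Each further division then costs `ld (hl), r` plus `rld`: 7 + 18 = 25 cycles in standard Z80 timing, or 8 + 20 = 28 with M1 wait states.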

```
     ld      a,h
     srl     a
     srl     a
     srl     a
     srl     a
```

I just asked MDL to produce the optimal sequence for this (without using memory), and it came up with these two alternatives. The second is the same as @gdx proposed, but the first was curious (both of them with the same timing).

```
xor a
xor h
rra
sra a
sra a
sra a
```
```
ld a, h
srl a
srl a
srl a
srl a
```

It comes up with a few other alternatives with the same timing, but they are basically variations of these.
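For anyone wondering why the first sequence works, here is the same code with my reading of each step as comments. The cycle counts are standard Z80 timings as I count them, not MDL output.

```
        xor     a       ; A = 0, CF = 0                          (4)
        xor     h       ; A = H, CF still 0                      (4)
        rra             ; A = H >> 1, bit 7 = old CF = 0         (4)
        sra     a       ; bit 7 is 0, so the arithmetic shift    (8)
        sra     a       ; copies in zeros, exactly like srl      (8)
        sra     a       ; A = H >> 4                             (8)
                        ; total: 36, same as ld a,h + 4 x srl a (4 + 32)
```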

santiontanon wrote:

I just asked MDL to produce the optimal sequence for this (without using memory) ...

Did you maybe also exclude some immediate values from the search-space(*)? I'm asking because albs_br's original solution (ld a,h ; and 0xf0 ; 4x rrca) is faster than these "optimal" solutions.

(*) It's very typical for super-optimizers to only allow a limited number of immediate values. Otherwise the search-space explodes.

I have been trying to port the upkr unpacker to Z80 recently, and part of that decompression algorithm is also a `((prob+8)>>4)` expression. My initial version was:

```
and $F8
rra
rra
rra
rra
```
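The snippet above leaves `prob >> 4` in A; the rounding term is not shown in the post. One possible way to finish the expression, as an assumption rather than the actual upkr code, relies on the last `rra` leaving bit 3 of `prob` in the carry flag:

```
        and     $F8             ; clear bits 0-2, CF = 0
        rra
        rra
        rra
        rra                     ; A = prob >> 4, CF = bit 3 of prob
        adc     a, 0            ; A = (prob >> 4) + bit 3 = (prob + 8) >> 4
```

This works because adding 8 before the shift only changes the result when bit 3 of `prob` is set.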

The final version is a bit different because I also make use of the carry coming in as input (making it +0/+16 depending on the initial carry), so it is:

```
    rra                             ; + (bit<<4) ; part of -prob_offset, needs another -16
    and     $FC                     ; clear/keep correct bits to get desired (prob>>4) + extras, CF=0
    rra
    rra
    rra                             ; A = (bit<<4) + (prob>>4), CF=(prob & 8)
    adc     a,-16                   ; A = (bit<<4) - 16 + ((prob + 8)>>4) ; -prob_offset = (bit<<4) - 16
```

(The example snapshot is for the ZX Spectrum, but technically the code should also work on MSX. To test it with your own data you need to build the packer from the Rust source. I asked exoticorn, the author of upkr, to provide me with a few packed ZX screens so I could avoid that part, as I was only interested in writing the Z80 code and having fun with that. :) )

santiontanon wrote:

...

BTW, if you are bored, would you try MDL on the unpack.asm? Maybe I overlooked some further optimisation.
