In the course of developing a video player for TMS VDPs, I am trying to minimize the time that output to VRAM takes. The datasheet says that register writes must be at least 2 µs apart (7.15 T-cycles) and VRAM writes at least 8 µs apart (28.63 T-cycles) in SCREEN 2 mode. The T-cycles here are Z80 T-cycles at the MSX clock of 3.58 MHz.
So, in general, I decided that the following is enough to set up the pointer for writing:
    xor a
    out (099h),a
    ld a,040h
    out (099h),a
This ignores the additional 0.15 T-cycles.
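For an arbitrary VRAM address, the same two-write sequence can be wrapped in a small routine. This is only a sketch: the label SetVramWrite and the convention of passing the address in HL are made up for illustration, and the T-cycle counts in the comments assume MSX timing (one extra wait state per M1 cycle), which is what the figures in this post appear to use. The DI is there because an interrupt handler that reads the VDP status register between the two writes would reset the address latch.

```asm
; Set the VDP write pointer to the VRAM address in HL (0000h-3FFFh).
; Clobbers A. Interrupts stay disabled on return; the caller decides
; when to re-enable them.
SetVramWrite:
        di                      ;  5  keep the two control writes together
        ld      a,l             ;  5
        out     (099h),a        ; 12  low byte of the address
        ld      a,h             ;  5
        and     03Fh            ;  8  keep address bits 13..8
        or      040h            ;  8  bit 6 set = set up for writing
        out     (099h),a        ; 12  high byte plus write flag
        ret                     ; 11
```

The two OUTs end up 33 T-cycles apart, so the 2 µs minimum for register writes is met with a wide margin.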
With VRAM writes, however, I get the following results:
OUTI + NOP -> 23 T-cycles -> buggy image
OUTI + INC DE -> 25 T-cycles -> buggy image
OUTI + LD A,(HL) -> 26 T-cycles -> good image in openMSX, but buggy image on real TMS9918 machine
OUTI + RES 0,A -> 28 T-cycles -> good image in both openMSX and on real TMS9918
Unfortunately, I could not find a 9 T-cycle instruction to pair with OUTI so that the total execution time would be 27 T-cycles. So the routine should look like 256 repetitions of OUTI + RES 0,A with a JP NZ at the end, which counts the rows of 8-pixel blocks (24 in total).
Initially I had OUTI + JR NZ looping back to the OUTI, which took 31 T-cycles; then OUTI + JP NZ, which takes 29. The version with RES takes more space but runs one T-cycle faster per byte (256 T-cycles less for one row of 8-pixel blocks, and 6144 T-cycles less for the whole screen area).
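The unrolled OUTI + RES 0,A routine could look roughly like the sketch below. Several things are assumptions for illustration: the labels CopyScreen, CopyBlock and FrameData, the use of D as the block counter, and the REPT/ENDM repeat directive (Pasmo-style; other assemblers spell it differently). B cannot serve as the counter here, because 256 OUTIs decrement it 256 times and bring it back to its starting value. The sketch also assumes the VDP write pointer is already set and interrupts are disabled.

```asm
; Copy 6144 bytes from RAM to VRAM at 28 T-cycles per byte (MSX timing).
CopyScreen:
        ld      hl,FrameData    ; source address in RAM
        ld      c,098h          ; VDP data port
        ld      d,24            ; 24 blocks of 256 bytes = 6144 bytes
CopyBlock:
        rept    256
        outi                    ; 18  (C) <- (HL), HL+1, B-1
        res     0,a             ; 10  padding only, flags untouched
        endm
        dec     d               ;  5
        jp      nz,CopyBlock    ; 11  gap across the jump: 44 T-cycles
        ret

FrameData:
        ds      6144            ; placeholder for the picture data
```

Only the write that follows the jump is slower (44 T-cycles instead of 28), which costs 16 T-cycles per 256 bytes. The unrolled body takes 1024 bytes of code.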
Any better ideas?
Update: I figured out how to get 27 T-cycles; it is as simple as
    ld a,(hl)
    inc hl
    out (098h),a
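Put into a complete copy loop, the 27 T-cycle sequence might be used as follows. As before, this is a sketch under assumptions rather than the exact routine I run: the labels CopyScreen27, CopyBlock27 and FrameData27, the block counter in D, and the Pasmo-style REPT/ENDM directive are illustrative, and the VDP write pointer is assumed to be set already with interrupts disabled.

```asm
; Copy 6144 bytes from RAM to VRAM at 27 T-cycles per byte (MSX timing).
CopyScreen27:
        ld      hl,FrameData27  ; source address in RAM
        ld      d,24            ; 24 blocks of 256 bytes = 6144 bytes
CopyBlock27:
        rept    256
        ld      a,(hl)          ;  8  fetch the next byte
        inc     hl              ;  7  advance the source pointer
        out     (098h),a        ; 12  write it to the VDP data port
        endm
        dec     d               ;  5
        jp      nz,CopyBlock27  ; 11  gap across the jump: 43 T-cycles
        ret

FrameData27:
        ds      6144            ; placeholder for the picture data
```

Each repetition is 4 bytes long, the same as OUTI + RES 0,A, so the code size does not change while every byte is written one T-cycle sooner. B and C are left free because the port number is an immediate operand.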
I am testing it right now; the picture on a real TMS seems stable, with no glitches on the screen.
I should add that I disable sprites by setting the Y coordinate of the first one to 208.