Interesting test! I just tried it, commented out the jr c,WaitReady
in VDPCommand_Execute_HL, and the CPU time of the diagonal tile rendering code goes down from 33.55% (11.18 ms) to 26.90% (8.97 ms). Looks like it’s definitely VDP bound there for 2.22 ms while drawing the 32 tile fragments.
@Grauw. What is the size of the command (W x H) you are sending to the VDP in diagonal scrolling? 248 cycles is about a scanline!
I posted the vdpcmdtrace of all the copies I do each 30 fps frame here:
VDPCmd YMMM-IMP (0,782)->(0,1008),0 [256,2] -- player sprite patterns copy
VDPCmd HMMM-IMP (88,560)->(200,16),0 [4,16] -- tile copies for diagonal scroll
VDPCmd HMMM-IMP (8,512)->(200,32),0 [4,16] -- ...
VDPCmd HMMM-IMP (8,512)->(200,48),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,64),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,80),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,96),0 [4,16]
VDPCmd HMMM-IMP (24,608)->(200,112),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,128),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,144),0 [4,16]
VDPCmd HMMM-IMP (88,512)->(200,160),0 [4,16]
VDPCmd HMMM-IMP (88,544)->(200,176),0 [4,16]
VDPCmd HMMM-IMP (88,528)->(200,192),0 [4,16]
VDPCmd HMMM-IMP (88,560)->(200,208),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,224),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,240),0 [4,16]
VDPCmd HMMM-IMP (8,512)->(200,0),0 [4,16]
VDPCmd HMMM-IMP (84,544)->(212,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(228,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(244,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(4,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(20,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(36,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(52,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(68,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(84,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(100,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(116,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(132,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(148,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(164,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(180,16),0 [4,16]
VDPCmd HMMM-IMP (4,512)->(196,16),0 [4,16]
VDPCmd YMMM-IMP (0,984)->(0,976),0 [256,4] -- sprite colour table copy
Those 4x16 copies should take about 700 cycles to complete.
Might be nice to extend the profiler script at some point to show a chart like this, including a bar for the VDP commands executing…
Sorry, I didn't notice this: "VDPCmd HMMM-IMP (88,560)->(200,16),0 [4,16] -- tile copies for diagonal scroll"
So that's what those are!
The VDP continues to surprise me with its slowness. I think the overhead is bigger because of the small width x "larger" height format...
It is moving a byte every 20 Z80 T-states. The Z80 can do better with unrolled OUTI :-(
That is a problem, but there is no solution. The mask only gives you an 8px black bar, so the scroll process only spares you the work of drawing a 4px-wide strip. Maybe 16px of mask would have been better. And with blocks this small, the only thing you can fit in between those 4x16 copies is setting up the next command.
If the mask is 16 pixels wide, I would still need to do sixteen 16x16 copies within one frame. A mask of 32 would give me sufficient buffer to do a single copy per pixel scrolled, but that would be a bit too much horizontal screen space sacrificed. For that, one should just use the 2-page horizontal scroll mode, at the cost of VRAM. So my use of 4x16 copies is due to my choice not to use the 2-page scroll mode.
But I’m not too bothered by the copy speed currently. It’s nice at least that it executes in parallel, and the CPU isn’t waiting for it excessively much, just a bit. If I had to move all that data with the CPU (which I did consider earlier) I would’ve been in a lot more trouble with my frame time!
Thinking about this a bit more… I just measured that currently each tile spends ~900 cycles in VDP command set-up, wait and execution code. If I replaced that with a CPU->VRAM transfer via HMMC, and stored the tiles in memory as both 16x4 and 4x16 data, I could OUTI those 32 bytes straight; with the math and paging overhead it would probably end up at a comparable speed.
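For illustration, a minimal, untested sketch of the transfer part (TileData is a hypothetical label; it assumes the HMMC command has already been set up and started, and that R#17 has been pointed at the colour register R#44 with auto-increment disabled, so every write to port #9B feeds the next data byte to the command):

    ld hl,TileData      ; 32 bytes of 4x16 tile data (hypothetical label)
    ld c,#9B            ; indirect register access port, pointing at R#44
    ; strictly the TR flag in s#2 should be polled before each byte;
    ; this sketch just shows the raw OUTI stream
    REPT 32             ; repeat directive; syntax depends on the assembler
    outi                ; 18 T-states per byte on MSX (16 + 2 M1 waits)
    ENDR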
So for me I don’t think it’s interesting to pursue that approach currently. However maybe it’s a more attractive proposition in screen 7, 8 or 11, because you wouldn’t need to store the tile set in VRAM (at the cost of 128K RAM/ROM memory for 256 tiles).
I think the overhead is bigger because of the small width x "larger" height format...
The overhead of 4x16 copies (640 cycles each, so 2560 cycles to cover a 16x16 area) compared to 16x16 copies (2048 cycles) is 25%. [1]
I did a test with a custom ISR, using im 2, to see what happens if the ISR always makes sure s#2 is selected, as it reduces the VDP wait code to something much tighter:
; with our custom ISR, s#2 is always selected instead of s#0
WaitReady:
    in a,(VDP_PORT_1)
    rra
    jr c,WaitReady
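For comparison, without the ISR keeping s#2 selected, every poll has to re-point the status register and restore s#0 afterwards, roughly like this (a generic sketch, not necessarily the exact code in VDPCommand_Execute_HL):

WaitReadySlow:
    ld a,2
    out (VDP_PORT_1),a
    ld a,15|128
    out (VDP_PORT_1),a  ; point the status register pointer at s#2
    in a,(VDP_PORT_1)   ; read s#2
    rra                 ; CE (bit 0) -> carry
    ld a,0              ; (not xor a, that would clear the carry flag)
    out (VDP_PORT_1),a
    ld a,15|128
    out (VDP_PORT_1),a  ; restore s#0 for the regular interrupt handler
    jr c,WaitReadySlow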
The pre-selection speeds things up quite a bit, so you can try it out:
ISR (I placed it behind Application_Main):
    org #4040
Application_ISR:
    push af
    xor a
    out (VDP_PORT_1),a  ; select s#0
    ld a,15|128
    out (VDP_PORT_1),a
    in a,(VDP_PORT_1)   ; read s#0
    and a               ; does INT originate from VDP (b7=1 - True)
    ld a,2              ; select s#2 for fast VDP command ready checks
    out (VDP_PORT_1),a
    ld a,15|128
    out (VDP_PORT_1),a
    jp p,notFromVDP     ; no vdp interrupt
    push bc
    push de
    push hl
    push ix
    push iy
    exx
    ex af,af'
    push af
    push bc
    push de
    push hl
    call H.TIMI
    pop hl
    pop de
    pop bc
    pop af
    ex af,af'
    exx
    pop iy
    pop ix
    pop hl
    pop de
    pop bc
notFromVDP:
    pop af
    ei
    reti
Custom ISR setup code:
; custom ISR
    ld a,#e0            ; ivec table @ #e000..#e100
    ld i,a
    ld bc,256
    ld h,a              ; #e000
    ld l,c
    ld d,h              ; #e001
    ld e,b
    ld (hl),#40         ; ISR routine @ #4040
    ldir
    im 2
Cool, I tried it, looks like it saves 2% CPU time (0.67 ms, 2400 cycles) when scrolling diagonally. Good info.
I happened to implement a custom ISR with IM 2 two days ago, for screensplits… Pre-selecting status register 1 may be needed to get those tight, but let’s see.
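For illustration, a sketch of what I mean (hypothetical labels, not actual code from my engine): with s#1 pre-selected by the ISR, the line interrupt check for a screensplit also shrinks to a single IN:

SplitCheck:
    in a,(VDP_PORT_1)   ; read s#1 (pre-selected); reading also resets FH
    rra                 ; FH (bit 0) -> carry
    jr nc,NotLineInt    ; hypothetical label: not a line interrupt
    ; ...reprogram the VDP registers for the split here...
NotLineInt: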
Playtime is over, as my long Easter weekend has come to an end and real-life games programming/optimizing is back on the table
Good luck with your project; it has real potential. (My own attempt at such an engine a long, long time ago tried a two-layer approach, so the sprites could move behind trees/walls, etc. This engine would update the sprite pattern tables depending on what was in the foreground. Also, layer drawing was made cheaper by allowing certain tiles to be simple LINE,BF type VDP commands, but that's something that has to fit the art style of the game.)
My final results (43+% free CPU time when scrolling diagonally):
The boost to idle mode is because you have a minor bug in your code: you still do tile collision testing when the player doesn't move, and I fixed it in my local version.
Playtime is over, as my long Easter weekend has come to an end and real-life games programming/optimizing is back on the table
Haha, cool though that you checked it out.
Good luck with your project; it has real potential. (My own attempt at such an engine a long, long time ago tried a two-layer approach, so the sprites could move behind trees/walls, etc. This engine would update the sprite pattern tables depending on what was in the foreground. Also, layer drawing was made cheaper by allowing certain tiles to be simple LINE,BF type VDP commands, but that's something that has to fit the art style of the game.)
Thanks. I’ve thought about that, but it seems a bit complicated to do with sprites. One approach would be to store bitmasks for the terrain and manually mask out the sprite patterns every frame. Seems expensive though, since it needs to process 256 bytes of pattern data on the 60 fps loop.
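For illustration, a rough sketch of that masking idea (hypothetical labels, not actual code; the masked patterns would still have to be copied to the sprite pattern table in VRAM afterwards):

MaskSpritePatterns:
    ld hl,SpritePatterns    ; 256 bytes of sprite pattern data (hypothetical)
    ld de,TerrainMask       ; 256-byte mask: 1 = visible, 0 = hidden by foreground
    ld b,0                  ; 256 iterations
MaskLoop:
    ld a,(de)
    and (hl)                ; clear sprite pixels covered by foreground terrain
    ld (hl),a
    inc hl
    inc de
    djnz MaskLoop
    ret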
For now my plan is to simply disallow the player from moving behind objects. If I want to have some narrow overhang in specific places (like an archway over a 1-tile-wide passage), I can put a static sprite object there. Up to four sprites per line remain for those kinds of things.
My final results (43+% free CPU time when scrolling diagonally):
Nice. So let’s see if I understand correctly how you got there...
3% frame time from inlining stuff. Especially in the sprite attribute update it seems there’s over a whole percent gained. I think it might be a good idea to start using macros for my getter functions (see the sketch below).
2% frame time from copy wait loops defaulting to status register 2.
2% frame time from... other general optimisations. I especially see 1% gain in the player move update / collision handling. More inlining?
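To illustrate the getter macro idea from above (a hypothetical example, not code from my engine; macro syntax depends on the assembler):

; As a subroutine, every access costs call (18) + ld (14) + ret (11) T-states:
GetPlayerX:
    ld a,(player.x)
    ret

; As a macro it expands inline at each call site, leaving only the ld:
GetPlayerX_inline: MACRO
    ld a,(player.x)
    ENDM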
The boost to idle mode is because you have a minor bug in your code: you still do tile collision testing when the player doesn't move, and I fixed it in my local version.
Actually that’s intentional... I don’t optimise for those cases and just always run it, because it reduces the amount of variation in code paths, and I need to optimise for the worst case anyway. I prefer things to just always run so that I have a constant budget and the meters don’t jump so much. I might even start doing all the scrolling copies when idling!
Maybe at some point if I want to do some framerate-dependent things (like doing more tile animations if you’re idle) I would optimise those things, but for now I think it’s more beneficial to have the frame budget allocation be constant, so I can be sure there will be no frame drops.
One thing that can be optimised though is that collision only needs to be checked every other frame, because player input is sampled on the "slow tick", so as far as the game is concerned the player sprite only moves in steps of 4 pixels per frame. Should halve the time spent there, saving about 3%.
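A minimal sketch of that (hypothetical labels, not the actual routine):

    ; Only check tile collision every other frame; since input is sampled on
    ; the slow tick, the in-between frame cannot change the outcome.
    ld a,(frameCounter)     ; hypothetical 60 fps frame counter
    rrca                    ; bit 0 -> carry
    jr c,SkipCollision      ; odd frame: skip the check
    call CheckTileCollision ; hypothetical existing collision routine
SkipCollision: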