Z80Babel: C++, D, Rust, Zig and Fortran

Page 3/4

By geijoenr

Champion (352)

14-02-2022, 20:35

Right! So it is just a problem with the output syntax of the generated code not being compatible with the GNU assembler. With the patch provided in https://github.com/codebje/ez80-toolchain the whole toolchain actually does work!

Thanks a lot for the pointer Giangiacomo!

By Tarnyko

Resident (43)

17-02-2022, 13:14

Wow! That's really impressive. Being a Rust developer myself, any such backend is a huge step up!

My main Rust use case today is a blockchain that takes no more than 1 gigabyte of memory and a few dozen megabytes of base storage, with about the same volume of network traffic. A breeze! Just going to see how Z80Babel may help there [*].

[*] : couldn't resist doing it, don't feel compelled to answer my trolling self ^^
(there may be a way to port some libraries and write a little SymbOS toolkit though!)

By Ped7g

Resident (61)

22-02-2022, 17:46

so, I was a bit curious how I would imagine near-perfect C compiler output for the sieve_c_1, and wrote it in hand-assembly...

I'm almost done with it, just cleaning up the remaining kinks and testing it. I checked the repo, and there is now actually a "HAND_CODED_ASM" variant added, and it's
a) surprisingly close to what I have, although slightly different
b) not 100% correct... :)

I will post my version when I finish the cleanup, so for now I will just list the issues I see with the HAND_CODED variant:

- for work_size=1 it will erase the whole memory and crash (ldir with bc=0).
- for a work buffer perfectly aligned to the end of the 64 KiB address space, `work+work_size` will be 0x10000, causing the routine to never detect the end of the for loops (BC = 0x10000 & 0xFFFF == 0)
(I don't know what the MSX memory map looks like, but on the ZX the RAM is at 0x4000..0xFFFF, so having the work buffer at the end is acceptable).
- similarly, an odd work_size with work+work_size equal to 0xFFFF will make _begin_loop_1 infinite, as the +=2 will wrap the address to 0x0000, and that's < 0xFFFF.
- the inner loop could also wrap around infinitely for a work buffer too near the end of RAM, when work[i] is legal but work+i+i+i wraps around the address space (`add hl,de` ahead of the "; de 2 * i " comment)
- and again, after work[j]=1 the `add hl,de` is unprotected and may wrap around and cause an infinite loop

(and I know, because I have slightly different logic, but still ended up crashing the ZX emulator with my first versions because of these issues... before making the tests mathematically correct).
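The wrap-around cases listed above all come down to 16-bit register arithmetic. The following is only a rough model of that arithmetic, with Python used as a calculator; the function names are made up for this sketch and are not taken from the z80_babel sources or from either asm version, and the `p < end` loop shape is an assumption about how such a loop is typically written.

```python
MASK = 0xFFFF  # Z80 register pairs (BC, DE, HL) are 16 bits wide


def add16(a, b):
    """16-bit add as `add hl,de` leaves it in HL: the carry out of bit 15 is dropped."""
    return (a + b) & MASK


def ldir_iterations(bc):
    """Bytes copied by LDIR: BC is decremented first and only then tested for zero."""
    return bc if bc != 0 else 0x10000


def naive_loop_terminates(start, end, step, max_steps=0x20000):
    """Model `for (p = start; p < end; p += step)` with a 16-bit p (assumed loop shape)."""
    p = start
    for _ in range(max_steps):
        if not p < end:
            return True
        p = add16(p, step)
    return False  # still looping after more than a full trip around the address space


def guarded_loop_terminates(start, end, step, max_steps=0x20000):
    """Same loop, but it stops when the next step would reach or pass `end`."""
    p = start
    for _ in range(max_steps):
        if not p < end or end - p <= step:
            return True
        p = add16(p, step)
    return False


# A buffer ending exactly at the top of memory: the end pointer wraps to 0.
print(hex(add16(0xFC00, 0x0400)))
# An even pointer stepping by 2 can never reach the odd end 0xFFFF.
print(naive_loop_terminates(0xFFF0, 0xFFFF, 2), guarded_loop_terminates(0xFFF0, 0xFFFF, 2))
```

The guarded variant checks the remaining distance before adding, which is one way to keep the pointer from ever leaving the buffer; the actual asm versions may well test the carry flag after `add hl,de` instead.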

Hopefully I will manage to clean up my version today and post it; it caught me a bit by surprise how difficult it is to write this thing correctly with reasonably simple code. Anyway, that's the "what a C compiler could hypothetically output" attempt. After that I will try one more, doing real hand-assembly with all the dirty assumptions and shortcuts which are not part of the original C code and can't be derived from it, so somewhat cheating... just for comparison. My current attempt is ~100 bytes long, which still seems to be far ahead, judging by the older info captured in the screenshots. (I didn't run the latest version of the tests, as I don't have z88dk and the other tools installed and don't want to set it all up, so I'm just following the source and the committed results. :) )

By salutte

Master (162)

22-02-2022, 22:57

There are tons of updates, but my computer died just before my last commit and I'm trying to recover stuff.

@Ped7g, you are completely right, the asm is not fully equivalent to the C version. I coded it in haste as a quick and dirty reference, and as soon as I got it green I didn't look back! If you can send me your implementation (or shoot a pull request) I'd be super happy to replace my implementation with yours!

Some of the updates are:
1. Added quicksort and MD5 benchmarks.
2. Being able to compare multiple compiler flags and combinations.
3. It's way easier to add new tests (at the cost of some of the most horrible C preprocessor abuse ever).
4. Removed Fortran support, as I could not get either the avr or the z80 frontends to work.
5. Added footnotes to improve information.
6. Making it easier to reproduce by documenting all the dependencies better, and adding self-explanatory errors in case some tool is missing (it is still a huge pain to set up all pipelines though).
7. Replaced default headers for C++ and C for better compatibility.
8. Appending pre-compiled dependencies for Rust.
9. Improved readability and structure of the project.

By far, what I have been working on the most, with the help of Santi Ontañon, is the integration of MDL and evaluating its performance. I want this support to be rock solid before publishing a new update.

There is still a ton of things to do, including alternate versions of the new benchmarks and cleanup, but it is getting solid.

P.S. I do have three use cases for this tool. The main one is to integrate some C++ code into a game project I am working on. Then I want to use D and its compile-time calculations to populate constant tables. And finally, I want to test my deep learning framework (written in C++) on a Z80. Let's see...

By santiontanon

Paragon (1695)

23-02-2022, 05:15

Sorry to hear about your computer, I hope you can recover everything!

But omg, you want to test a deep learning framework on a Z80! lol! Well, maybe you can do inference with an already trained tiny model, could be fun! Many giant models these days use bfloat16 for the activations (at least in language, not sure about vision, but you'll know that better than me :) ). So, once trained, 16-bit floats are probably fine on a Z80 too :) But training is probably a no-no hahaha
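For reference, bfloat16 is simply the upper 16 bits of an IEEE 754 single-precision float (1 sign bit, 8 exponent bits, 7 mantissa bits), which is why converting to and from it is cheap even on small hardware. A minimal sketch, assuming plain truncation rather than round-to-nearest, with hypothetical helper names:

```python
import struct


def float_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern of x by truncating its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits and the top 7 mantissa bits


def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 pattern back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return x


bits = float_to_bfloat16_bits(3.14159265)
print(hex(bits), bfloat16_bits_to_float(bits))
```

Truncation loses a little accuracy compared with proper rounding, so a real converter would likely round; this only shows the bit layout.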

By salutte

Master (162)

23-02-2022, 05:42

Hahaha, inference of course, and I'll use a mix of floating-point representations: a very small one to store the model and a more precise one to do the aggregation. A couple of years ago we did a fun experiment to ultra-compress a DL model in a novel way, and we crushed a 9-layer network down to a 3-layer net. The size of the model is huge, but it uses comparatively few FLOPs, so it should be able to classify an image on an MSX in less than 24 hours, I hope :D But of course it's just a crazy pet project.

By Ped7g

Resident (61)

23-02-2022, 12:01

uh, toying with it too long... so here is my C-like hand written asm for sieve: https://gist.github.com/ped7g/c55bfa0d55ca13ce029549636cdd1de5
(syntax is official Zilog + sjasmplus directives, so may need patching to z88dk z80asm - I'm not familiar with those non-official syntax tools, but I guess that shouldn't be difficult to fix ... also the ABI is custom-tailored and to call it from C it would need some wrapper function transforming arguments to expected registers and preserve whatever has to be preserved).

What "C-like asm" means: I tried hard to use only information which can be machine-derived from the original `sieve_c_1` C source, i.e. a hypothetically ideal compiler could produce the code from the C source (unless I slipped somewhere and used some human-derived guess :) ), and the code respects all value types, so it should work correctly even when you throw a 63 KiB work buffer at it on some hypothetical platform with the full 64 KiB of RAM free.

BTW the "primes" and "work" buffer can overlap (work starting at primes+2) to maximise the number of primes returned.

I'm now thinking about writing a "real" asm version with all kinds of human assumptions and shortcuts, with extra constraints for the caller, optimised solely for performance and probably using the work buffer in a different way (only odd numbers), for further comparison. But it feels like the performance gain will not be that huge; the current code feels decent and I don't see too many extra tricks possible.

BTW2: this is surprisingly similar to asm version currently included in z80_babel, except that one can bug out for some input values as mentioned in my previous reply, my version should be more robust and so more compiler-like. I think my version may be somewhat slower due to extra checks of validity of pointers, but maybe not, some of the arithmetic is more streamlined, so maybe it will be almost on par.

By geijoenr

Champion (352)

23-02-2022, 18:24

Maybe not such a crazy idea! There is that recent trend of running models on the edge in small MCUs. At least it is totally possible to run models on a Cortex-M4, but that is quite a bit bigger than an R800 and can run Linux.

By Ped7g

Resident (61)

25-02-2022, 04:59

And this is version 2, not trying to imitate a hypothetical C compiler, but just going for something I would write in an asm-only project: https://gist.github.com/ped7g/1602d2e165850b55f6a52749d9811544

It departs from the sieve_c_1 algorithm by using the work buffer for odd numbers only (so it can check twice as many numbers with the same work_size).

With a 1024-byte buffer it produces 309 primes (like the C routine with a 2048-byte buffer in the z80_babel test run), and it takes about 46ms on a ZX Spectrum at 3.5MHz (it would probably be slightly slower on MSX due to extra T states on some instructions, probably around 50-55ms). The code is 138 bytes.
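The odd-only idea can be sketched in a few lines. This is a generic sieve written for illustration, not a transcription of sieve_c_1 or of the asm in the gist; the function names and the choice to start crossing out at p*p are assumptions of this sketch.

```python
def sieve_all(work_size):
    """Plain sieve: work[n] is the flag for the number n, for n in 0..work_size-1."""
    work = bytearray(work_size)  # 0 = possibly prime, 1 = known composite
    primes = []
    for n in range(2, work_size):
        if work[n] == 0:
            primes.append(n)
            for m in range(n * n, work_size, n):
                work[m] = 1
    return primes


def sieve_odd_only(work_size):
    """Odd-only sieve: work[k] is the flag for the odd number 2*k + 1."""
    work = bytearray(work_size)
    primes = [2] if work_size >= 2 else []  # 2 is the only even prime
    for k in range(1, work_size):  # k = 0 stands for 1, which is not prime
        if work[k] == 0:
            p = 2 * k + 1
            primes.append(p)
            # p*p is odd and sits at index (p*p - 1) // 2; moving the index
            # by p moves the number by 2*p, so even multiples are skipped.
            for m in range((p * p - 1) // 2, work_size, p):
                work[m] = 1
    return primes


print(len(sieve_odd_only(1024)), len(sieve_all(2048)))
```

Both calls cover the numbers below 2048, which is why a 1024-byte odd-only buffer yields the same 309 primes as a 2048-byte plain one.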

And in direct comparison with my previous C-like version, when tuning the work_size to fit single frame in emulator (20ms), the first version can produce first 95 primes, second version can produce first 150 primes, so about +57% speed up.

Unfortunately it's using hard-coded buffers and constants, so this one would be harder to incorporate into the babel project. I made it more to satisfy my own curiosity than to worry about fitting it in as a test case (but the first version at the previous link should be reasonably easy to add to the project, if somebody wants to).

And my personal curiosity (to compare with C compilers) is satisfied now, so I'm done with this (I will keep an eye on this thread for a few days in case there are any questions or comments). Cheers.

By salutte

Master (162)

25-02-2022, 18:41

Hi Ped7g, I got your first version running in z80babel! It's a little bit faster than mine :D

I will try to get the 2nd version running (I can cheat a little bit and give fixed addresses to the buffers).

Thanks!

Update: got it running without a problem!
Your updated algorithm takes 58ms per iteration, while your old algorithm takes 92ms and my algorithm takes 94ms. The next fastest are C/C++/D, which take 222ms.
