Interesting paper ... from way back in the days when assembly was still a thing
I also ran into space optimization problems for my music replayer, specifically on the tracks. It's MML based, so basically it's about finding patterns in strings, and building sequences with those patterns. I used Python and an extension module for regular expression (regex) handling, and it worked pretty well.
I'm guessing that switching from strings to bytes might not be as easy as it sounds in Python, but it might be a road to explore, so that it can find "patterns" in code and build "sequences" with them.
Although, working with MML was easy because you know in advance what the simplest pattern will be (in my case: 3 to 5 ASCII characters specifying the note and its length, for example "dA2" for a 16th A2 note). So the optimization engine is looking for repetitions of those patterns. It might be different for code. Hence the use of that "Patricia tree".
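To give an idea of what I mean, here's a rough Python sketch of the kind of thing my optimizer does (the note format is simplified to 3 characters and the track string is made up, but the principle is the same):

import re
from collections import Counter

# Hypothetical MML-like track where each note is 3 ASCII characters
# (length letter + note name + octave), e.g. "dA2" = a 16th A2.
track = "dA2dA2eC3dA2dA2eC3fG2fG2"

# One length letter, one note letter, one octave digit.
NOTE = re.compile(r"[a-z][A-G][0-8]")
notes = NOTE.findall(track)

# Count repeated runs of 2 to 4 consecutive notes; the most frequent
# runs are the candidates for turning into reusable sequences.
runs = Counter()
for size in range(2, 5):
    for i in range(len(notes) - size + 1):
        runs["".join(notes[i:i + size])] += 1

for pattern, count in runs.most_common(5):
    if count > 1:
        print(pattern, "occurs", count, "times")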
Somehow all the good research in our field was already done decades ago. But I think it’s not specific to assembly, it applies to all imperative programming languages.
Subroutines basically serve two purposes: one is as a programmer abstraction, the other is an explicit form of space optimisation (with the C++ “inline” keyword being the time optimisation version of it).
Ideally the programmer would only worry about the abstraction side of things, to let them express the program as intuitively as possible, and the compiler would deal with the optimisation. At most they would have to mark parts of code as “optimise for size” or “optimise for speed” (the hot code paths).
But as Z80 programmers we worry a lot about inlining and unrolling on the one hand (to save time), and making subroutines, fall-throughs and tail calls on the other (to save space). It’s a very manual process, while really the place and order of the subroutines doesn’t matter as long as the logic remains the same.
What makes this topic interesting is that it integrates speed and size optimisation into a general problem; it works both ways. The control flow is separated from the code, and can be freely manipulated by the optimiser to achieve either goal. Even as a linter it can be useful, to help programmers identify where they have code duplication, what could be better call hierarchies, and where changes would save a lot of space or time.
I just skimmed the paper, and it's a very cool idea! It'd be super cool to add this type of feature to MDL! As I mentioned above, I'm on a little "break" from MDL right now, but I'd like to pick it up again in a couple of weeks. The first thing is to finish SDCC support, which is 75% of the way there already. But next I'd like to start a new optimizer from scratch. I was thinking of a register optimizer, or a search-based optimizer, but this idea seems pretty cool, so I might prioritize it!
Beware that, as the author of the paper says, it has a negative impact on speed, since more CPU time is needed. The author says the gain comes only from the reduced loading time, thanks to the smaller file size. This gain will be null in most MSX cases, as most of the time it's a cartridge that's being used, with zero loading time. And even when loading from a disk file, for example, it's the speed of execution we're after.
I agree that it helps focus the programmer's attention on optimizing the bits of code that are "subroutined" as a result, but at the cost of 29 cycles (call+ret) times the number of times the subroutine is called. That's why unrolling is usually considered best practice.
But perhaps that's the end result: isolating bits of code that are repetitive and could be "subroutined", so that the programmer focuses on optimizing them to achieve better performance. And then ... unrolling them.
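Just to put rough numbers on that size/speed tradeoff, here is a quick back-of-the-envelope sketch (assuming call = 3 bytes, ret = 1 byte, and the 29 cycles for call+ret mentioned above; your exact figures may differ):

def abstraction_tradeoff(body_bytes, occurrences,
                         call_bytes=3, ret_bytes=1, call_ret_cycles=29):
    # Rough size/speed impact of replacing `occurrences` identical code
    # sequences of `body_bytes` bytes each with calls to one subroutine.
    inlined_size = occurrences * body_bytes
    abstracted_size = body_bytes + ret_bytes + occurrences * call_bytes
    bytes_saved = inlined_size - abstracted_size
    extra_cycles = occurrences * call_ret_cycles
    return bytes_saved, extra_cycles

# For example, a 10-byte sequence repeated 6 times:
saved, cost = abstraction_tradeoff(10, 6)
print(saved, "bytes saved,", cost, "extra cycles spent on call/ret")
# -> 31 bytes saved, 174 extra cycles spent on call/ret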
Of course the point with this is to optimise for size, not speed. This is useful for 16K or 32K cartridge games which need more room for levels, music, etc.
The nice thing is that the same process can be reversed to inline subroutines, I think. The structure for it, and thus the necessary information, is there to turn a program entirely into one without any calls. Or to apply it selectively.
Perhaps tying in to your register optimizer plans, another paper often referenced together with Fraser et al.’s paper: Enhanced Code Compression for Embedded RISC Processors
This paper explores one technique for reducing codesize—code compression during the late stages of compilation. The approach is conceptually simple; it builds on early work by Fraser et al. for Vax assembly code. Pattern-matching techniques identify identical code sequences (a “repeat”). The compiler then uses either procedural abstraction or cross-jumping to channel execution of the repeat through a single copy of the code. We extend this basic algorithm by relaxing the notion of “identical” to abstract away register names – a key enhancement when compressing code compiled with a graph-coloring register allocator.
I’m still reading it but it’s nice because it gives another take on Fraser’s algorithm and then builds on it. It also has a lot of useful references like for example to Ukkonen’s algorithm to build a suffix tree in O(n) time. Wonderful, O(n) is what I want for my build scripts.
I also found references to other papers about incorporating instruction reordering (maybe more useful for the Z80 as the instruction set is nonorthogonal). As is often the case with academic papers though, some are behind paywalls.
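To illustrate the register abstraction trick from that abstract, here’s a rough Python sketch of the general idea (toy virtual-register syntax, not the paper’s actual algorithm and not tied to any real assembler): register operands are renamed to placeholders in first-use order, so two fragments that only differ in register allocation get the same key.

import re

# Toy instruction fragments that differ only in register names.
frag_a = ["ld r1,(hl)", "add r1,r2", "ld (de),r1"]
frag_b = ["ld r4,(hl)", "add r4,r5", "ld (de),r4"]

REG = re.compile(r"\br\d+\b")   # toy virtual-register names: r1, r2, ...

def abstract_registers(fragment):
    # Rename registers to placeholders in first-use order, so fragments
    # that are identical up to register allocation get the same key.
    mapping = {}
    def rename(match):
        reg = match.group(0)
        mapping.setdefault(reg, "R%d" % len(mapping))
        return mapping[reg]
    return tuple(REG.sub(rename, insn) for insn in fragment)

print(abstract_registers(frag_a) == abstract_registers(frag_b))   # True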
Thanks for all of these papers Grauw! My research area is quite far from programming languages/systems, and I am not familiar with this literature at all. I will go through them asap!
At least you’re familiar with reading academic papers. The Ukkonen paper makes my eyes glaze over despite it claiming that it’s an intuitive approach! Luckily there is StackOverflow, and this answer here gives a very nice explanation of the algorithm.
Although I need to deal with only a subset of the problems for the case of compressing a “stream of register writes”, I’ll let you know how it goes.
I’ve implemented Ukkonen’s algorithm; it’s not too bad. I found a nice online visual demonstration: e.g. input “abcabxabcd” and you can step through the algorithm. Together with other descriptions it could help to develop an intuition for the algorithm.
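For intuition about what the finished tree gives you: any substring whose path is shared by two or more suffixes occurs more than once in the input. A naive O(n²) suffix trie (so not Ukkonen’s linear-time construction, just the same structure built the slow way) already shows that on the example input:

def repeated_substrings(text):
    # Build a plain suffix trie: insert every suffix, terminated by "$".
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:] + "$":
            node = node.setdefault(ch, {})

    repeats = []
    def count_suffixes(node, prefix):
        # Number of suffixes passing through this node; two or more
        # means `prefix` occurs at least twice in the text.
        total = 0
        for ch, child in node.items():
            total += 1 if ch == "$" else count_suffixes(child, prefix + ch)
        if prefix and total >= 2:
            repeats.append((prefix, total))
        return total
    count_suffixes(root, "")
    return sorted(repeats, key=lambda r: -len(r[0]))

print(repeated_substrings("abcabxabcd"))
# longest repeats first: ('abc', 2), ('ab', 3), ('bc', 2), ('a', 3), ...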
Now how to apply it, that’s what I’m trying to find a good description of :). The papers I quoted earlier kind of gloss over the details of how to process the suffix tree into an optimal solution. I think essentially a program is like an ordered tree that is traversed depth-first, and what we want to achieve is to convert this tree into a minimal directed acyclic graph (where each node pointed to by multiple others is a subroutine). I’m trying to think of it in a bit more mathematical terms to maybe find more results.
(Of course, in my case I can simplify the problem to just subroutines, and I don’t have to worry about there being jumps and such in the source data. Well, aside from the loop point, but that’s pretty easy to deal with by just inserting a unique marker; it will automatically go to the top level, and I don’t need to worry about stack overflow from jumping into subroutines.)
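For the tree-to-DAG step I’m thinking along the lines of hash-consing: give every subtree a canonical key and share the nodes that end up with the same key. A rough sketch of the idea (hypothetical node representation, not my actual code):

def share_subtrees(tree, pool=None):
    # tree is a (label, [children...]) tuple; identical subtrees are
    # collapsed into one shared node. Nodes that end up referenced from
    # more than one parent are the subroutine candidates.
    if pool is None:
        pool = {}
    label, children = tree
    shared_children = tuple(share_subtrees(c, pool) for c in children)
    key = (label, shared_children)
    return pool.setdefault(key, key)   # reuse an identical subtree if already seen

# Two structurally identical phrases under one root:
phrase1 = ("seq", [("a", []), ("b", []), ("c", [])])
phrase2 = ("seq", [("a", []), ("b", []), ("c", [])])
song = ("song", [phrase1, ("d", []), phrase2])
root = share_subtrees(song)
print(root[1][0] is root[1][2])   # True: both references share one node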
I think essentially a program is like an ordered tree that is traversed depth-first, and what we want to achieve is to convert this tree into a minimal directed acyclic graph (where each node pointed to by multiple others is a subroutine).
I don't think a program is a DAG. Loops are cycles in the graph.
Besides, don't aim for minimal, it's hopeless. https://en.wikipedia.org/wiki/Full_employment_theorem
I’m trying to think of it in a bit more mathematical terms to maybe find more results.
Look up Control Flow Graph. Also, recommended read:
http://staff.cs.upt.ro/~chirila/teaching/upt/c51-pt/aamcij/7...
Chapter 8.2 describes the basics of CFGs.
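The basic-block part of it essentially comes down to finding the "leaders" (the first instruction, every branch target, and the instruction after a branch), cutting the code at those points, and then adding fall-through and branch edges. Very roughly, in Python (toy instruction format, not tied to any real assembler):

# Toy CFG construction on a tiny instruction list.
code = [
    ("label", "loop"),
    ("dec",   "b"),
    ("jp_nz", "loop"),      # conditional jump back to "loop"
    ("ret",   None),
]
labels = {arg: i for i, (op, arg) in enumerate(code) if op == "label"}
jumps = {"jp", "jp_nz"}

# 1. Find the leaders: first instruction, branch targets, instruction after a branch.
leaders = {0}
for i, (op, arg) in enumerate(code):
    if op in jumps:
        leaders.add(labels[arg])
    if op in jumps | {"ret"} and i + 1 < len(code):
        leaders.add(i + 1)

# 2. Cut into basic blocks and record the edges between them.
starts = sorted(leaders)
blocks = {s: code[s:e] for s, e in zip(starts, starts[1:] + [len(code)])}
edges = []
for start, block in blocks.items():
    last = start + len(block) - 1
    op, arg = code[last]
    if op in jumps:
        edges.append((start, labels[arg]))       # taken branch
    if op != "jp" and op != "ret" and last + 1 < len(code):
        edges.append((start, last + 1))          # fall-through
print(blocks)
print(edges)   # [(0, 0), (0, 3)]: the loop block branches to itself and falls through to ret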