I believe that well over ten years ago suffix arrays were already becoming preferable to suffix trees in some fields like bioinformatics, for reasons like lower memory usage. They represent mostly the same thing, but I tried googling for suffix arrays in relation to compilers and found nothing of interest. Maybe you'd have better luck, but I suppose suffix trees are easier abstractions to handle.
Implementing algorithms from the original articles tends to be harder than working from some online lecture notes. I think the latter is the approach I used when I implemented Ukkonen's algorithm in my previous life. (Plus I once implemented an algorithm from the original mid-80s article, only to find that the algorithm didn't work and that a fix had been published in a later issue of the journal.)
Hehe, well, I managed to make my way through. But thanks for the tip, I’ll broaden my search a bit. I notice that finding the right keywords (like “procedural abstraction” and “code compaction” rather than “compression”) also makes a lot of difference in what you can find. Amongst others I found a survey of different techniques and a thesis about code compaction (not read yet), which seemed easier to read than papers like Ukkonen’s, so there are also differences there.
I think indeed that if lowering the memory use of the compressor were a goal, suffix arrays would be interesting. I’m sure they see many applications in databases, search engines, DNA analysis, etc.
However, since I’m not working with big data I haven’t looked into them much; it seemed to me that working with the more explicit tree structure is more convenient. Just like working with a tree structure is generally more convenient than working with a tree postfix-encoded as a sequence. (Though I found several articles about detecting repeats in ordered trees by postfix-encoding them and feeding them into a suffix tree.)
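To make that last parenthetical concrete, here’s a toy sketch (my own illustration, not taken from those articles) of postfix-encoding an ordered tree so that repeated subtrees show up as repeated substrings, which a suffix tree or array can then find:

```python
from dataclasses import dataclass

@dataclass
class Node:
    label: str
    children: list

def postfix(node, out):
    """Postfix-encode an ordered tree: children first, then (label, arity).
    Repeated subtrees appear as repeated substrings of this encoding."""
    for child in node.children:
        postfix(child, out)
    out.append((node.label, len(node.children)))
    return out

# a(b(c), b(c)): the two identical b(c) subtrees repeat in the encoding
tree = Node("a", [Node("b", [Node("c", [])]), Node("b", [Node("c", [])])])
print(postfix(tree, []))
# -> [('c', 0), ('b', 1), ('c', 0), ('b', 1), ('a', 2)]
```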
As long as the algorithms are O(n) or O(n log n) it’s probably good enough for my purposes.
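For reference, the suffix array idea itself is simple even if the O(n) constructions aren’t. A deliberately naive Python sketch (nowhere near linear time, just to show the concept): sort all suffixes, and the longest repeated substring is the longest common prefix of two neighbouring entries.

```python
def longest_repeat(s):
    """Naive suffix array sketch: sort the suffixes, then the longest
    repeated substring is the longest common prefix of two neighbours."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # suffix array
    best = ""
    for a, b in zip(sa, sa[1:]):
        x, y = s[a:], s[b:]
        k = 0
        while k < min(len(x), len(y)) and x[k] == y[k]:
            k += 1                                    # LCP of neighbours
        if k > len(best):
            best = x[:k]
    return best

print(longest_repeat("abcabxabcd"))  # -> "abc"
```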
Btw I found a project called Squeeze, referenced in a paper, which does code compaction.
Oh wow, didn't realize there's so much literature on the topic! In one of the papers on the Squeeze project they claim a 30% reduction in code size on average, which is quite impressive! (MDL gets 1% at best!) I suspect savings for binaries written directly in assembler will be lower, as there'll be much less boilerplate code than in those generated from C/C++, but still!
I implemented a first version which just does a single repetition-elimination pass on the “stream of commands” type music files generated by my gngplay project. The repeat finding is still very basic and not optimised, though it runs pretty quickly. But I wanted to share my first results of running it on the Ghosts ’n Goblins music files:
| Track | Original | Compressed | Reduction |
|-------|----------|------------|-----------|
| 01    | 132      | 73         | 45%       |
| 02    | 657      | 471        | 28%       |
| 03    | 286      | 206        | 28%       |
| 04    | 2052     | 1355       | 34%       |
| 05    | 1863     | 923        | 50%       |
| 06    | 486      | 329        | 32%       |
| 07    | 413      | 249        | 40%       |
| 08    | 1817     | 1088       | 40%       |
| 09    | 763      | 513        | 33%       |
| 10    | 2130     | 1206       | 43%       |
| 11    | 2742     | 1707       | 38%       |
| 12    | 1214     | 704        | 42%       |
| 13    | 3449     | 2131       | 38%       |
| 14    | 392      | 304        | 22%       |
| 15    | 2269     | 1595       | 30%       |
| 16    | 831      | 471        | 43%       |
| 17    | 304      | 206        | 32%       |
| 18    | 653      | 491        | 25%       |
| 19    | 3782     | 2076       | 45%       |
| 20    | 519      | 397        | 24%       |
| 21    | 1722     | 1284       | 25%       |
| 22    | 473      | 339        | 28%       |
| 23    | 209      | 169        | 19%       |
| 24    | 890      | 465        | 48%       |
| 25    | 272      | 230        | 15%       |
| 26    | 80       | 70         | 12%       |
| 27    | 170      | 108        | 36%       |
| Total | 30570    | 19161      | 37%       |
Note:
- The sizes are expressed as nr. of commands
- Call / return command overhead for repeats is not counted
- It only goes 1 level deep; repeats inside repeats are not found
So although I did not count certain factors contributing to the final file size, there’s also still a lot of missed opportunity. I also haven’t iterated on various strategies to improve the compression, like different pattern search orderings or reverse prefix trees.
Note that these results do not directly translate to assembler files that have already been procedurally abstracted by the programmer.
I note from the above table that, generally speaking, the size reduction improves as the files get bigger. So I concatenated the data of all the tracks together, and indeed it gets a little smaller: 30570 -> 18216 commands, a reduction of 40%. Mh. Longer music means more repetition I guess, and the tracks don’t share too much with each other.
My current implementation functionally matches the “LFS” algorithm in the paper Simple Linear-Time Off-Line Text Compression by Longest-First Substitution. They also describe an improvement called “LFS2”, which is what I describe in my third point above. For their test corpus they get a 54% size reduction with LFS and 80% with LFS2 (i.e. another 55% or so relative to LFS’s output).
So if I extrapolate this to my results: the size reduction for the music files is 37%, so for the LFS2 algorithm I expect an additional size reduction of about 37% on what remains (0.63 × 0.63 ≈ 0.40 of the original), or roughly a 60% reduction in total. Looking at the sizes of the repeats found (several large ones are internally still uncompressed), such a further reduction would be in line with my expectations.
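For the curious, the longest-first substitution idea itself is easy to sketch. This is a brute-force quadratic illustration in Python, not the paper’s linear-time suffix-tree algorithm and not my actual code; the names are made up:

```python
def find_longest_repeat(seq, min_len):
    """Longest sublist occurring at least twice without overlap (brute force)."""
    n = len(seq)
    for length in range(n // 2, min_len - 1, -1):
        first = {}
        for i in range(n - length + 1):
            key = tuple(seq[i:i + length])
            if key in first and i - first[key] >= length:
                return list(key)
            first.setdefault(key, i)
    return None

def substitute(seq, pattern, name):
    """Replace non-overlapping occurrences of pattern (left to right) by name."""
    out, i, m = [], 0, len(pattern)
    while i < len(seq):
        if seq[i:i + m] == pattern:
            out.append(name)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out

def lfs(commands, min_len=3):
    """Repeatedly pull out the longest repeat as a new 'subroutine'.
    The rule bodies are left untouched, which is exactly why repeats
    inside repeats are missed (LFS2 would process the bodies too)."""
    seq, rules = list(commands), {}
    while True:
        pat = find_longest_repeat(seq, min_len)
        if pat is None:
            break
        name = f"call sub{len(rules)}"
        rules[name] = pat
        seq = substitute(seq, pat, name)
    return seq, rules

# lfs(list("ABCABCABCD"))
# -> (['call sub0', 'call sub0', 'call sub0', 'D'], {'call sub0': ['A', 'B', 'C']})
```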
p.s. The MFFS (Re-pair) method it mentions works bottom-up from shortest rather than longest matches; see here (chapter 7). It does not use a suffix tree but is its own stand-alone algorithm, so it may be simpler to implement, with comparable results.
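A rough Python sketch of that Re-pair idea, ignoring the clever pair-tracking data structures that make the real thing linear-time:

```python
from collections import Counter

def repair(seq):
    """Re-pair sketch: repeatedly replace the most frequent adjacent pair
    with a fresh symbol until no pair occurs twice. Bottom-up from pairs,
    in contrast with the longest-first approach above."""
    seq, rules = list(seq), {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        sym = f"R{len(rules)}"
        rules[sym] = list(pair)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

# repair(list("abab")) -> (['R0', 'R0'], {'R0': ['a', 'b']})
```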
This is very interesting! Thanks for the update Grauw! I have not yet had a chance to read through the algorithm in detail, but from the explanations you've been putting here, it seems very related to adaptive dictionary compression methods like LZW, right? Trying to find common subsequences. So, in addition to using it for compressing code, I wonder if it could also be used as a compression algorithm for data rather than commands.
Also, I'm curious to see how the call/ret overhead would play with the numbers. However, even if the compression goes down from 35% to 20% or even 10% when including that overhead, that's still much better than the current levels of compression MDL can do, which I'd say are in the single-digit percentages. So, this is all very promising!
Yeah LZ78 (which LZW is based on) is related, although from what I understand it’s less optimal than LFS2 or Re-pair because the dictionary itself contains duplication due to the way it is constructed (it is not irreducible).
These off-line dictionary compression techniques (which form a context-free grammar) apply to any type of data, but due to their hierarchical nature they are also suitable for code. They do not need a decompression buffer and can be processed straight from ROM, while on-line LZ77-based algorithms require RAM because the dictionary is built on the fly.
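To make the “straight from ROM” point concrete: decompression of such a grammar is just walking it with a call stack. A small Python sketch, assuming the seq/rules shape from the sketches above:

```python
def play(seq, rules, emit):
    """Walk a compressed command sequence directly, expanding 'subroutine'
    symbols recursively. No decompression buffer is needed; the only extra
    state is the call stack (here: Python recursion)."""
    for sym in seq:
        if sym in rules:                    # a call: descend into the rule body
            play(rules[sym], rules, emit)
        else:                               # a plain command: emit / execute it
            emit(sym)

out = []
play(['call sub0', 'D'], {'call sub0': ['A', 'B', 'C']}, out.append)
print(out)  # -> ['A', 'B', 'C', 'D']
```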
Some ret overhead and call stack depth can be reduced by eliminating tail calls: a trailing call + ret becomes a jump. Ordering the subroutines such that the target directly follows the subroutine that tail-calls it lets you eliminate the tail jump as well, by falling through. The opportunities for tail call optimisation might be improved by reversing the input before compression.
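As a toy illustration of the tail call part (commands represented as plain strings here, purely hypothetical mnemonics):

```python
def eliminate_tail_call(body):
    """Peephole sketch: a subroutine body ending in 'call X' + 'ret' can end
    in 'jp X' instead, saving the ret command and one stack level."""
    if len(body) >= 2 and body[-1] == "ret" and body[-2].startswith("call "):
        return body[:-2] + ["jp " + body[-2][len("call "):]]
    return body

print(eliminate_tail_call(["cmd1", "cmd2", "call sub1", "ret"]))
# -> ['cmd1', 'cmd2', 'jp sub1']
```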
About the percentages, note that my input data contains ample repetition, while in program code the programmer has already factored out repetitions as subroutines, and repeats that conditionally branch or unbalance the call stack must be split or rejected. Nevertheless I think typical code still contains repetition in a bunch of places.
p.s. Given a program execution trace, suffix trees can also be used to identify hot code paths.
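(A crude stand-in for that, counting fixed-length windows of a trace instead of using an actual suffix tree, just to show what hot path detection amounts to:)

```python
from collections import Counter

def hot_paths(trace, length=4, top=3):
    """Count every window of `length` consecutive addresses in an execution
    trace; the most frequent windows are candidate hot paths. A suffix tree
    would cover all lengths at once in linear time; this brute force does not."""
    windows = (tuple(trace[i:i + length]) for i in range(len(trace) - length + 1))
    return Counter(windows).most_common(top)
```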
Hi optimizers!
Back in July I hinted in a reply to this thread that GNU binutils for z80 is usable.
So, as an exercise, I ported cobinee's wonderful (SCC sound driver powered) LR to GNU assembly for z80.
Here is a .zip of the ported program, worked out on a Lubuntu laptop with GNU binutils 2.35.
I know I went over time and this trail is cold now. But better late than never. :P
Oh, interesting! Other than the assembler, which GNU tools support z80? Can you use gprof or gdb with z80 binaries? (And if you do, do you run them directly on MSX or on the Linux/Mac/Windows side?)
@santiontanon: the GNU as cross-assembler for z80 does not support compiling with `-pg`, and the gprof cross-profiler would need a way to push data from target to host, such as a semihosting mechanism. So it is not enabled yet.
gdb for z80 is not usable yet; Sergey Belyashov a.k.a. b-s-a is actively contributing to it, among other volunteers.
Ciao,
Giacomo
Old numbers without repeats-in-repeats on the left (slightly updated), and numbers that also include compressing repeats inside repeats on the right.
| Track | Original | Compressed (1 level) | Reduction | Compressed (nested) | Reduction |
|-------|----------|----------------------|-----------|---------------------|-----------|
| 01    | 132      | 68                   | 48%       | 58                  | 56%       |
| 02    | 657      | 456                  | 31%       | 318                 | 52%       |
| 03    | 286      | 195                  | 32%       | 140                 | 51%       |
| 04    | 2052     | 1335                 | 35%       | 935                 | 54%       |
| 05    | 1863     | 910                  | 51%       | 514                 | 72%       |
| 06    | 486      | 317                  | 35%       | 226                 | 53%       |
| 07    | 413      | 241                  | 42%       | 95                  | 77%       |
| 08    | 1817     | 1072                 | 41%       | 484                 | 73%       |
| 09    | 763      | 495                  | 35%       | 364                 | 52%       |
| 10    | 2130     | 1190                 | 44%       | 519                 | 76%       |
| 11    | 2742     | 1676                 | 39%       | 1085                | 60%       |
| 12    | 1214     | 695                  | 43%       | 319                 | 74%       |
| 13    | 3449     | 2116                 | 39%       | 1247                | 64%       |
| 14    | 392      | 282                  | 28%       | 230                 | 41%       |
| 15    | 2269     | 1578                 | 30%       | 1221                | 46%       |
| 16    | 831      | 456                  | 45%       | 126                 | 85%       |
| 17    | 304      | 198                  | 35%       | 175                 | 42%       |
| 18    | 653      | 476                  | 27%       | 366                 | 44%       |
| 19    | 3782     | 2069                 | 45%       | 742                 | 80%       |
| 20    | 519      | 382                  | 26%       | 328                 | 37%       |
| 21    | 1722     | 1250                 | 27%       | 1024                | 41%       |
| 22    | 473      | 325                  | 31%       | 245                 | 48%       |
| 23    | 209      | 157                  | 25%       | 120                 | 43%       |
| 24    | 890      | 461                  | 48%       | 166                 | 81%       |
| 25    | 272      | 216                  | 21%       | 188                 | 31%       |
| 26    | 80       | 64                   | 20%       | 58                  | 27%       |
| 27    | 170      | 100                  | 41%       | 90                  | 47%       |
| Total | 30570    | 18780                | 39%       | 11383               | 63%       |
Note:
- The sizes are expressed as nr. of commands
- Call / return command overhead for repeats is not counted
So this matches my earlier prediction of “an equal additional size reduction”.
When I zip the original files it also gives a 63% size reduction (from 60k to 22k). So on the surface the compression is comparable. But of course I didn’t count the cost of the call / return commands, and zip also applies Huffman compression to further reduce the size. Still, I think it’ll get pretty close and give good compression.