Cool! Did you make any more progress on this @Grauw?
As for whether the "program is a DAG" or not, @pgimeno, if I understand correctly this is a particular case, where the idea is to first linearize a program (or a section of it). So there are no loops, etc. Then you want to find a way to recreate the same sequence using the minimum possible space, identifying subroutines, etc. Hence the DAG.
Thanks, I misunderstood. I sometimes read too fast, sorry.
@santiontanon I made some fuzz tests for the suffix tree algorithm that I implemented, and it didn’t do very well in those. That StackOverflow explanation gives a nice intuition, but it misses some specifics. So I spent some time reading a chapter from a book by Gusfield (1997) that the explanation referenced (and more or less copies), gained some more understanding, and now it’s working.
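For anyone curious, the tests boil down to something like this minimal sketch (build_tree and contains are placeholder hooks here, not my actual implementation): generate random strings and check the tree against plain substring search as the ground truth.

```python
import random

def fuzz_suffix_tree(build_tree, contains, rounds=1000):
    for _ in range(rounds):
        s = "".join(random.choice("ab") for _ in range(random.randint(1, 20)))
        tree = build_tree(s)
        # every substring of s must be found in the tree...
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                assert contains(tree, s[i:j]), (s, s[i:j])
        # ...and random probes must agree with plain substring search
        probe = "".join(random.choice("ab") for _ in range(5))
        assert contains(tree, probe) == (probe in s), (s, probe)
```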
I think this weekend I’ll try to process a couple of test suffix trees to extract some useful data. E.g. the first challenge would be to convert something simple like XXXXXXXX to bb / b = aa / a = XX.
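Something along these lines should do it; a minimal Re-Pair-style sketch (just to illustrate the target output, not necessarily what I’ll end up using), which repeatedly replaces the most frequent adjacent pair with a fresh symbol:

```python
from collections import Counter

def compress(seq):
    rules = {}  # new symbol -> the pair of symbols it replaces
    next_name = iter("abcdefghijklmnopqrstuvwxyz")  # enough for a sketch
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more
        sym = next(next_name)
        rules[sym] = pair
        # rewrite the sequence, replacing non-overlapping occurrences
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

print(compress(list("XXXXXXXX")))
# (['b', 'b'], {'a': ('X', 'X'), 'b': ('a', 'a')})
# i.e. bb / b = aa / a = XX
```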
@pgimeno, thanks for those links :). The case I’m looking into this for is even simpler since I’m dealing with a fully linear stream of commands that I want to “subroutinise” to compress it. But the paper that I found that describes pretty much what I want to do actually applied the techniques to entire programs.
So it's not a DAG if it contains recursive calls. Fortunately, recursive calls are not that frequent in assembly.
Cool! Still busy with other projects/work these days, but I will follow your updates for when I go back to MDL.
I also have a question: imagine that you have a sequence ABCDEFGHIJ for which you want to build the suffix tree in order to generate a more efficient way to express this code. Until now we have been assuming that the order of the tokens in ABCDEFGHIJ is fixed, but in reality some instructions can be executed in arbitrary order; e.g. it might be that ABCD and ABDC are equivalent. Do you think that if, in addition to the sequence, you also have all the dependencies between the tokens (so you know which ones can be reordered), there is an easy way to exploit this to compress even more?
The second paper I found touches upon it briefly:
Instruction ordering
The underlying pattern-matching mechanisms that support this form of compression are very sensitive to instruction ordering. If two fragments contain the same operations, but one fragment executes them in a slightly different (but semantically equivalent) order, then they will not be identified by the suffix tree as identical.
The example in Figure 17 illustrates the problem. The two fragments in question both perform the same operations, but the order of the operations is slightly different (the first two instructions are swapped). The compression framework should be able to overcome minor ordering differences when locating repeated fragments to optimize.
One way to attack this problem is to reorder instructions within basic blocks in a “canonical” fashion prior to constructing the suffix tree (subject to dependence constraints, of course), in the hopes that this will eliminate unimportant ordering variations. A second approach is to build a suffix tree that in some sense encapsulates multiple instruction orderings within particular basic blocks.
Establishing a canonical ordering would also be my first intuition. I imagine if you have a graph of the dependencies, where each node is an instruction and the edges express the dependencies, whenever there’s an instruction node with multiple edges from it, you sort the edges by their instruction (using a predefined instruction ordinal), and then move each of them forward as much as possible in-order.
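As a rough sketch of what I mean, assuming the dependency graph is already built and ordinal is some predefined instruction ranking (both hypothetical here): a Kahn-style topological sort that always emits the ready instruction with the lowest ordinal, so semantically equivalent orderings collapse into one canonical form.

```python
import heapq

def canonical_order(instrs, deps, ordinal):
    # deps[i] = set of instruction indices that must execute before i
    indegree = {i: len(deps[i]) for i in range(len(instrs))}
    succs = {i: [] for i in range(len(instrs))}
    for i, ds in deps.items():
        for d in ds:
            succs[d].append(i)
    # ready instructions, ordered by (instruction ordinal, original index)
    ready = [(ordinal(instrs[i]), i) for i in indegree if indegree[i] == 0]
    heapq.heapify(ready)
    out = []
    while ready:
        _, i = heapq.heappop(ready)
        out.append(instrs[i])
        for j in succs[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                heapq.heappush(ready, (ordinal(instrs[j]), j))
    return out
```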
I think when I was searching for references to these two papers, I saw some papers discussing instruction reordering as well.
Interesting! I like the canonical order idea! And actually it could be useful in another way too: in my MSXDev entry, many parts of the code were compressed and only decompressed to RAM when needed, in order to pack as much as I could into the cartridge. I am sure compression would work best if instructions were reordered in my source code according to some canonical order, maximizing the chance of similar byte sequences when compressing. Maybe it'd only save a few bytes here and there, but still, every byte counts!
Yeah, definitely, good point. Also the peephole optimiser may perform better if it operates on the dependency graph to find the patterns.
Another nice paper found: Procedural Abstraction with Reverse Prefix Trees. It reverses the operations pushed into the suffix tree to also identify cases where you can jump into subroutines at different locations, whilst keeping the single return at the end. Additionally it has a detailed description of the fragment discard criteria that I was missing from the other papers, both for regular suffix trees and reversed ones. The paper also references other research in the same space.
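To illustrate why reversing helps (my own toy example, not taken from the paper): two fragments that end identically but begin differently share a prefix once reversed, which is exactly the common tail that a subroutine with multiple entry points and a single return can factor out.

```python
def common_tail(a, b):
    # walk backwards from the end while the instructions still match
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]

f1 = ["ld a,1", "ld b,2", "add a,b", "ret"]
f2 = ["ld a,7", "ld c,3", "ld b,2", "add a,b", "ret"]
print(common_tail(f1, f2))  # ['ld b,2', 'add a,b', 'ret']
```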
I found that using the search term “suffix tree functional abstraction” in Google Scholar gives several relevant results.
I’m now also able to read Ukkonen’s suffix tree paper. After going through the descriptions from Stack Overflow and Gusfield’s book, plus the implementation experience, I was finally able to understand it :).
I believe that well over ten years ago suffix arrays were starting to be preferred over suffix trees in some fields like bioinformatics, for reasons like lower memory usage. They represent mostly the same thing, but I tried googling for suffix arrays in relation to compilers and found nothing of interest. Maybe you'd have better luck, but I suppose suffix trees are easier abstractions to handle.
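For what it's worth, the correspondence is easy to see with a naive sketch (quadratic construction, for illustration only): sort the suffixes, and the common prefixes of lexicographically adjacent entries expose the same repeats a suffix tree would.

```python
def longest_repeat(s):
    # naive suffix array: sort suffix start positions lexicographically
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # common prefix length of two adjacent suffixes = a repeated fragment
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        if n > len(best):
            best = s[a:a + n]
    return best

print(longest_repeat("abcabcab"))  # "abcab"
```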
Implementing algorithms from the original articles tends to be harder than working from online lecture notes. I think the latter is the approach I used when I implemented Ukkonen's algorithm in my previous life. (Plus, I once implemented an algorithm from its original mid-80s article, only to find that the algorithm didn't work and that a fix had been published in a later issue of the journal.)