Ukkonen’s suffix tree algorithm in plain English?
|February 26, 2012||Posted by forumadmin under TechQns||
I feel a bit thick at this point. I’ve spent days trying to fully wrap my head around suffix tree construction, but because I don’t have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I’ve found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.
A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I’m sure.
For reference, here’s Ukkonen’s paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
My basic understanding, so far:
- I need to iterate through each prefix P of a given string T
- I need to iterate through each suffix S in prefix P and add that to tree
- To add suffix S to the tree, I need to iterate through each character in S. Each iteration either walks down an existing branch that starts with the same set of characters C in S, potentially splitting an edge into descendant nodes when I reach a differing character in the suffix, or, if there is no matching edge to walk down for C, creates a new leaf edge for C.
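The steps above can be sketched as a naive construction. This is only an illustration under stated assumptions, not Ukkonen's algorithm: it skips the outer loop over prefixes and inserts each suffix of the whole string once, and it appends a `$` terminator so that every suffix ends at its own leaf. The names `Node`, `insert_suffix`, `build_naive_suffix_tree`, `count_leaves` and `contains` are hypothetical and do not come from the paper or from any linked source code.

```python
class Node:
    """A tree node; `edges` maps an edge's first character to (label, child)."""

    def __init__(self):
        self.edges = {}


def insert_suffix(root, suffix):
    """Walk down from the root, splitting an edge or adding a leaf as needed."""
    node = root
    while suffix:
        first = suffix[0]
        if first not in node.edges:
            # No matching edge to walk down: create a new leaf edge.
            node.edges[first] = (suffix, Node())
            return
        label, child = node.edges[first]
        # Count how many leading characters of the edge label match the suffix.
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k == len(label):
            # The whole edge matched: keep walking from the child node.
            node, suffix = child, suffix[k:]
            continue
        # The match stopped inside the edge: split the edge at position k.
        middle = Node()
        middle.edges[label[k]] = (label[k:], child)
        node.edges[first] = (label[:k], middle)
        node, suffix = middle, suffix[k:]


def build_naive_suffix_tree(text, terminator="$"):
    """Insert every suffix of text + terminator (assumed absent from text)."""
    text += terminator
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text[start:])
    return root


def count_leaves(node):
    """Count leaves; with a unique terminator there is one leaf per suffix."""
    if not node.edges:
        return 1
    return sum(count_leaves(child) for _, child in node.edges.values())


def contains(root, pattern):
    """Return True if pattern occurs as a substring of the indexed text."""
    node = root
    while pattern:
        if pattern[0] not in node.edges:
            return False
        label, child = node.edges[pattern[0]]
        k = min(len(label), len(pattern))
        if label[:k] != pattern[:k]:
            return False
        node, pattern = child, pattern[k:]
    return True


tree = build_naive_suffix_tree("abcabx")
print(count_leaves(tree), contains(tree, "cab"), contains(tree, "xa"))
```

Every suffix is walked from the root here, so the sketch does quadratic work in the worst case. It shows the walk-down, split and new-leaf cases in isolation, without suffix links or an active point.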
The basic algorithm appears to be O(n^2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen’s algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I’m having trouble understanding.
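As a rough check on the quadratic figure, the number of (prefix, suffix) pairs can be counted directly. The symbol n for the length of T is an assumption introduced here. The prefix of length i has exactly i non-empty suffixes, so:

```latex
\sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2} \;=\; \Theta(n^{2})
```

For a string of length 6 this gives 21 suffix insertions. The count covers only how many insertions happen, not what each one costs; the total running time also depends on how quickly each insertion point is located.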
I’m also having trouble understanding:
- exactly when and how the “active point” is assigned, used and changed
- what is going on with the canonization aspect of the algorithm
- why the implementations I’ve seen need to “fix” bounding variables that they are using
EDIT (April 13, 2012)
Here is the completed source code that I’ve written and output based on jogojapan’s answer below. The code outputs a detailed description and text-based diagram of the steps it takes as it builds the tree. It is a first version and could probably do with optimization and so forth, but it works, which is the main thing.
[Redacted URL, see updated link below]
EDIT (April 15, 2012)
The source code has been completely rewritten from scratch and now not only works correctly but also supports automatic canonization and renders a nicer-looking text graph of the output. Source code and sample output are at:
|Asked By – Nathan Ridley|