Wavelet Tree

The Wavelet Tree is a succinct data structure to store strings in compressed space. It generalizes the $\mathbf {rank} _{q}$ and $\mathbf {select} _{q}$ operations defined on bitvectors to arbitrary alphabets.

Originally introduced to represent compressed suffix arrays,^[1] it has found application in several contexts.^[2]^[3] The tree is defined by recursively partitioning the alphabet into pairs of subsets; the leaves correspond to individual symbols of the alphabet, and at each node a bitvector stores whether a symbol of the string belongs to one subset or the other.

The name derives from an analogy with the wavelet transform for signals, which recursively decomposes a signal into low-frequency and high-frequency components.

Properties[edit]

Let $\Sigma$ be a finite alphabet with $\sigma ={|\Sigma |}$ . By using succinct dictionaries in the nodes, a string $s\in \Sigma ^{*}$ can be stored in ${|s|}H_{0}(s)+o({|s|}\log \sigma )$ , where $H_{0}(s)$ is the order-0 empirical entropy of $s$ .

If the tree is balanced, the operations $\mathbf {access}$ , $\mathbf {rank} _{q}$ , and $\mathbf {select} _{q}$ can be supported in $O(\log \sigma )$ time.

Access operation[edit]

A wavelet tree contains a bitmap representation of a string. If we know the alphabet set, then the exact string can be inferred by tracking bits down the tree. To find the letter at i^th position in the string :-

Algorithm access
  Input:
    - The position i in the string of which we want to know the letter, starting at 1.
    - The top node W of the wavelet tree that represents the string
  Output: The letter at position i

    if W.isLeafNode return W.letter
    if W.bitvector[i] = 0 return access(i - rank(W.bitvector, i), W.left)
    else return access(rank(W.bitvector, i), W.right)

"←" denotes assignment. For instance, "largest ← item" means that the value of largest changes to the value of item.
"return" terminates the algorithm and outputs the following value.

In this context, the rank of a position $i$ in a bitvector $b$ is the number of ones that appear in the first $i$ positions of $b$ . Because the rank can be calculated in O(1) by using succinct dictionaries, any S[i] in string S can be accessed in $O(\log \sigma )$ ^[3] time, as long as the tree is balanced.

Extensions[edit]

Several extensions to the basic structure have been presented in the literature. To reduce the height of the tree, multiary nodes can be used instead of binary.^[2] The data structure can be made dynamic, supporting insertions and deletions at arbitrary points of the string; this feature enables the implementation of dynamic FM-indexes.^[4] This can be further generalized, allowing the update operations to change the underlying alphabet: the Wavelet Trie^[5] exploits the trie structure on an alphabet of strings to enable dynamic tree modifications.

References[edit]

^ R. Grossi, A. Gupta, and J. S. Vitter, High-order entropy-compressed text indexes, Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA), January 2003, 841-850.
^ ^a ^b P. Ferragina, R. Giancarlo, G. Manzini, The myriad virtues of Wavelet Trees, Information and Computation, Volume 207, Issue 8, August 2009, Pages 849-866
^ ^a ^b G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012
^ H.-L. Chan, W.-K. Hon, T.-W. Lam, and K. Sadakane, Compressed Indexes for dynamic text collections, ACM Transactions on Algorithms, 3(2), 2007
^ R. Grossi and G. Ottaviano, The Wavelet Trie: maintaining an indexed sequence of strings in compressed space, In Proceedings of the 31st Symposium on the Principles of Database Systems (PODS), 2012

External links[edit]

Media related to Wavelet Tree at Wikimedia Commons

[GGV03-1] R. Grossi, A. Gupta, and J. S. Vitter, High-order entropy-compressed text indexes, Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA), January 2003, 841-850.

[FGM09-2] P. Ferragina, R. Giancarlo, G. Manzini, The myriad virtues of Wavelet Trees, Information and Computation, Volume 207, Issue 8, August 2009, Pages 849-866

[Navarro12-3] G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012

[CHLK07-4] H.-L. Chan, W.-K. Hon, T.-W. Lam, and K. Sadakane, Compressed Indexes for dynamic text collections, ACM Transactions on Algorithms, 3(2), 2007

[GO12-5] R. Grossi and G. Ottaviano, The Wavelet Trie: maintaining an indexed sequence of strings in compressed space, In Proceedings of the 31st Symposium on the Principles of Database Systems (PODS), 2012

[1]

[2]

[3]

[4]

[5]

Properties[edit]

Access operation[edit]

Extensions[edit]

Further reading[edit]

References[edit]

External links[edit]