You are viewing a plain text version of this content. The canonical link for it is here.
Posted to fop-commits@xmlgraphics.apache.org by Apache Wiki <wi...@apache.org> on 2005/05/04 15:31:59 UTC

[Xmlgraphics-fop Wiki] Update of "AutomaticHyphenation" by JoergPietschmann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Xmlgraphics-fop Wiki" for change notification.

The following page has been changed by JoergPietschmann:
http://wiki.apache.org/xmlgraphics-fop/AutomaticHyphenation

------------------------------------------------------------------------------
    end if
  end for
  }}}
- After this, odd integers in the hyphenation position vector mark valid hyphenation points, while even integers (including zero) denote potitions unavailable for hyphenation.
+ After this, odd integers in the hyphenation position vector mark valid hyphenation points, while even integers (including zero) denote positions unavailable for hyphenation.
  
  === Data Details ===
  
@@ -95, +95 @@

  }}}
  The result is a possible hyphenation as para-meter, indicated by the weight 3 at the 5th position, while all the zeroes and the weight 2 at position 4 inhibit hyphenation.
  
- The order of substring matching doesn't matter, scanning the substrings e.g. by substring length leads to the same result.
+ The order of substring matching doesn't matter, scanning the substrings e.g. by substring length rather than by start position leads to the same result.
  
  The example also demonstrates why the hyphenation position vector extends left and right beyond the boundaries of the word: patters may have weights associated with the position before or after the end of the pattern, and extending the hyphenation position vector saves explicit testing whether such a pattern matches at the beginning or the end of the word. The positions are clipped after the weights of all matching patterns have been merged, which is much more efficient.
  
@@ -151, +151 @@

  
  == Efficient Pattern Storage Implementation ==
  
- Keywords: suffix tree, trie, ternary tree, PATRICIA tree, etc.; link DADS and freshmeat projects
+ === Tries, Trie Implementations and Derived Data Structures ===
  
- The pattern storage can be basically any data structure which can map strings to an integer array, for example a hash map. However, a word has O( length(word)^2 ) substrings, and the word "parameter" requires 66 lookups (remember the sentinel characters, which count too). On the other hand, matches are rare. There are some obvious ways to reduce the number of lookups. For example, if there are no strings starting with "para" in the map, we don't have to look up "param", "parame", "paramet" etc.
+ The pattern storage can be basically any data structure which can map strings to integer arrays, for example a hash map. However, a word has O( length(word)^2 ) substrings, and the word "parameter" requires 66 lookups (remember the sentinel characters, which count too). On the other hand, matches are rare. There are some obvious ways to reduce the number of lookups. For example, if there are no strings starting with "para" in the map, we don't have to look up "param", "parame", "paramet" etc.
  
- A data structure specifically designed for storing a mapping from strings to arbitrary values is the [http://www.nist.gov/dads/HTML/trie.html trie] ([http://en.wikipedia.org/wiki/Trie Wikipedia link]). While tries look like trees, the key strings are not stored in the tree nodes; rather, a key string is represented by the path from the tree root to a leaf. An example of a trie storing the strings "am", "aram" and ".para":
+ A data structure specifically designed for storing a mapping of strings to arbitrary values is the [http://www.nist.gov/dads/HTML/trie.html trie] ([http://en.wikipedia.org/wiki/Trie Wikipedia link]). While tries look like trees, the key strings are not stored in the tree nodes; rather, a key string is represented by the path from the tree root to a leaf. An example of a trie storing the strings "am", "aram" and ".para":
  {{{
  root -*- a -*- m -> leaf
        |     |
@@ -163, +163 @@

        |
        +- . -*- p -*- a -*- r -*- a -> leaf
  }}}
- The nodes a represented as a "*".
+ The nodes are represented as a "*".
  
- There are multiple ways to implement the concept of a trie. In the original implementation by Liang and Knuth in TeX, a trie node is basically a sorted array of bytes which holds the downpath characters and a corresponding array of C pointers to the corresponding nodes. The implementors faced the usual tradeoff between space and access time and between insert time and find time complexity. In natural languages, the number of bigrams and trigrams (pairs and triples of successive characters) is limited. Some of the trie node character arrays would be quite full, in particular near the root, but further down the tree node usually fork only into a few nodes further downpath. 
+ There are multiple ways to implement the concept of a trie. In the original implementation by Liang and Knuth in TeX, a trie node is basically a sorted array of bytes which holds the downpath characters and a corresponding array of C pointers to the corresponding nodes. The implementors faced the usual tradeoff between space and access time and between insert time and find time complexity. In natural languages, there is a high correlation between successive characters in words. Some of the trie node character arrays would be quite full, in particular near the root, but further down the tree node usually forks only into a few nodes further downpath. 
  
- The FOP implementation has been called "ternary tree", and a trie node here is an ordinary binary tree itself, with the characters of the trie node as keys. The FOP ternary tree merges both the pointer structure of the trie and the binary tree of the trie nodes. BTW there are no mechanisms to keep the binary trees of the trie nodes balanced. The FOP hyphenation compiler first inserts the patterns as they come from the file (which leads to grossly unbalanced trees), then the data is copied using a median strategy, which generates balanced binary trees for the trie nodes.
+ The FOP implementation has been called "ternary tree", and a trie node here is an ordinary binary tree itself, with the characters of the trie node as keys. The FOP ternary tree merges both the pointer structure of the trie and the binary tree of the trie nodes. BTW there are no mechanisms to keep the binary trees of the trie nodes balanced. The FOP hyphenation compiler first inserts the patterns as they come from the file (which leads to grossly unbalanced trees), then the data is copied using a median strategy, which generates balanced binary trees for the trie nodes. Finally, in order to make the confusion complete, the FOP ternary tree doesn't use Java objects to represent the tree nodes but integer and character arrays.
+ 
+ As already noted, when storing substrings of natural language words in a trie, the further down the trie node are the more likely they are sparsely populated. The extreme case, only one child node, occurs quite often. In the example above, there are only two nodes with more than a child: the root node and the node below the "a" arc from the root node. This may seem wastful, both in terms of storage and possibly lookup time (pointer dereference and perhaps the non-locality of memory access). A natural optimization would be to store sub-tries with only one leaf or intermediate sequences of trie nodes with only one child in a more compact form. This leads to [http://www.nist.gov/dads/HTML/patriciatree.html PATRICIA trees] [http://en.wikipedia.org/wiki/Patricia_trie Wikipedia link] and other specializations.
+ 
+ Search [http://freshmeat.net FreshMeat] for sample implementations of [http://freshmeat.net/search/?q=trie&section=projects&Go.x=0&Go.y=0 tries]. Google will turn up some more.
+ 
+ === Alphabet Transformation ===
+ 
+ The hyphenation patterns for a certain language usually use only a small subset of all Unicode characters, often from a single Unicode block. Storing the characters as Java char (16 bit) may seem wasteful. The usual advice is to  map the used characters onto small numbers which will fit into a byte.
+ 
+ === Weight Compression ===
  
  == Refinements And Embedding Of the Algorithm ==
  
+ In order to use the algorithm above, some more surrounding code is necessary. 
  Keywords: parse words out of the text, character normalization (link Unicode TR, icu4j etc.), alphabet mapping/transformation
  
  == How To Generate Patterns ==

---------------------------------------------------------------------
To unsubscribe, e-mail: fop-commits-unsubscribe@xmlgraphics.apache.org
For additional commands, e-mail: fop-commits-help@xmlgraphics.apache.org