You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by "J.Pietschmann" <j3...@yahoo.de> on 2003/11/29 22:34:43 UTC

[collections] [lang] Req: Trie/ternary tree data structure

Hi all,
I'd like to tap into the community wisdom for some thoughts.

  == Background ==
There are various task which require all substrings of a certain
string to be looked up in a collection. The application I'm
concerned with is pattern based word hyphenation, but there are
others (spell check, stemming).
An example: take the string "word". The task requires looking up
the 10 strings "w", "wo", "wor", "word", "o", "or", "ord", "r",
"rd" and "d".

The idea is to take advantage of the fact that the strings to look
up are related because they are substrings of another string. Another
point is that you know that you deal with strings/characters, not
arbitrary data as keys.
Liang, Knuth et al. used a trie for handling this efficiently, both
in terms of performance and memory. A node in a trie is basically
a sorted array of characters, so that the character can be looked up
quickly. The slot then points to the stored value as well as to
another node.
If it is known that a key string in the trie is never a substring
of another key string, the pointers to the value and the following
node can take up the same storage. All of the patters for hyphenation
i've seen have this property, but there is no guarantee.

The Apache FOP hyphenator uses a ternary tree as implementation.
Basically it's the same, except the sorted array is replaced by a
binary tree. Instread of using Java object for the tree nodes, tree
char arrays are used for storing indices to the respective following
nodes. Ideally, the binary trees should be balanced in order to provide
the same O(log(n)) time for looking up like the sorted arrays in the trie
do. Of course, if the ternary tree is build by inserting random key
strings this requires red-black-trees or a similar mechanism.
See
   http://cvs.apache.org/viewcvs.cgi/xml-fop/src/java/org/apache/fop/
   layout/hyphenation/TernaryTree.java?rev=1.3&
   content-type=text/vnd.viewcvs-markup
for the current state of the art.

An example how strings are looked up in a trie: Suppose the collection
contains the key strings "ord", "ot" and "rda".
The trie is roughly
     (0) o  r
        /    \
  (1) r t    d (2)
      | -    |
  (3) d      a (4)
      -      -

Now for the string lookup
i=0 j=0 c='w' no match in (0) -> continue
i=1 j=0 c='o' match 'o' in (0) -> (1)
     j=1 c='r' match 'r' in (1) -> (3)
     j=3 c='d' match 'd' in (3), leaf node, get value
i=2 j=0 c='r' match 'r' in (0) -> (2)
     j=1 c='d' match 'd' in (2) -> (4)
     eos but no leaf node -> continue
i=3 j=0 c='d' no match (0) -> continue

You see, none of the longer substrings starting with "w" is actually
looked up, which should provide some performance gain over simply using
a hashtable.

  == Requirements ==

The data structure is built once from mor or less randomly distributed key
strings. After this, it is used for many lookups. This means the key
strings can first be gathered and analyzed, then stored into an optimized
data structure.

The number of expected key strings is of the order of several thousend to
a few hundred thousand. Efficient memory usage is a must. Tradeoffs between
using less memory and lookup performance should be carefully considered.

It should be possible to make the optimized data structure persistent. It
should load fast from a data or serialized object file. Unfortunately, I
have no experience whether it is more advanteous to serialize a tree of
Java objects or an object with a one or few big arrays.

  == Considerations ==

1. As already said, for the current hyphenation patters no key string
is a substring at the beginning of another key. I.e if there is a "wor"
key, there won't be "word" or "wora" keys, but there may be a "kword"
key. This may allow some storage optimizations. (see
   http://cvs.apache.org/viewcvs.cgi/xml-fop/src/hyph/en.xml
   ?rev=1.2&content-type=text/vnd.viewcvs-markup
for an example, the keys are the strings in the <patterns> sections sans
the digits)
However for using such a data structure for stemming, there may be keys
which are also at the satrt of other keys. Actually, the curent state for
hyphenation patters is to some extent an artefact of how the patterns are
created, and it may well be it means we are using suboptimal patterns.

2. In most cases, the set of unicode characters in the key strings is limited.
It is possible that the charaters can be mapped onto bytes. For hyphenation,
a mapping of the characters of the original string is necessary anyway
(normalizing uppercase characters to lowercase etc.).

Ok, now what do you think?

Regards
J.Pietschmann



---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

Re: [collections] [lang] Req: Trie/ternary tree data structure

Posted by Stephen Colebourne <sc...@btopenworld.com>.

The subject of Tries has come up before, it'll all be on the archives. I
think there even may have been an implementation floating around.

At the time I blocked its addition to collections because it strikes me as a
relatively specific collection. My view basically remains that - collections
is now very big, and has recently effectively split into three.

If you want specific guidance on something, someone may be able to help. Its
not something I've played with, although I can sort of understand the design
and why it might be used.

Not sure that help really ;-)
Stephen


----- Original Message -----
From: "J.Pietschmann" <j3...@yahoo.de>
> Hi all,
> I'd like to tap into the community wisdom for some thoughts.
>
>   == Background ==
> There are various task which require all substrings of a certain
> string to be looked up in a collection. The application I'm
> concerned with is pattern based word hyphenation, but there are
> others (spell check, stemming).
> An example: take the string "word". The task requires looking up
> the 10 strings "w", "wo", "wor", "word", "o", "or", "ord", "r",
> "rd" and "d".
>
> The idea is to take advantage of the fact that the strings to look
> up are related because they are substrings of another string. Another
> point is that you know that you deal with strings/characters, not
> arbitrary data as keys.
> Liang, Knuth et al. used a trie for handling this efficiently, both
> in terms of performance and memory. A node in a trie is basically
> a sorted array of characters, so that the character can be looked up
> quickly. The slot then points to the stored value as well as to
> another node.
> If it is known that a key string in the trie is never a substring
> of another key string, the pointers to the value and the following
> node can take up the same storage. All of the patters for hyphenation
> i've seen have this property, but there is no guarantee.
>
> The Apache FOP hyphenator uses a ternary tree as implementation.
> Basically it's the same, except the sorted array is replaced by a
> binary tree. Instread of using Java object for the tree nodes, tree
> char arrays are used for storing indices to the respective following
> nodes. Ideally, the binary trees should be balanced in order to provide
> the same O(log(n)) time for looking up like the sorted arrays in the trie
> do. Of course, if the ternary tree is build by inserting random key
> strings this requires red-black-trees or a similar mechanism.
> See
>    http://cvs.apache.org/viewcvs.cgi/xml-fop/src/java/org/apache/fop/
>    layout/hyphenation/TernaryTree.java?rev=1.3&
>    content-type=text/vnd.viewcvs-markup
> for the current state of the art.
>
> An example how strings are looked up in a trie: Suppose the collection
> contains the key strings "ord", "ot" and "rda".
> The trie is roughly
>      (0) o  r
>         /    \
>   (1) r t    d (2)
>       | -    |
>   (3) d      a (4)
>       -      -
>
> Now for the string lookup
> i=0 j=0 c='w' no match in (0) -> continue
> i=1 j=0 c='o' match 'o' in (0) -> (1)
>      j=1 c='r' match 'r' in (1) -> (3)
>      j=3 c='d' match 'd' in (3), leaf node, get value
> i=2 j=0 c='r' match 'r' in (0) -> (2)
>      j=1 c='d' match 'd' in (2) -> (4)
>      eos but no leaf node -> continue
> i=3 j=0 c='d' no match (0) -> continue
>
> You see, none of the longer substrings starting with "w" is actually
> looked up, which should provide some performance gain over simply using
> a hashtable.
>
>   == Requirements ==
>
> The data structure is built once from mor or less randomly distributed key
> strings. After this, it is used for many lookups. This means the key
> strings can first be gathered and analyzed, then stored into an optimized
> data structure.
>
> The number of expected key strings is of the order of several thousend to
> a few hundred thousand. Efficient memory usage is a must. Tradeoffs
between
> using less memory and lookup performance should be carefully considered.
>
> It should be possible to make the optimized data structure persistent. It
> should load fast from a data or serialized object file. Unfortunately, I
> have no experience whether it is more advanteous to serialize a tree of
> Java objects or an object with a one or few big arrays.
>
>   == Considerations ==
>
> 1. As already said, for the current hyphenation patters no key string
> is a substring at the beginning of another key. I.e if there is a "wor"
> key, there won't be "word" or "wora" keys, but there may be a "kword"
> key. This may allow some storage optimizations. (see
>    http://cvs.apache.org/viewcvs.cgi/xml-fop/src/hyph/en.xml
>    ?rev=1.2&content-type=text/vnd.viewcvs-markup
> for an example, the keys are the strings in the <patterns> sections sans
> the digits)
> However for using such a data structure for stemming, there may be keys
> which are also at the satrt of other keys. Actually, the curent state for
> hyphenation patters is to some extent an artefact of how the patterns are
> created, and it may well be it means we are using suboptimal patterns.
>
> 2. In most cases, the set of unicode characters in the key strings is
limited.
> It is possible that the charaters can be mapped onto bytes. For
hyphenation,
> a mapping of the characters of the original string is necessary anyway
> (normalizing uppercase characters to lowercase etc.).
>
> Ok, now what do you think?
>
> Regards
> J.Pietschmann
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org