Posted to commits@directory.apache.org by ak...@apache.org on 2004/05/28 09:19:36 UTC
svn commit: rev 20535 - incubator/directory/snickers/trunk/xdocs/ber-codec
Author: akarasulu
Date: Fri May 28 00:19:35 2004
New Revision: 20535
Added:
incubator/directory/snickers/trunk/xdocs/ber-codec/BERDigesterDesign.xml
Modified:
incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml
Log:
Commit changes ...
o added the digester design document - still need to touch this up
Added: incubator/directory/snickers/trunk/xdocs/ber-codec/BERDigesterDesign.xml
==============================================================================
--- (empty file)
+++ incubator/directory/snickers/trunk/xdocs/ber-codec/BERDigesterDesign.xml Fri May 28 00:19:35 2004
@@ -0,0 +1,326 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BER Digester Design</title>
+ </properties>
+ <body>
+ <section name="Introduction">
+
+ <subsection name="What is it?">
+ <p>
+ The BERDigester is a high level decoder that builds object
+ containment trees from a train of TLV Tuple objects it receives via
+ BERDecoder callback events. It builds these containment trees
+ through the action of rules that are triggered by TLV Tag nesting
+ patterns.
+ </p>
+ </subsection>
+
+ <subsection name="Why is it called a digester?">
+ <p>
+ The idea for a higher level decoder that builds object containment
+ trees through the action of rules came when I was looking at the
+ Jakarta Commons (SAX event) Digester. SAX events are very similar
+ to the TLV callback events that are emitted by the BERDecoder.
+ The Commons Digester consumed these events. While processing these
+ events it tracked the nesting of XML elements and fired registered
+ rules when nesting patterns were matched. These rules performed
+ operations on the Digester's stack. Objects were instantiated and
+ pushed, or populated and then popped to build the containment tree.
+ </p>
+
+ <p>
+ The BERDigester applies the same pattern as the Commons
+ Digester does on a similar underlying event stream. Although BER Tuple
+ events carry completely different pieces of information, like SAX
+ event streams BER TLV event streams signal the nesting pattern of
+ underlying constructs within the stream. This hence reveals the
+ structure of an ASN.1 data type or, in the case of SAX, the structure
+ of an XML document. Both digesters consume events emanating from
+ some event source to build object containment trees based on the
+ nesting pattern revealed by those events. The containment trees are
+ built through the action of rules triggered by matching nesting
+ patterns. Just like the Commons Digester, the BERDigester uses
+ stacks to share objects across rule firings.
+ </p>
+
+ <p>
+ It's safe to say that there is such a thing as a digester pattern.
+ What better way exists to refer to the pattern and those pieces of
+ software implementing it, than to just call them digesters? I could
+ not think of any and that's why we call it the BERDigester. The BER
+ simply distinguishes it from the Commons Digester so there is no
+ confusion. (Note: BERDigester may be renamed to TupleDigester)
+ </p>
+ </subsection>
+
+ <subsection name="Rules and Stacks Are Good">
+ <p>
+ Our ultimate goal is to build a stub compiler that generates
+ encoder/decoder pairs for the various ASN.1 encodings: BER, DER, XER,
+ and PER. The compiler also generates the classes used by a digester
+ to build the containment tree of the ASN.1 data structures
+ transmitted. Generation of the stubs is not that hard to do.
+ However building a decoder that assembles containment trees using
+ those stubs and an encoded stream is not easy to do.
+ </p>
+
+ <p>
+ Stacks help out by reducing the complexity of the problem. A stack
+ can be used first of all to share objects across rule firings.
+ Secondly, it's a perfect data structure for tracking nesting patterns
+ or managing operations on object containment trees. State and scope
+ are maintained easily using a stack.
+ </p>
+
+ <p>
+ If simple stack based instructions are devised to instantiate stub
+ objects, push them and pop them, along with primitives and other
+ classes, then the complexity involved with generating decoders
+ diminishes greatly. Furthermore, if those instructions were
+ declarative rules triggered based on nesting patterns then decoder
+ generation becomes, well, a cakewalk. All a decoder needs to do is
+ register rules based on tag nesting patterns associated with the
+ ASN.1 modules with a BERDigester. A generic set of rules may be
+ created, or rules can be generated to be registered. The generated
+ decoder would hence amount to setting up the BERDigester by registering
+ rules with it for tag patterns. Code would also be added to funnel
+ data into the underlying BERDecoder as well as code to respond to
+ the callbacks generated by the top level PDU decoder.
+ </p>
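To make this concrete, here is a sketch of what such a generated decoder's setup might look like. The `Digester` and `Rule` types below are illustrative stand-ins invented for this example, not the actual Snickers API, and the tag int values are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the Rule interface and Digester class below
// are assumptions made for this example, not the real Snickers API.
public class GeneratedDecoderSketch {
    interface Rule {
        void fire(Object digester);
    }

    static class Digester {
        private final List<int[]> patterns = new ArrayList<>();
        private final List<Rule> rules = new ArrayList<>();

        // a generated decoder registers one rule per tag nesting pattern
        void addRule(int[] tagNestingPattern, Rule rule) {
            patterns.add(tagNestingPattern);
            rules.add(rule);
        }

        int ruleCount() {
            return rules.size();
        }
    }

    static Digester setupDecoder() {
        Digester digester = new Digester();
        // placeholder tag ints standing in for an ASN.1 module's tags
        digester.addRule(new int[] { 0x10000000, 0x02000000 },
                d -> { /* instantiate a stub and push it */ });
        digester.addRule(new int[] { 0x10000000, 0x04000000 },
                d -> { /* populate the stub on top of the stack */ });
        return digester;
    }

    public static void main(String[] args) {
        System.out.println(setupDecoder().ruleCount()); // prints 2
    }
}
```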
+ </subsection>
+ </section>
+
+ <section name="BERDigester Design">
+ <subsection name="Nesting Stack">
+ <p>
+ The design documentation for the BERDecoder discusses an encoding
+ for Tags using the first four octets of a Tag field. These four
+ octets are stored within a primitive Java int. These ints represent
+ the raw Tag of a TLV tuple. We could push these raw Tag ints onto
+ a stack: let's for the time being call this primitive Java int stack
+ the tagStack. Every time a new TLV Tuple is delivered, the digester
+ pushes the Tag int onto the tagStack. When processing of the TLV
+ Tuple is finished, the tagStack is popped removing the Tag int for
+ the Tuple. The sequence of Tag ints within the tagStack represents
+ the reverse nesting order of TLV Tuples. Rules are registered with
+ the digester using int[]s for patterns; these int[]s are compared
+ against the sequence of ints within the tagStack, hence the tag
+ nesting pattern. All rules registered with a matching pattern of
+ Tag ints are fired in the order they were registered.
+ </p>
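The tagStack lifecycle described above can be sketched as follows. Boxed Integers on an ArrayDeque are used here purely for brevity; the digester itself would use a primitive int stack:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the tagStack lifecycle: push a Tag int when a TLV is
// delivered, pop it when the TLV is finished. Boxed Integers are used
// for brevity; the digester would use a primitive int stack.
public class TagStackSketch {
    private final Deque<Integer> tagStack = new ArrayDeque<>();

    void tupleStarted(int tag) {
        tagStack.push(tag);   // new TLV delivered: record its Tag
    }

    void tupleFinished() {
        tagStack.pop();       // TLV done: remove its Tag
    }

    int depth() {
        return tagStack.size();
    }

    public static void main(String[] args) {
        TagStackSketch digester = new TagStackSketch();
        digester.tupleStarted(0x10000000);  // outer constructed TLV
        digester.tupleStarted(0x02000000);  // nested TLV
        System.out.println(digester.depth()); // prints 2
        digester.tupleFinished();
        System.out.println(digester.depth()); // prints 1
    }
}
```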
+ <p>
+ This sounds pretty simple and straightforward at first glance but
+ there are complications. The BER encoding provides for two ways in
+ which string types can be encoded: as a primitive TLV or as a
+ constructed TLV. By string types, we refer to all character based
+ string types like IA5String as well as bit strings and octet strings.
+ String types can be encoded using a single TLV as a primitive type.
+ This way the entire string is kept within the value field. The
+ designers of BER were smart enough to realize that value chunking
+ would need to occur for large strings to enable efficient decoding
+ and encoding. Hence these string types can optionally be encoded as
+ constructed TLV tuples with smaller nested tuples carrying fragments
+ of the string. This way encoders would not need to calculate the
+ length of large strings in advance or keep the entire string in
+ memory. The value can be fragmented and chunked explicitly by the
+ encoding. This is great, however the Tag int for a primitive TLV of
+ a string type will not equal the Tag int for the constructed TLV of
+ the same string type. Pattern matching cannot occur under these
+ circumstances.
+ </p>
+ <p>
+ This complication forced me to rethink the use of raw tag integers.
+ Instead of pushing the tag int as is onto the tagStack we could
+ always reduce the raw tag to some canonical form then push that int
+ instead. Furthermore, all patterns supplied to register rules must
+ also have their ints scrubbed within the int[] pattern, so pattern
+ matching will occur using only the canonical representation of Tag
+ ints rather than actual Tag integers.
+ </p>
+ <p>
+ The primitive/constructed flag exists on a Tag's first octet as a
+ single bit. We want to prevent variations in this bit from throwing
+ off our pattern matching. Choosing a canonical form is like picking
+ whether to lower case or upper case letters before
+ attempting a comparison. In our case :-), we have chosen the
+ canonical form to be the tag with this primitive/constructed bit
+ turned off. So before any comparison occurs all tag integers will
+ be guaranteed to have this bit flag turned off. In fact all TagEnum
+ int values prefabricated by the compiler will have the Tag int
+ scrubbed, meaning the primitive/constructed bit will be in the off
+ position and hence in the canonical form.
+ </p>
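A minimal sketch of the scrubbing step follows. It assumes the first tag octet occupies the most significant byte of the Java int, which would put the primitive/constructed flag (bit 0x20 of that octet in BER) at 0x20000000; check this mask against the actual Snickers tag encoding before relying on it:

```java
// Sketch of tag canonicalization. In BER the primitive/constructed
// flag is bit 0x20 of the first tag octet; ASSUMING that octet sits in
// the most significant byte of the Java int, the flag lands at
// 0x20000000. Verify against the real BERDecoder tag encoding.
public class TagCanonicalizer {
    static final int CONSTRUCTED_FLAG = 0x20000000;

    static int toCanonical(int rawTag) {
        return rawTag & ~CONSTRUCTED_FLAG; // turn the flag off
    }

    public static void main(String[] args) {
        int primitiveForm = 0x04000000;    // flag already off
        int constructedForm = 0x24000000;  // same tag, flag on
        // both forms reduce to the same canonical int, so pattern
        // matching is unaffected by the encoder's choice
        System.out.println(
            toCanonical(primitiveForm)
                == toCanonical(constructedForm)); // prints true
    }
}
```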
+ </subsection>
+
+ <subsection name="Pattern Matching">
+ <p>
+ Without going into too much detail, matching a rule pattern to the
+ nesting of tag ints within the tagStack seems pretty simple.
+ Compare each int value at a position within both int sequences and
+ see if at any point there are differences. If there are differences
+ the pattern does not match the tag nesting pattern. If there are no
+ differences then it's a match. The issues with this algorithm start
+ popping up when we consider performance and the number of rules that
+ must be compared within the digester's rule base (set of registered
+ rules). To calculate the matching rules for a nesting pattern we
+ would have to perform N comparisons where N is the number of
+ registered rules. These calculations would have to occur every time
+ a tag int is pushed onto the stack. If you ask me this brute force
+ approach is unacceptable.
+ </p>
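The brute force approach dismissed above would look roughly like this sketch, comparing every registered pattern against the current nesting on each push:

```java
import java.util.ArrayList;
import java.util.List;

// The brute force matching the text rules out: compare every
// registered pattern against the tagStack contents on each push.
public class BruteForceMatcher {
    static boolean matches(int[] pattern, int[] nesting) {
        if (pattern.length != nesting.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (pattern[i] != nesting[i]) {
                return false;
            }
        }
        return true;
    }

    // O(N) comparisons for N registered patterns, on every tag push
    static List<Integer> matchingRules(List<int[]> patterns,
                                       int[] nesting) {
        List<Integer> matched = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (matches(patterns.get(i), nesting)) {
                matched.add(i);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<int[]> patterns = new ArrayList<>();
        patterns.add(new int[] { 1, 2 });
        patterns.add(new int[] { 1, 3 });
        // only the first registered rule matches nesting 1-2
        System.out.println(matchingRules(patterns, new int[] { 1, 2 }));
    }
}
```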
+ <p>
+ To reduce the need to compute N comparisons we have built a special
+ data structure used to store and rapidly search for all matching
+ rules. The data structure is a tree of TagNodes. Every TagNode
+ in the tree stores the canonical representation of a tag int and a
+ set of rules. TagNodes may have child TagNodes keyed by their tag
+ int value. A depth first tree walk is conducted to search for the
+ set of rules that are matched by the current nesting pattern. The
+ tree walk starts at the root of the tree using the Tag int value to
+ determine the child to lookup. The walk then continues with the
+ child and the next Tag int used to lookup the next child and so on.
+ If the walk consumes all tag ints within the stack and lands on a
+ TagNode then all the rules at that node have matched the nesting
+ pattern.
+ </p>
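A minimal sketch of the TagNode tree and its walk, using plain strings in place of real rule objects (the class shape here is illustrative, not the actual Snickers implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the TagNode tree: each node holds rules and children keyed
// by (canonical) tag int. A walk driven by the nesting pattern finds
// all matching rules in O(m) for nesting depth m.
public class TagTreeSketch {
    static class TagNode {
        final Map<Integer, TagNode> children = new HashMap<>();
        final List<String> rules = new ArrayList<>(); // rule names
    }

    final TagNode root = new TagNode();

    // registering a rule adds a path of TagNodes, reusing existing ones
    void addRule(int[] pattern, String rule) {
        TagNode node = root;
        for (int tag : pattern) {
            node = node.children.computeIfAbsent(tag,
                    t -> new TagNode());
        }
        node.rules.add(rule);
    }

    // walk the tree consuming the nesting pattern; if the walk lands on
    // a node after consuming every tag int, that node's rules match
    List<String> match(int[] nesting) {
        TagNode node = root;
        for (int tag : nesting) {
            node = node.children.get(tag);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        return node.rules;
    }

    public static void main(String[] args) {
        TagTreeSketch tree = new TagTreeSketch();
        tree.addRule(new int[] { 1, 2, 3 }, "r1");
        tree.addRule(new int[] { 1, 2 }, "r2");
        System.out.println(tree.match(new int[] { 1, 2, 3 })); // [r1]
        System.out.println(tree.match(new int[] { 1, 2 }));    // [r2]
    }
}
```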
+ <p>
+ This tree data structure is built as rules are registered with the
+ digester. Each rule registered either adds a new path of TagNodes
+ to the tree, or adds itself to an existing node within the tree.
+ Once built the structure can be searched to determine the rules
+ matched by a nesting pattern. The data structure is small, cheap
+ and turns an O(nm) search operation, where n is the number of
+ rules and m is the current tag nesting depth, into an O(m) search
+ operation, which is much better.
+ </p>
+ <p>
+ The only problem I have at this point is how to implement the
+ wildcard patterns used to register rules. I have some ideas
+ on how this can be done though. First of all a special reserved tag
+ int value must be selected so it does not conflict with any valid
+ tags we might encounter. I think any of the UNIVERSAL tags above
+ 32 can be used for this but I would have to check. Presuming this
+ were known already there are four kinds of wildcard use cases which
+ I have outlined below:
+ </p>
+
+ <ol>
+ <li>wildcard at the front of the pattern</li>
+ <li>wildcard at the end of the pattern</li>
+ <li>wildcard in the middle of the pattern</li>
+ <li>two or more wildcards in the middle of the pattern</li>
+ </ol>
+
+ <p>
+ There is no way we would accommodate all these use cases. First off
+ some of them will be rarely if ever used. Secondly some would be
+ impossible or too difficult to implement for the return in
+ expressivity. At this point I plan to
+ implement any of these use cases on an as needed basis. However
+ the two noteworthy use cases that stick out are #1 and #2.
+ When handling fragmented strings encoded using constructed tuples
+ #2 will come in handy if we want a rule to fire off for each
+ fragment. Likewise #1 will be useful as well. Sometimes a pattern
+ may occur multiple times in different places and you want to detect
+ them all using a single rule. In this case, a pattern with a
+ wildcard at the front (case #1) does the job. Use cases #3 and #4
+ are very difficult to implement not to mention expensive in terms
+ of processing. They just don't seem worth it.
+ </p>
+
+ <p>
+ Now the question is how do we search with a wildcard in the front or
+ at the tail end of the pattern. Perhaps the easiest of the two to
+ handle would be a wildcard at the tail end of the pattern. A search
+ could be conducted using everything but the wildcard as if it were a
+ regular search without wildcards. If this lands on a TagNode then
+ all the rules of that node and that node's children have been
+ matched by the pattern. The other use case, #1, is not as easy to
+ implement. For this another special TagTree could be built, however
+ it would be assembled in reverse order using only the tails of those
+ patterns starting with a wildcard. If for example *-4-8-9 and
+ *-6-9-1 patterns are used to register rules r1, and r2 respectively
+ then the root node would contain two child nodes, one for a tag int
+ equal to 9 and another equal to 1. These two rule patterns would
+ result in two tree branches. Again when adding new patterns we reuse
+ what already exists before branching. To search these trees for a
+ match the reverse nesting pattern is used to walk the tree. Take
+ for example the pattern 1-8-9. This would start off at the root,
+ select the child node with a tag int equal to 9, then select the
+ child under node 9 with a tag int equal to 8. At this point we can
+ go no further, so a check is performed to see if the node
+ we are stuck at is a leaf node. If it is a leaf node then we have a
+ match for the rules at that node, if not then no rules were matched.
+ For 1-8-9 no rules are matched. Now use 1-6-4-8-9 as the nesting
+ pattern to test. Using the same algorithm we find ourselves stuck
+ walking the tree at node 4 which is a leaf node. In this case the
+ rules in node 4 have been matched by the nesting pattern. Keep in
+ mind that the search against the reverse TagTree is conducted in
+ addition to the search against the forward TagTree. It's easy to see
+ that this use case is far more painful to implement.
+ </p>
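The reverse TagTree idea can be sketched as follows; since this is still a paper design, the class and method names are purely illustrative. The example reproduces the *-4-8-9 / *-6-9-1 walk-through from the text:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the reverse TagTree for patterns with a leading wildcard.
// Pattern tails are inserted in reverse order; a lookup walks the
// nesting pattern backwards and matches only when stuck on a leaf.
public class ReverseTagTreeSketch {
    static class Node {
        final Map<Integer, Node> children = new HashMap<>();
        final List<String> rules = new ArrayList<>();
    }

    final Node root = new Node();

    // register a *-prefixed pattern; 'tail' is the part after the '*'
    void addWildcardRule(int[] tail, String rule) {
        Node node = root;
        for (int i = tail.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(tail[i],
                    t -> new Node());
        }
        node.rules.add(rule);
    }

    // walk the nesting pattern in reverse; match only on a leaf node
    List<String> match(int[] nesting) {
        Node node = root;
        for (int i = nesting.length - 1; i >= 0; i--) {
            Node next = node.children.get(nesting[i]);
            if (next == null) {
                break;
            }
            node = next;
        }
        return node.children.isEmpty() ? node.rules
                                       : new ArrayList<>();
    }

    public static void main(String[] args) {
        ReverseTagTreeSketch tree = new ReverseTagTreeSketch();
        tree.addWildcardRule(new int[] { 4, 8, 9 }, "r1"); // *-4-8-9
        tree.addWildcardRule(new int[] { 6, 9, 1 }, "r2"); // *-6-9-1
        // 1-8-9 gets stuck on a non-leaf: no match, as in the text
        System.out.println(tree.match(new int[] { 1, 8, 9 }));
        // 1-6-4-8-9 gets stuck on leaf node 4: r1 matches
        System.out.println(tree.match(new int[] { 1, 6, 4, 8, 9 }));
    }
}
```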
+ </subsection>
+
+ <subsection name="Composition Stacks">
+ <p>
+ When rules fire they do something either to a stack or to an object
+ on a stack. Either way rules pop, push, or peek at digester stacks
+ to do their bidding. These stacks are used to build the final
+ product which is a containment tree and hence are referred to as
+ composition stacks.
+ </p>
+ <p>
+ The digester contains several stacks. It contains a primitive Java
+ type stack for every Java primitive type. There is a stack for
+ booleans, bytes, chars, shorts, ints, longs, floats, and doubles.
+ In addition to these stacks there is another Object stack. Multiple
+ primitive stacks were used to avoid having to wrap primitive types
+ in their object wrappers just to push them onto an Object stack.
+ Some might think this is excessive, and others may think it
+ increases complexity. If the use of a single Object stack is
+ thought through then it becomes apparent that neither is the case.
+ </p>
+ <p>
+ If a single Object stack were to be used then rules dealing with
+ primitive types would need to wrap them with their respective Object
+ wrappers before pushing them onto the stack. When popping Objects
+ from the stack rules must know what type the popped object is
+ expected to be. If multiple stacks were used it would be much
+ easier to just get the primitive from another stack rather than
+ popping the Object stack and casting the popped Object to the Object
+ wrapper just to extract the primitive value. Having a primitive
+ stack for each Java primitive type is a PITA for the developer, but
+ it simplifies things for the user. Plus there's the advantage of not
+ having to wrap and extract primitive types just to pass them around
+ in the stack.
+ </p>
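The boxing overhead avoided by dedicated primitive stacks can be seen in this sketch, which contrasts a hand rolled int stack with a single Object stack (both types here are illustrative, not the digester's actual stack classes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch contrasting a dedicated primitive int stack with a single
// Object stack. The IntStack avoids wrapping each value in an Integer
// just to move it between rule firings.
public class PrimitiveStackSketch {
    static class IntStack {
        private int[] values = new int[16];
        private int size = 0;

        void push(int value) {
            if (size == values.length) {      // grow when full
                int[] grown = new int[values.length * 2];
                System.arraycopy(values, 0, grown, 0, size);
                values = grown;
            }
            values[size++] = value;
        }

        int pop() {
            return values[--size];
        }
    }

    public static void main(String[] args) {
        // with a dedicated int stack: no wrapping, no casting
        IntStack intStack = new IntStack();
        intStack.push(42);
        int fromIntStack = intStack.pop();

        // with a single Object stack: wrap on push, cast on pop
        Deque<Object> objectStack = new ArrayDeque<>();
        objectStack.push(Integer.valueOf(42));
        int fromObjectStack = (Integer) objectStack.pop();

        System.out.println(fromIntStack == fromObjectStack); // true
    }
}
```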
+ </subsection>
+
+ <subsection name="Decoder Chaining/Stacking">
+ <p>
+ If one looks closely the digester contains a BERDecoder member. This
+ decoder is the source of all TLV Tuple events. The digester is
+ designed as a decoder itself that feeds incoming byte buffers to this
+ underlying BERDecoder. An inner class, DigesterCallback, is used to
+ implement a BERDecoderCallback. This inner class bridges BERDecoder
+ TLV Tuple events, delivering them to the digester. The digester in
+ turn updates its tag nesting stack, and processes the new TLV
+ announced with the event. The digester detects the end of a
+ Protocol Data Unit (PDU) when it pops the last TLV tuple off of the
+ tagStack. When this happens the constructed object which should
+ have been popped off of the Object stack is returned using a
+ callback as the root of the containment tree.
+ </p>
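The bridging arrangement described above can be sketched like this; the interfaces are simplified stand-ins for the actual BERDecoder and BERDecoderCallback types:

```java
// Sketch of decoder chaining: an inner callback bridges low level
// decoder events into the digester. These interfaces are simplified
// stand-ins for the real BERDecoder/BERDecoderCallback API.
public class ChainingSketch {
    interface TupleCallback {
        void tupleDelivered(int tag);
    }

    // stand-in for the low level BERDecoder event source
    static class LowLevelDecoder {
        TupleCallback callback;

        void decode(int tag) {       // pretend a buffer produced a TLV
            callback.tupleDelivered(tag);
        }
    }

    static class Digester {
        private final LowLevelDecoder decoder = new LowLevelDecoder();
        int tuplesSeen = 0;

        Digester() {
            // the bridge: deliver low level events to the digester
            decoder.callback = new TupleCallback() {
                public void tupleDelivered(int tag) {
                    tuplesSeen++;    // update nesting state, fire rules
                }
            };
        }

        void decode(int tag) {
            decoder.decode(tag);     // funnel input to the inner decoder
        }
    }

    public static void main(String[] args) {
        Digester digester = new Digester();
        digester.decode(0x10000000);
        digester.decode(0x02000000);
        System.out.println(digester.tuplesSeen); // prints 2
    }
}
```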
+
+ <p>
+ The digester is a great example of how decoders can be chained or
+ stacked together in almost a linear combination. It's like stacking
+ filters on top of one another.
+ </p>
+ </subsection>
+
+ </section>
+ </body>
+</document>
Modified: incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml
==============================================================================
--- incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml (original)
+++ incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml Fri May 28 00:19:35 2004
@@ -57,6 +57,11 @@
</tr>
<tr>
+ <td><a href="./BERDigesterDesign.html">BER Digester Design</a></td>
+ <td>Explains how and why the BERDigester was designed</td>
+ </tr>
+
+ <tr>
<td><a href="./BEREncoderUserGuide.html">BER Encoder User Guide</a></td>
<td>Describes how to use the BEREncoder to generate a TLV stream</td>
</tr>