You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@directory.apache.org by ak...@apache.org on 2005/02/21 22:44:49 UTC

svn commit: r154725 [1/2] - in incubator/directory/asn1/branches/rewrite: ber/xdocs/ codec/xdocs/ stub-compiler/xdocs/

Author: akarasulu
Date: Mon Feb 21 13:44:46 2005
New Revision: 154725

URL: http://svn.apache.org/viewcvs?view=rev&rev=154725
Log:
missed the moved files because I did not merge but ran a patch diff of trunk - my bad

Added:
    incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/eureka.xml
    incubator/directory/asn1/branches/rewrite/ber/xdocs/index.xml
    incubator/directory/asn1/branches/rewrite/codec/xdocs/
    incubator/directory/asn1/branches/rewrite/codec/xdocs/index.xml
    incubator/directory/asn1/branches/rewrite/stub-compiler/xdocs/
    incubator/directory/asn1/branches/rewrite/stub-compiler/xdocs/index.xml

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,552 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>BER Decoder Design</title>
+  </properties>
+
+  <body>
+    <section name="Factors Driving the Design">
+      <p>
+        Several factors drove the design.  Some of these were covered in
+        advance in the section on stateful codecs.  While implementing these
+        stateful decoders we still must adhere to some rules to not undermine
+        the purpose for implementing stateful decoders in the first place.
+        Other design decisions come from the way the BER encoding is devised.
+        These driving factors are all covered here in this document to reveal
+        the present design while justifing it and the decisions that forged
+        it.
+      </p>
+
+      <subsection name="TLV Nesting">
+        <p>
+          A constructed TLV tuple contains other TLV tuples nested within the
+          Value field, and so on, recursively.  The nesting depth is usually
+          indefinate and dependent on the data structures being transformed.
+          This fact drastically affects the way the decoder is designed and
+          hence operates.
+        </p>
+
+        <p>
+          It's always good form to tackle the basis case[s] of any recursive
+          problem.  The simplest basis case is the processing of a simple
+          primitive TLV tuple.  So for the time being presume we are simply
+          implementing a primitive BER TLV decoder without any nesting.
+        </p>
+
+        <p>
+          While decoding a primitive tuple the decoder must maintain state.
+          State for this simple decoder is an indicator representing whether
+          the Tag, Length, or Value was being processed at the end of the last
+          chunk of substrate data.  This is required along with any data
+          already captured for these fields to continue processing where the
+          last chunk left off.  Value field data actually is not "accumulated"
+          to maintain state, but for the time being we'll ignore that fact.
+        </p>
+
+        <p>
+          The primitive TLV tuple decoder is easily designed.  Build three
+          stateful sub-decoders for each field, the Tag, Length, and the Value
+          fields.  A top level decoder deligates the processing state of the
+          fields to each sub-decoder and switches sub-decoders when the
+          indicator changes from Tag, to Length, to Value then back to Tag
+          and so on.  When a Value completes processing and before another Tag
+          is read the decoder triggers a callback event.  Here's what the
+          hypothetical primitive TLV tuple decoder would look like:
+        </p>
+
+        <center>
+          <img src="../images/PrimitiveTupleDecoder-uml.gif"/>
+        </center>
+
+        <p>
+          Now let's try to figure out how to handle constructed TLV tuples
+          which recursively nest other tuples indefinately.  State now
+          is more that just where you left off in the current tuple being
+          processed.  When processing an inner tuple the decoder must also know
+          where it left off in the outter tuple to resume processing.  More
+          accurately the decoder must maintain the state of every parent and
+          ancestor tuple of the current tuple being processed in the same
+          manner the TLV tuple decoder did for primitive TLV tuples.  Hence
+          the state of the decoder is a stack of all ancestor TLV tuple states
+          as well as the state of the tuple currently being processed.
+        </p>
+
+        <p>
+          While processing input for a nested TLV tuple the state of all
+          tuple ancestors must also be updated with the same data so the
+          decoder can determine when their processing is about to complete.
+          This way the decoder does not read TLV tuples adjacent to the
+          constructed TLV tuple, incorrectly presuming that they are part
+          of the constructed TLV tuple.
+        </p>
+
+        <p>
+          When the last inner tuple in a constructed TLV tuple completes, it
+          triggers a callback for itself, then the stack is popped and another
+          callback event is triggered for the completion of the constructed
+          TLV tuple.
+        </p>
+
+        <p>
+          In conclusion, the state of a BER decoder, used to process both
+          primitive and constructed TLV tuples, must take into accout the
+          the processing state of every tuple ancestor in the stack of nested
+          tuples.  Otherwise state cannot be maintain.  Just how this is
+          efficiently managed is the topic of the next few subsections.
+        </p>
+      </subsection>
+
+      <subsection name="Value Processing">
+        <p>
+          If the decoder accumulates encountered Value field octets to maintain
+          state then we have a problem.  First off the size of the Value could
+          be massive and often varies.  We want to maintain a fixed maximum
+          memory footprint to the decoder.  This goes out the window if Value
+          field content is accumulated within buffers to maintain tuple
+          processing state.  Furthermore, with nesting, every ancestor tuple in
+          the nesting stack would maintain a copy of the topmost tuple's Value
+          field when that tuple is about to complete processing.  The number of
+          copies is a function of the nesting depth, so the deeper the nesting,
+          the more memory is wastefully consumed.  This is totally unacceptable
+          and it undermines the reason for devising stateful codecs in the
+          first place.
+        </p>
+
+        <p>
+          To avoid this problem we must not accumulate Value field octets
+          (bytes) while maintaining state.  Unlike the Value field, the other
+          Tag and Length fields are limited and often account for only a few
+          bytes within TLV tuples.  To maintain state however the decoder still
+          has to perform some accounting to determine when outter tuples in the
+          nesting stack complete processing.  The decoder maintains state by
+          using Value byte counters processed rather than using an accumulator
+          to store the Value bytes.  This way the decoder can compare a tuple's
+          Length field with it's Value byte counter to determine if processing
+          is complete.
+        </p>
+      </subsection>
+
+      <subsection name="Extending DecoderCallback">
+        <p>
+          The next question is how the decoder propagates TLV tuple Values to
+          the target receiving the TLV tuple stream?  If the standard
+          <code>DecoderCallback.decodeOccurred()</code> method is designed to
+          be called upon TLV processing completion how do we avoid collecting
+          the Value while getting the Value bytes somehow to an decoder user
+          via callbacks?
+        </p>
+
+        <p>
+          The answer is to use yet another callback.  The DecoderCallback
+          interface is extended by actually adding three extra callback
+          methods: one for each field.  The BERDecoderCallback interface
+          extends the DecoderCallback interface and adds the following
+          methods:
+        </p>
+
+        <ul>
+          <li>void tagDecoded( Tuple tlv )</li>
+          <li>void lengthDecoded( Tuple tlv )</li>
+          <li>void partialValueDecoded( Tuple tlv )</li>
+        </ul>
+
+        <p>
+          The following diagram shows the decoder interfaces and a do nothing
+          adapter implementation for convenience:
+        </p>
+
+        <center>
+          <img src="../images/BERDecoderCallback.gif"/>
+        </center>
+
+        <p>
+          For a single TLV all methods except for the partialValueDecoded()
+          method is invoked at most once.  As its name suggests peices of
+          Value are delivered encapsulated by the Tuple argument rather than
+          the entire Value field.  Hence the method can be called zero or more
+          times while processing a TLV.
+        </p>
+
+        <p>
+          The extended decoder callback interface allows the decoder to chunk
+          Value fields and hence maintain a fixed maximum footprint.  The
+          partialValueDecoded callback is a bit misleading however.  It really
+          does not decode any Value bytes based on the Tag of the TLV.  It
+          simply hands off the raw value bytes to the callback, this part of
+          the decode is left to higher level decoders built on top of the
+          BERDecoder.  However all primitive type decode operations are
+          provided by the BER codec.
+        </p>
+      </subsection>
+
+      <subsection name="Constructed Values">
+        <p>
+          The values of constructed TLV tuples are other tuples.  Their Values
+          are already decoded by the BER decoder which triggers TLV events for
+          the nested TLV tuples.  Calls to the partialValueDecoded() method
+          hence are never made.  Furthermore the decoder transits from the
+          Length state of processing to the Tag state just after completing
+          the decode of a constructed tuple Length field.  This is because the
+          next tuple to process is a nested tuple with its Tag following the
+          constructed tuple's Length field.
+        </p>
+
+        <p>
+          Constructed TLV tuples never have partialValueDecoded() called.  Only
+          primitive TLV tuples have Value octets delivered to this callback.
+          This makes state handling withing the decoder a bit tricky but the
+          complexity for the rewards wreaked is well worth it.
+        </p>
+      </subsection>
+    </section>
+
+    <section name="State Management">
+      <p>
+        There are two parts to managing state for the BERDecoder: stack
+        based ancestor Value accounting state, and current tuple processing
+        state.
+      </p>
+
+      <subsection name="Current TLV State Management">
+        <p>
+          While processing the tuple in scope a type safe enumeration is used
+          to track the current tuple processing state which could be in either
+          Tag, Length, or Value processing states.  Subordinate decoders are
+          used to decode the Tag, and Length fields.  The Value field unlike
+          these fields does not have a corresponding decoder: it does not need
+          one since primitives TLV Values are not decoded but returned raw.
+          The sub-decoders for Tag and Length manage the accumulation of field
+          bytes between chunked decode operations.  The following diagram
+          displays the helper classes used to manage the current TLV processing
+          state:
+        </p>
+
+        <center>
+          <img src="../images/state-helper-classes.gif"/>
+        </center>
+
+        <table>
+          <tr><th>Color</th><th>Group</th></tr>
+          <tr><td>Red</td><td>Generic Helper Classes</td></tr>
+          <tr><td>Yellow</td><td>Tag Specific Classes</td></tr>
+          <tr><td>Purple</td><td>Length Specific Classes</td></tr>
+        </table>
+
+        <p>
+          The tag specific classes include the Tag and TagDecoder classes.
+          The Tag class handles the extraction of various Tag embedded fields
+          like the constructed bit and the tag's type class.  It also collects
+          tag octets up to 4 octets only using a special TagOctetCollector.
+          The TagDecoder is just a stateful decoder implementation wrapper
+          using Tag methods.
+        </p>
+
+        <p>
+          The TypeClass class is a type safe enumeration of the four type
+          classes of Tags:
+        </p>
+
+        <ul>
+          <li>UNIVERSAL</li>
+          <li>APPLICATION</li>
+          <li>CONTEXT_SPECIFIC</li>
+          <li>PRIVATE</li>
+        </ul>
+
+        <p>
+          Once the Tag accumulator collects all tag octets it determines and
+          sets the TypeClass corresponding to the tag.
+        </p>
+
+        <p>
+          The TagEnum class is an abstract base class for type safe tag
+          enumerations.  This special type safe enumeration associates a tag
+          label with two integers: the tag value and the tag id.  The tag value
+          is an integer representation of the tag whereas the id is just the
+          just the id field of the tag.  This btw is the main reason why the
+          TagCollector only accepts four bytes for building the tag: an integer
+          is essentially used as the backing store for the tag data.  The
+          reasons for this are explained within the tag handling section to
+          follow.
+        </p>
+
+        <p>
+          The Length and LengthDecoder operate very much in the same fashion
+          as do the Tag and TagDecoder.  The same pattern is applied to both
+          pairs of classes.  The primary difference is the use of a ByteBuffer
+          within the Length class rather than a custom data structure like the
+          TagOctetCollector to accumulate Length bytes (octets).  The main
+          reason for this is that a limit of 4 tag octets have been imposed on
+          the decoder which in fact is contrary to the BER specification.
+          Length values well above the 4 byte integer are surely possible for
+          TLV values although improbable.
+        </p>
+
+        <p>
+          The BERDecoderState class is another type safe enumeration with the
+          following values: TAG, LENGTH and VALUE.  It obviously represents the
+          processing state of the TLV tuple currently in scope.
+        </p>
+      </subsection>
+
+      <subsection name="Stack State Management">
+        <p>
+          The BERDecoder UML is displayed below to show some of the memebers
+          and operations available.  Pay special attention to the tlvStack
+          member and the getTupleStack() package friendly Stack accessor used
+          for stack state management:
+        </p>
+
+        <center>
+          <img src="../images/BERDecoder.gif"/>
+        </center>
+
+        <p>
+          The tlvStack is a Stack of Tuple instances.  The last subsection
+          contains a UML diagram with the Tuple class.  Tuple objects are the
+          objects handed to the the decodeOccurred() method of the callback.
+          They basically encapsulate a bunch of information associated with a
+          TLV tuple in one object.  This includes accounting information used
+          to determine the processing state of constructed TLVs.  The Stack of
+          Tuples hence stores the state information associated with ancestor
+          Tuples currently out of scope.
+        </p>
+
+        <p>
+          With every chunk of substrate processed for the tuple currently in
+          scope, the accounting information in every Tuple of the Stack is
+          updated.  Again, this tracks how much of the anscestor's Value field
+          has been processed.  Specifically the length and index fields of
+          Tuple objects are used to determine how much of the TLV has been
+          read.
+        </p>
+      </subsection>
+    </section>
+
+    <section name="Tuple Recycling">
+
+      <subsection name="TLV Tuple Density">
+        <p>
+          BER TLV streams will contain varying densities of TLV tuples.  The
+          density of the tuples depends on the nature of the content.  Streams
+          with many small primitive types crammed together will generate TLV
+          tuples very rapidly while processing the encoded substrate.  Every
+          few, even couple bytes might produce a new tuple.
+        </p>
+
+        <p>
+          If we instantiated a new Tuple instance and populated it for every
+          few bytes in the stream, then performance will degrade significantly
+          while processing streams with high TLV tuple densities.  Futhermore
+          rapid object creation rates would seriously tax the garbage
+          collector.  To avoid these negative effects of instantiating new TLV
+          tuples we need to reuse the same Tuple allowing interested parties
+          to either clone or copy the contained information while processing
+          the tuple.  More often than not, most tuples will be ignored.  It
+          would be wasteful to create a new Tuple object for every TLV tuple
+          encountered when some or most might potentially be ignored.
+        </p>
+      </subsection>
+
+      <subsection name="Problem With Recycling a Tuple">
+        <p>
+          If we avoid instantiating new TLV Tuples and resuse the same Tuple
+          object, we run into a problem.  First we'll loose data when we
+          attempt to push the tuple onto the tlvStack when the next TLV is
+          processed.
+        </p>
+
+        <p>
+          One solution to this problem is to clone constructed Tuples before
+          pushing the tuple onto the tlvStack.  Hence only primitives would
+          reuse the same Tuple.  This works well because primitive tuple data
+          does not need to be maintained past its scope.  If the data needs to
+          be copied, it can be copied by the application using the decoder.
+          This makes sense since the application determines which Tuple values
+          to store or ignore.
+        </p>
+
+        <p>
+          This solves performance bottlenecks with substrates that are dense
+          with primitive tuples.  However the problem will re-emerge if the
+          substrate is dense with deeply nested primitive tuples.  If every
+          primitive is embedded deep within its own cavern of nested TLV
+          tuples then we'll be closer to instantiating a Tuple object for
+          almost every TLV encountered.  The perfect substrate under this
+          scheme, of course, would be a single primitive element but beyond
+          that it would be flat nesting patterns where as many primitives TLV
+          tuples are stuffed into every contstructed TLV tuple as possible.
+        </p>
+
+        <p>
+          The deeply embedded high density of constructed TLV tuples is highly
+          unlikely although possible for recursive ASN.1 definitions.
+          Regardless of these situations producing a high density of
+          constructed TLV tuples, the nesting structures will often share the
+          same parents so the TLV tuple to Tuple object instantiation ration
+          would rarely approach 1:1.
+        </p>
+
+        <p>
+          Over all we cannot determine the ratio of constructed to primitive
+          TLV tuples encountered within a substrate.  However one would like
+          to believe that complex structures do not predominate, and that
+          protocol designers opt for simpler structures whenever possible.
+          With this sensible hypothesis reuse of primitive TLV tuples and the
+          cloning of constructed TLV tuples seems like a viable strategy for
+          managing excessive object instantiations.
+        </p>
+      </subsection>
+
+    </section>
+
+    <section name="Integer Representation For Tags">
+      <p>
+        According to the BER encoding specification, X.690, a Tag id can be
+        any arbitrary value: there is no limitation to the size of an id.
+        In practice ids are claimed incrementally by ASN.1 modules from the
+        CONTEXT_SPECIFIC and APPLICATION type classes.  These values for
+        any reasonable protocol are far less than 100 ids.  Experts like
+        Larmouth claim they have never seen Tag ids larger than a thousand.
+        So we don't bother representing Tags within a buffer for the full
+        expressivity of the specification when we know of reasonable soft
+        limits to the Tag id.
+      </p>
+
+      <subsection name="Four Tag Octets Are Enough">
+        <p>
+          In most cases, one or two octets suffice for encoding a tag and its
+          identifier.  In some cases three bytes may rarely be used.  It's
+          highly improbable that we'll ever see four or more bytes to be used
+          to encode a tag: even the experts have never seen this before.
+          The only way I can conceive of this is if computers begin devising
+          or generating protocols :-).
+        </p>
+
+        <p>
+          According to the specification the long form can encode the
+          following maximum identifier sizes with various octet lengths:
+        </p>
+
+        <table>
+          <tr>
+            <th>Octets</th>
+            <th>Maximum Tag Id</th>
+            <th>Calculation</th>
+          </tr>
+
+          <tr>
+            <td>1</td>
+            <td>30</td>
+            <td>2^5-1</td>
+          </tr>
+
+          <tr>
+            <td>2</td>
+            <td>127</td>
+            <td>2^7-1</td>
+          </tr>
+
+          <tr>
+            <td>3</td>
+            <td>16,383</td>
+            <td>2^14-1</td>
+          </tr>
+
+          <tr>
+            <td>4</td>
+            <td>2,097,151</td>
+            <td>2^21-1</td>
+          </tr>
+
+          <tr>
+            <td>5</td>
+            <td>268,435,455</td>
+            <td>2^28-1</td>
+          </tr>
+        </table>
+
+        <p>
+          As we can see 3-4 octets encode a maximum tag id we can live with.
+          One might expect the max tag id for say 4 octets would be 2^(4*8)-1
+          but its not.  We loose some bits, to be able to encode a variable
+          length tag with the long form.  In the long form all the bits from
+          the first octet are wasted and a bit from each octet there after is
+          lost to be able to terminate the tag field.  Hence if we started out
+          with 4 bytes or 32 bits then we're actually using 32-8-3 or 21 of
+          the original bits for storing the value of an id.  This yeilds a max
+          id value of 2^21-1 for 32 bits or 4 octets.
+        </p>
+      </subsection>
+
+      <subsection name="Integer Encoding">
+        <p>
+          Tags are used to match for TLV tuples. Nothing matches faster than an
+          integer using a switch statement.  It makes sense to store and
+          manage raw Tag octets within the bytes of a primitive Java integer
+          rather than within a byte buffer.  This way switch statements can be
+          used to quickly match TLV tuples based on their integer encoding for
+          the first four tag bytes.  Furthermore the stub compiler can
+          prefabricate type safe enums whose values equal the integer encoding
+          of a tag's four octets.  Matching for TLV tuples by tag then is as
+          fast as it can get using this integer encoding.  This btw is the sole
+          reason why we have the abstract class, TagEnum, which extends
+          ValuedEnum.  It's a type safe enumeration for Tag octets encoded as
+          integers.
+        </p>
+
+        <p>
+          Encoding only the four octets of the raw tag, limits the maximum
+          value of the id that a TLV's tag field can represent to 2^21-1.
+          This was the reason for the discussion in the section above.  We
+          simply will not need an id bigger than this.  So we decided to
+          break with the specification and restrict the max value of the tag
+          id down to 2^21-1 rather than leave it unbounded.
+          This limitation allows us to represent the first four octets of the
+          tag field as an integer thereby speeding up TLV pattern matching
+          considerably.
+        </p>
+
+        <p>
+          The TagOctetCollector is specifically designed to accumulate the four
+          octets of the Tag.  It stores the first octet in the
+          most significant byte of the int, the second in the next most
+          significant and so on until the last of the four octets is stored
+          within the least significant byte of the integer.  The diagram below
+          shows just how 4 bytes are assembled into the integer:
+        </p>
+
+        <center>
+          <img src="../images/tag-integer-encoding.png"/>
+        </center>
+
+        <p>
+          Note that if their were only 3 tag octets collected, then the bits
+          for Octet 4 would all be zero: bits 0-7 in the integer would be
+          zeros.  Likewise if only one octet were used then bits 23-0 would
+          be zero'd out within the 32-bit integer.
+        </p>
+
+        <p>
+          The integer encoding for tags are not leveraged here at the level of
+          the BERDecoder.  At this low level the decoder does not care about
+          tags other than those in the UNIVERSAL type class reserved for
+          detecting TLV tuple termination sequences within the stream.  Later
+          within the BERDigester where Tag pattern matching is used to make
+          sense of these TLV tuple streams, the integer encoding and the
+          TagEnum are used heavily.  Rather than add more complexity to the
+          BERDecoder we stop here and build upon it by stacking another
+          decoder, the BERDigester on top.  The BERDecoder decodes encoded
+          substrate streams into TLV Tuples and announces their arrival by
+          callbacks which are recieved by the BERDigester.  It is then upto
+          the BERDigester to process these TLV tuples using decoding rules
+          triggered by tag nesting patterns.  How approapriate!  Data encoded
+          using Basic Encoding Rules is decoded using rules that process a TLV
+          tuple stream.  More information regarding the design of the
+          BERDigester can be found <a href="BERDigesterDesign.html">here</a>.
+        </p>
+      </subsection>
+    </section>
+  </body>
+</document>

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>BERDecoder Usage</title>
+  </properties>
+  <body>
+    <section name="Coming soon ... ">
+    </section>
+  </body>
+</document>

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,474 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>BER Digester Design</title>
+  </properties>
+  <body>
+    <section name="Introduction">
+
+      <subsection name="What is it?">
+        <p>
+          The digester is a high level decoder that builds object
+          containment trees from a stream of events it receives via callback
+          events.  These callbacks notify of the start and end of BER tuples
+          composed of a Tag, a Length and a Value within the underlying stream.
+          Rules, registered with Tag nesting patterns are used to build
+          containment trees using an object stack.  Rules are triggered when
+          Tag nesting patterns match the patterns registered with the rule.
+        </p>
+      </subsection>
+
+      <subsection name="Why is it called a digester?">
+        <p>
+          The idea for a higher level decoder that builds object containment
+          trees through the action of rules came when I was looking at the
+          Jakarta Commons (SAX event) Digester.  SAX events are very similar
+          to the TLV callback events that are emminated by the BERDecoder.
+          The Commons Digester consumed these events.  While processing these
+          events it tracked the nesting of XML elements and fired registered
+          rules when nesting patterns were matched.  These rules performed
+          operations on the Digester's stack: objects were instantiated and
+          pushed, or populated and then popped to build the containment tree.
+        </p>
+
+        <p>
+          The BERDigester applies the same pattern as does the Common's
+          Digester on a similar underlying event stream.  Although BER Tuple
+          events carry completely different peices of information, like SAX
+          event streams BER TLV event streams signal the nesting pattern of
+          underlying constructs within the stream.  This hence reviels the
+          structure of an ASN.1 data type or in the case of SAX the structure
+          of an XML document.  Both digesters consume events emminating from
+          some event source to build object containment trees based on the
+          nesting pattern revieled by those events.  The containment trees are
+          built through the action of rules triggered by matching nesting
+          patterns.  Just like the Commons Digester, the BERDigester uses
+          stacks to share objects across rule firings.
+        </p>
+
+        <p>
+          It's safe to say that there is such a thing as a digester pattern.
+          What better way exists to refer to the pattern and those peices of
+          software implementing it, than to just call them digesters?  I could
+          not think of any and that's why we call it the BERDigester.  The BER
+          simply distinguishes it from the Commons Digester so there is no
+          confusion. (Note: BERDigester may be renamed to TupleDigester)
+        </p>
+      </subsection>
+
+      <subsection name="Rules and Stacks Are Good">
+        <p>
+          Our ultimate goal is to build a stub compiler that generates
+          encoder/decoder pairs for the various ASN.1 encodings: BER, DER, XER,
+          and PER.  The compiler also generates the classes used by a digester
+          to build the containment tree of the ASN.1 data structures
+          transmitted.  Generation of the stubs is not that hard to do.
+          However building a decoder that assembles containment trees using
+          those stubs and an encoded stream is not easy to do.
+        </p>
+
+        <p>
+          Stacks help out by reducing the complexity of the problem.  A stack
+          can be used first of all to share objects across rule firings.
+          Secondly its a perfect data structure for tracking nesting patterns
+          or managing operations on object containment trees.  State and scope
+          are maintained easily using a stack.
+        </p>
+
+        <p>
+          If simple stack based instructions are devised to instantiate stub
+          objects, push them and pop them, along with primitives and other
+          classes, then the complexity involved with generating decoders
+          diminishes greatly.  Furthermore, if those instructions were
+          declaritive rules triggered based on nesting patterns then decoder
+          generation is well a cake walk.  All a decoder needs to do is
+          register rules based on tag nesting patterns associated with the
+          ASN.1 modules with a BERDigester.  A generic set of rules may be
+          created or rules can be generated to be registered.  The generated
+          decoder would hence be the setup of the BERDigester by registering
+          rules with it for tag patterns.  Code would also be added to funnel
+          data into the underlying BERDecoder as well as code to respond to
+          the callbacks generated by the top level PDU decoder.
+        </p>
+      </subsection>
+    </section>
+
+    <section name="BERDigester Design">
+      <subsection name="Nesting Stack">
+        <p>
+          The design documentation for the BERDecoder discusses an encoding
+          for Tags using the first four octets of a Tag field.  These four
+          octets are stored within a primitive Java int.  These ints represent
+          the raw Tag of a TLV tuple.  We could push these raw Tag ints onto
+          an stack: let's for the time being call this primitive Java int stack
+          the tagStack.  Every time a new TLV Tuple is delivered, the digester
+          pushes the Tag int onto the tagStack.  When processing of the TLV
+          Tuple is finished, the tagStack is popped removing the Tag int for
+          the Tuple.  The sequence of Tag ints within the tagStack represent
+          the reverse nesting order of TLV Tuples.  Rule's are registered with
+          the digester using int[]s for patterns, these int[]s are compared
+          against the sequence of ints within the tagStack, hence the tag
+          nesting pattern.  All rules registered with a matching pattern of
+          Tag ints are fired in the order they were registered.
+        </p>
+        <p>
+	        This sounds pretty simple and straight forward at first glance but
+          there are complications.  The BER encoding, provides for two ways in
+          which string types can be encoded: as a primitive TLV or as a
+          constructed TLV.  By string types, we refer to all character based
+          string types like IA5String as well as bit strings and octet strings.
+          String types can be encoded using a single TLV as a primitive type.
+          This way the entire string is kept within the value field.  The
+          designers of BER were smart enough to realize that value chunking
+          would need to occur for large strings to enable efficient decoding
+          and encoding.  Hence these string types can optionally be encoded as
+          constructed TLV tuples with smaller nested tuples carrying fragments
+          of the string.  This way encoders would not need to calculate the
+          length of a large strings in advance or keep the entire string in
+          memory.  The value can be fragmented and chucked explicitly by the
+          encoding.  This is great, however the Tag int for a primitive TLV of
+          a string type will not equal the Tag int for the constructed TLV of
+          the same string type.  Pattern matching cannot occur under these
+          circumstances.
+        </p>
+        <p>
+	        This complication forced me to rethink the use of raw tag integers.
+          Instead of pushing the tag int as is onto the tagStack we could
+          always reduce the raw tag to some canonical form then push that int
+          instead.  Furthermore, all patterns supplied to register rules must
+          also have their ints scrubbed within the int[] pattern, so pattern
+          matching will occur using only the canonical representation of Tag
+          ints rather than actual Tag integers.
+        </p>
+        <p>
+	        The primitive/constructed flag exists on a Tag's first octet as a
+          single bit.  We want to prevent variations in this bit from throwing
+          off our pattern matching.  Choosing a canonical form is like picking
+          whether or not to lower case or to upper case letters before
+          attempting a comparison.  In our case :-), we have chosen the
+          canonical form to be the tag with this primitive/constructed bit
+          turned off.  So before any comparison occurs all tag integers will
+          be guaranteed to have this bit flag turned off.  In fact all TagEnum
+          int values prefabricated by the compiler will have the Tag int
+          scrubbed.  Meaning the primitive/constructed bit will be in the off
+          position and hence in the canonical form.
+        </p>
+      </subsection>
+
+      <subsection name="Pattern Matching">
+        <p>
+          To trigger rules the digester checks every new tag nesting pattern
+          to see if it matches any of the patterns registered with a rule.
+          Multiple patterns may be registered with a rule to allow it to fire
+          under different contexts.  With every in coming event which changes
+          the tag nesting pattern, the digester must see if the tag nesting
+          pattern matches any of the patterns for every registered rule.  This
+          could be an expensive operation to have to perform with every event
+          if the rule and pattern base is large.  Every registered rule pattern
+          would have to be tested for equality with the tag nesting pattern to
+          determine if the corresponding rule needs to fire.
+        </p>
+
+        <p>
+          To keep pattern matching overheads constant regardless of the number
+          of rules registered, a special data structure, a search tree called
+          the TagTree is used.  The tree is composed of TagNodes and is built
+          as rules and their patterns are registered.  The tree's tag nodes
+          simply correspond to a tag value, and have a list of rules.  Tag
+          nodes maintain a hash of children keyed by their tag value.  Here's
+          an example of a TagTree for the following rule pattern registrations:
+        </p>
+
+        <table>
+          <tr>
+            <th>Rule</th>
+            <th>Pattern</th>
+          </tr>
+
+          <tr>
+            <td>R0</td>
+            <td>[1,2,3,4,5]</td>
+          </tr>
+
+          <tr>
+            <td>R1</td>
+            <td>[1,2,5]</td>
+          </tr>
+
+          <tr>
+            <td>R2</td>
+            <td>[1,2,3]</td>
+          </tr>
+
+          <tr>
+            <td>R3</td>
+            <td>[4,4,5]</td>
+          </tr>
+
+          <tr>
+            <td>R0</td>
+            <td>[4,1,2,4]</td>
+          </tr>
+        </table>
+
+        <br></br>
+
+        <center>
+          <img src="../images/TagTree.png"/>
+        </center>
+
+        <p>
+          Whenever possible nodes in the tree are reused while adding new
+          patterns and rules to the tree.  Each new pattern added to the tree
+          corresponds to a new unique path within the tree.  Rule registration
+          is where the true cost of the tree's construction is incurred.  In
+          the solid state the tree is only used to rapidly search for the set
+          of rules to fire based on the current nesting pattern.  The search
+          occurs by walking the tree with the current nesting pattern.  For
+          example a walk using the [1,2,3] nesting pattern takes us from the
+          root to tag node 1, then 2 and finally tag node 3.  Once the pattern
+          is exhausted, meaning all tag values are consumed in the walk, the
+          rules at the final node are returned as the matching set that need to
+          be fired.  If there are no rules at that node, then the set is empty.
+          If the pattern "walks off the tree", where no child with a tag value
+          is available to continue the walk on the tree, the search ends
+          returning the empty set.
+        </p>
+
+        <p>
+          Without the tag tree every pattern registered with a rule would have
+          to be compared.  This is an O(nm) operation where n is the number of
+          pattern registrations and m is the size of the pattern.  With the tag
+          tree the search time is only O(m) where only the size of the pattern
+          determine the search time.  This is much more acceptable.
+        </p>
+
+        <p>
+          The only remaining problem at this point is how to implement wild
+          cards with pattern matching.  First of all a special reserved tag
+          int value must be selected as the wild card ("*") value so it does
+          not conflict with any valid tags we might encounter.  I think any of
+          the UNIVERSAL tags above 32 can be used for this but I would have to
+          check.  Plus there are four kinds of wildcard use cases which are
+          possible:
+        </p>
+
+        <ol>
+          <li>wildcard at the front of the pattern</li>
+          <li>wildcard at the end of the pattern</li>
+          <li>wildcard in the middle of the pattern</li>
+          <li>two or more wildcards in a pattern</li>
+        </ol>
+
+        <p>
+          There is no way we would accomodate all these use cases.  First off
+          some of them will be rarely if ever used.  Secondly their
+          implementation would either be impossible or too difficult to
+          implement for the return in expressivity.  At this point I plan to
+          implement any of these use cases on an as needed basis.  However
+          the two noteworthy use cases that stick out are use case #1 and #2.
+          #2 is useful for rule used to build ASN.1 strings that are broken
+          apart using the constructed BER encoding.  #1 will be useful as well.
+          Sometimes a pattern may occur multiple times in different contexts
+          and you want to detect them all using a single rule.  In this case,
+          a pattern with a wildcard at the front (case #1) does the job.
+          Furthermore recursive ASN.1 definitions as those found with the
+          LDAP SearchRequest, will require wildcard use case #1.
+          Use cases #3 and #4 are very difficult to implement not to mention
+          expensive in terms of processing.  They just don't seem worth it.
+        </p>
+      </subsection>
+
+      <subsection name="Pattern Matching with Wild Cards">
+        <p>
+          The LDAP SearchRequest requires the use case where the wild card is
+          in the middle of the rule matching pattern because of a recursive
+          ASN.1 definition for Filter expressions.  This means the nesting could
+          be arbitrarily deep so long as the end sequence of tag nesting
+          matches the tail of the pattern and the start sequences matches the
+          front.  The use of a wild card at the front of the pattern would
+          also satisfy this use case although with less specificity.  However
+          it would be easier to implement so we're going to shoot for wild
+          cards in the front of a pattern first which corresponds to use case
+          #1 above.
+        </p>
+
+        <p>
+          Matching for patterns with wild cards in front requires the use of
+          another searchable TagTree similar to the one used for patterns
+          without wild cards.  This new TagTree is built using the reverse
+          sequence of the wild card pattern's tail.  So if a rule is added using
+          the wild card pattern, [*,6,3,9], a branch is created in an empty
+          TagTree in the order 9-3-6.  Besides just adding the reverse
+          branch the tree must be searched for any existing branches that
+          end with 9, 3 and 6.  So if there was a branch 9-3-6-4 based on the
+          pattern [*,4,6,3,9] we would add the rule to node 4 in this branch
+          as well as to node 6 in branch 9-3-6 created by pattern [6,3,9].
+          The fact that node 9-3-6-4 contains the rule can be inferred from
+          it being a child of 9-3-6 while conducting a search to return both
+          rules.  The addition of the rule to both nodes 6 and 4 is a bit
+          redundant, however this is done during rule addition, so we do not
+          have to compute this and collect extra rules while walking the tree
+          to find all matching rules.  We want to follow a strategy where the
+          amount of object creation, and computation is minimal while search.
+          What ever needs to be calculated to avoid the penalty during a search
+          we do during rule pattern registration.  This strategy is followed
+          to conduct searches for matching rules as fast as possible requiring
+          the minimum amount of work.  Here's an example of what a tree with
+          wild cards might look like when registrations with the following
+          patterns and rules are applied:
+        </p>
+
+        <table>
+          <tr>
+            <th>Rule</th>
+            <th>Wild Carded Pattern</th>
+          </tr>
+
+          <tr>
+            <td>R0</td>
+            <td>[*,6,3,9]</td>
+          </tr>
+
+          <tr>
+            <td>R1</td>
+            <td>[*,4,3,9]</td>
+          </tr>
+
+          <tr>
+            <td>R2</td>
+            <td>[*,1,2,5]</td>
+          </tr>
+
+          <tr>
+            <td>R3</td>
+            <td>[*,1,2]</td>
+          </tr>
+
+          <tr>
+            <td>R4</td>
+            <td>[*,3,1,2]</td>
+          </tr>
+
+        </table>
+
+        <br></br>
+
+        <center>
+          <img src="../images/WildTagTree.png"/>
+        </center>
+
+        <p>
+          Furthermore rules added via wild cards are also added to all nodes
+          satisfying the pattern in the primary tag tree used for patterns
+          without wild card patterns.  This is done again to avoid having to
+          gather rules while conducting the search.  It also avoids having to
+          instantiate a List for every event if we are not gathering rules but
+          just returning an existing list from one of the trees.  By adding the
+          rule with the wild card to the tree of patterns without wild cards
+          any node that is selected already accounts for firing rules
+          registered using patterns with wild cards.  Hence the other tree with
+          wild cards does not need to be searched.  Again this is somewhat
+          redundant to do but it allows the return of the List at a single node
+          without having to gather rules into a newly instantiated List.  This
+          all makes more sense when we look at the modified matching algorithm
+          used:
+        </p>
+
+        <ol>
+          <li>walk TagTree without wild cards with nesting pattern</li>
+          <li>if we find a TagNode for the pattern return rules and be done</li>
+          <li>if no match, take the reverse pattern and walk wild tag tree</li>
+          <li>last node before falling off of tree is the matching node</li>
+          <li>if no node matched before walking off then return empty set</li>
+          <li>otherwise return the rules in this last node</li>
+        </ol>
+
+        <p>
+          The way we match is different for both TagTrees.  In the first we're
+          walking the tree using the pattern and if we fall off then we return
+          the empty set.  In the second tree with the wild cards, called the
+          WildTagTree, we walk the tree using the reversed pattern tail without
+          the wild card.  We also consider our selves matching if we can start
+          a walk at all into some node.  Walking off of the WildTagTree selects
+          the last node we were on as the matching node.  This nodes rules need
+          to be fired.  Furthermore in the WildTagTree search we return the
+          empty set only if we cannot even begin a search.
+        </p>
+
+        <p>
+          Also note that since rules for wild card patterns are added to
+          the matching nodes of the regular TagTree, a match in the first
+          TagTree shorts the search into the second WildTagTree.  Another walk
+          in the WildTagTree is not necessary.  However, if the walk on the
+          first TagTree falls off, then a search on the WildTagTree is
+          required.  By taking this approach we're either returning the list of
+          rules for a TagTree node or returning the list of rules for a
+          WildTagTree node.  We never need to create a new list to collect
+          rules in it this way.  The final result set equals the list of rules
+          for some node in any of the two trees.
+        </p>
+      </subsection>
+
+      <subsection name="Composition Stacks">
+        <p>
+	        When rules fire they do something either to a stack or to an object
+          on a stack.  Either way rules pop, push or peek onto digester stacks
+          to do their bidding.  These stacks are used to build the final
+          product  which is a containment tree and hence are referred to as
+          composition stacks.
+        </p>
+        <p>
+	        The digester contains several stacks.  It contains a primitive Java
+          type stack for every Java primitive type.  There is a stack for
+          booleans, bytes, chars, shorts, ints, longs, floats, and doubles.
+          In addition to these stacks there is another Object stack.  Multiple
+          primitive stacks were used to avoid having to wrap primitive types
+          in their object wrappers just to push them onto an Object stack.
+          Some might think this is excessive, and others may think it
+          increases complexity.  If the use of a single Object stack is
+          thought through then it becomes apparent that neither is the case.
+        </p>
+        <p>
+          If a single Object stack were to be used then rules dealing with
+          primitive types would need to wrap them with their respective Object
+          wrappers before pushing them onto the stack.  When popping Objects
+          from the stack rules must know what type the popped object is
+          expected to be.  If multiple stacks were used it would be much
+          easier to just get the primitive from another stack rather than
+          popping the Object stack and casting the popped Object to the Object
+          wrapper just to extract the primitive value.  Having a primitive
+          stack for each Java primitive type is a PITA for the developer, but
+          it simplifies things for the user.  Plus theres the advantage of not
+          having to wrap and extract primitive types just to pass them around
+          in the stack.
+        </p>
+      </subsection>
+
+      <subsection name="Decoder Chaining/Stacking">
+        <p>
+          If one looks closely the digester contains a BERDecoder member.  This
+          decoder is the source of all TLV Tuple events.  The digester is
+          designed as a decoder itself that feeds incoming byte buffers to this
+          underlying BERDecoder.  An inner class, DigesterCallback, is used to
+          implement a BERDecoderCallback.  This inner class bridges BERDecoder
+          TLV Tuple events delivering them to the digester.  The digester in
+          turn updates its tag nesting stack, and processes the new TLV
+          announced with the event.  The digester detects the end of a
+          Protocol Data Unit (PDU) when it pops the last TLV tuple off of the
+          tagStack.  When this happens the constructed object which should
+          have been popped off of the Object stack is returned using a
+          callback as the root of the containment tree.
+        </p>
+
+        <p>
+          The digester is a great example of how decoders can be chained or
+          stacked together in almost a linear combination.  It's like stacking
+          filters on top of one another.
+        </p>
+      </subsection>
+
+    </section>
+  </body>
+</document>

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,183 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>BEREncoder Design</title>
+  </properties>
+  <body>
+    <section name="Planning">
+      <p>
+        Trying to figure out how to keep encoder design simple, symetric and
+        extensible.  At this point this section is really just a brain dump.
+      </p>
+
+      <subsection name="Layering Encoders">
+        <p>
+          It might be a good idea to separate encoder functionality into
+          separate encoder layers.  This way we can isolation operations to
+          divide the responsibilities of each encoder keeping each encoder
+          simple.  We are talking about stacking multiple encoders on top of
+          each other.
+        </p>
+
+        <p>
+          The most primitive teir could be an encoder concerned with writing
+          chunks out to a channel.  On top of that may reside another encoder
+          that converts TLV Tuple Tag, Length and Value events into a stream
+          of chunked buffers.  High level encoders can be designed to take
+          stub instances as and input to produce a stream of TLV tuple events.
+          These events are then pipped into the low level Tuple event encoder,
+          then written out as chunks to a channel.  We really will not be
+          concerned with where the chunks go the event producer consumer
+          pipeline is the primary concern for us.
+        </p>
+
+        <p>
+          There are several benefits to this approach.  Here's a few of them:
+        </p>
+
+        <ul>
+          <li>
+            low level tuple encoder is very simple
+          </li>
+          <li>
+            event streams generated by low level decoders can be piped
+            directly into low level encoders for round trip testing
+          </li>
+          <li>
+            low level tuple encoder can be very efficient and fast because of
+            its simplicity
+          </li>
+          <li>
+            the extra layer of encoding allows for the injection of other
+            facilities like tlv transformation or validation for DER
+          </li>
+          <li>
+            a lower primitive TLV event handling layer simplifies the design
+            and implementation of higher level encoders concerned with the
+            encoding of composite data types
+          </li>
+          <li>
+            the low level encoder isolates buffering and chunking related
+            aspects where other stacked encoders built upon it do not have to
+            be concerned with this
+          </li>
+          <li>
+            Stub specific code does not mix with generic tuple processing code.
+          </li>
+        </ul>
+
+        <p>
+          This low level TLV Tuple event stream encoder will need to receive
+          TLV events somehow.  The encoder hence smust implement some kind of
+          callback.  The fact that it is a callback makes it receive the
+          substrate this way rather than through the conventional encode()
+          pathway.  What then happens to the idea of an encode() method that
+          maintains symetry with the decoder's decode() method?
+        </p>
+
+        <p>
+          The encode() method may still need to be used to turn the substrate
+          (TLV tuples) into bytes in a buffer.  This may be called externally
+          as well as by the callback methods implemented by the encoder.  The
+          callback methods might use encode() after or before performing some
+          house keeping operations.  There are however wholes left wide open
+          with this approach which could interfer with tuple nesting.  Basically
+          we will have two pathways for TLV delivery.  First through the general
+          encode() method second through the TLV event delivery methods.  Which
+          one is authoritative?  Plus the TLV event methods are geared to handle
+          TLV nesting.  Receiving a Tuple via the encode() method could
+          interfere with nesting and proper Tuple serialization into a buffer.
+        </p>
+
+        <p>
+          Perhaps to deal with these pitfalls we need to devise some standards
+          around usage.  The encode() method could be implemented to only allow
+          the encoding of primitive Tuples.  Still in there is the possibility
+          of interfering with nesting and encoding order as
+
+          Do we need all TLV events for the low level encoder?
+
+        </p>
+
+        <p>
+          For an exercise let's look at usage patterns from the top down.  At
+          the topmost level an encoder will be devised to multiplex many
+          specific encoders.  This top level encoder will map stub interfaces
+          to other stub specific encoders with knowledge of the ASN.1 data type
+          and the stub.  When calls are made to encode(Object) the topmost
+          encoder will use the stub interface of the Object to lookup an
+          appropriate stub specific encoder.  This topmost encoder hence
+          multiplexes other encoders based on the argument to encode.  The stub
+          specific encoder recieves the substrate as the argument to the
+          encode(Object) method.  It generates a series of TLV tuple events
+          using a special callback interface to convey the Tuple nesting
+          pattern.  The standard EncoderCallback is not sufficient for this.
+          The topmost multiplexing encoder must recieve the callback events
+          of the stub specific encoders and call its own callback as if it
+          generated each event.  The presence of the stub specific encoders is
+          transparent.  Below this system the low level encoder recieves the
+          TLV tuple events generated and serializes them within buffers emitting
+          buffers as encoded objects to its own callback.  Here's how this all
+          looks:
+        </p>
+      </subsection>
+    </section>
+
+    <section name="Determinate Length Encoder Design">
+      <subsection name="Problems and possible solutions">
+        <p>
+          Creating a determinate length encoder without sacrificing efficiency
+          is not easy.  Making the code easy to manage and read is yet another
+          more difficult matter.  This is already very difficult to manage with
+          the state machine approach we have taken.  Furthermore we have found
+          many clients and servers which reject the use of the indeterminate
+          form even though according to the spec BER allows for such variance in
+          encodings.
+        </p>
+
+        <p>
+          Efficiency is difficult to achieve because we need to know the lengths
+          of nested (TLV) nodes to build constructed nodes with a determinate
+          length encoding.  Since the topmost constructed TLV is the first out
+          the door we cannot transmit it until all nested TLV nodes have been
+          generated with their lengths already calculated.  A brut force
+          approach might produce a TLV tree first before serializing the output
+          to a stream.  This way all nested TLV's and up can have their length
+          fields computed depth first.  This means keeping the entire transfer
+          image in memory as well as the structures needed to manage a tree.
+          Although DoS attacks are not as much of a concern for the encoding
+          phase as they are for decoding in a server, the approach would still
+          result in very inefficient encode operations especially when large
+          PDUs are being transmitted either by a server or a client.  Plus there
+          is the fact that a PDU stub already exists with the same copy of the
+          information making it highly likely that more than 2 times the
+          transfer image will be required.
+        </p>
+
+        <p>
+          We must ask our selves if there is any way to avoid keeping the
+          entire transfer image in memory.  Alan and Alex have discussed the
+          use of referrals to large binary data rather than keeping the data
+          in memory during codec operation.  A referral would correspond to a
+          channel, or a stream to recall or store the data in question.  This
+          way large binary values can be streamed from or to disk.  Eventually
+          stubs will support these references although we do not have the
+          mechanism completely defined yet.  If the same referrence can be held
+          in the place of the V field of a TLV then we can avoid having more
+          than 2X the transfer image in memory.  This however will not be the
+          case when PDU's are tiny with small feilds well below the threshold
+          used to gauge when disk streaming is to occur.  This is still however
+          one means to keep the in memory foot print down when PDU's with large
+          fields are encoded.
+        </p>
+
+        <p>
+          
+        </p>
+      </subsection>
+
+
+    </section>
+  </body>
+</document>

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>BEREncoder Usage</title>
+  </properties>
+  <body>
+    <section name="Coming soon ... ">
+    </section>
+  </body>
+</document>

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,89 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>ASN.1 and BER Information</title>
+  </properties>
+  <body>
+
+    <section name="ASN.1 and BER Information">
+      <subsection name="Background">
+        <p>
+          The BER encoding for ASN.1 was defined within ITU specification
+          X.690 along with the Canonical and Distinguished Encoding Rules.
+          A copy of this document along with other useful documents and books
+          on ASN.1 and its encodings can be obtained for free here:
+        </p>
+      </subsection>
+
+      <subsection>
+        <table>
+        <tr><th>Document</th><th>Description</th></tr>
+        <tr>
+          <td>
+            <a href="http://lesbeshy.notlong.com">X.690 (07/02)</a>
+          </td>
+          <td>
+            Information technology - ASN.1 encoding rules: Specification of
+            Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and
+            Distinguished Encoding Rules (DER)
+          </td>
+        </tr>
+
+        <tr>
+          <td>
+            <a href="http://offburie.notlong.com">X.680 (07/02)</a>
+          </td>
+          <td>
+            Information technology - Abstract Syntax Notation One (ASN.1):
+            Specification of basic notation
+          </td>
+        </tr>
+
+        <tr>
+          <td>
+            <a href="http://www.oss.com/asn1/bookreg.html">ASN.1 Complete</a>
+          </td>
+          <td>
+            A verbose yet truely complete book on ASN.1 and various encoding
+            mechanisms.  Easy to read since the author takes almost a
+            conversational tone.
+          </td>
+        </tr>
+
+        <tr>
+          <td>
+            <a href="http://www.oss.com/asn1/bookreg2.html">
+              ASN.1 - Communication between heterogeneous systems</a>
+          </td>
+          <td>
+            Also a very complete book on ASN.1 and various encoding mechanisms.
+            A little more difficult to read but seems to be much better
+            organized and more exacting.  I use both books in conjunction
+            often switching between the two based on my mood :-).  Both are
+            most excellent - thanks to both authors for graciously providing
+            their books online.
+          </td>
+        </tr>
+        </table>
+      </subsection>
+
+      <subsection name="BER Tag, Value, Length Tuples">
+        <p>
+          BER stands for Basic Encoding Rules.  These rules describe how to
+          encode and decode basic data types and composite data structures
+          to and from TLV streams.  A TLV hence may be primitive (atomic) or
+          constructed (nested) where the value component contains other TLVs.
+          The T is for the Tag a numeric type identifier, the L is for the
+          length of the data carried in the third V component, the value.
+          Outside of this very trivial introduction there is very little to
+          the encoding.  Readers should look at the relatively short
+          specification for referrence regarding the exact encoding for
+          various data types using TLV tuples.  The books above also have
+          appendices for the various encodings which are longer than the
+          actual specification yet more explanitory.
+        </p>
+      </subsection>
+    </section>
+  </body>
+</document>
\ No newline at end of file

Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,168 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+  <properties>
+    <author email="akarasulu@apache.org">Alex Karasulu</author>
+    <title>Design Documentation</title>
+  </properties>
+  <body>
+    <section name="Introduction">
+      <p>
+        We're designing the decoder peice using the stateful decoder interfaces
+        that were carved up for the commons-codec API within their stateful 
+        package.  The stateful decoder is designed to maintain state while 
+        processing arriving chunks of TLV data.
+      </p>
+      
+      <p>
+        There are several issues confronted by the decoder while decoding a 
+        stream of TLV trees where TLV tuples are nested within each other.  In
+        general this decoder is viewed as a simple line parser for TLV tuples
+        notifying of their arrival via callbacks.
+      </p>
+      
+      <p>
+        The BER encoder and decoder (codec) will be designed to operate very
+        much like the Simple API for XML (SAX).  It will generate events which
+        are calls on a handler as it encounters low level encoded structures
+        called Tag Length Value (TLV) tuples.  
+      </p>
+      <p>
+        Rather than return values, which could be extremely large, in one peice
+        the decoder for example returns peices of a value until it completes 
+        processing the entire V of the TLV.  This makes the decoder highly 
+        attractive for servers using non-blocking IO and SocketChannels.  This
+        design gives it a small decoding footprint regardless of the size of 
+        the Protocol Data Unit (PDU) being processed.  It also makes it much
+        faster since the decoder deals with a small simple task without much
+        conditional logic for processing a PDU.  We hope the combined benefits
+        of non-blocking IO and this sleek codec, make any BER based protocol
+        server extremely responsive under heavy loads with massive concurrency.
+      </p>
+    </section>
+    
+    <section name="Requirements">
+      <p>
+        The decoder must be fast, have a fixed memory footprint, and be simple.
+        It should perform only one task: notifying content handlers via 
+        callbacks of the arrival of TLV tuples.  While doing so it must maintain
+        state in between calls to decode a chunk of arriving BER encoded data.
+      </p>
+      
+      <p>
+        It should not try to interpret the content of the TLV tuples.  These 
+        aspects are left to be handled by higher level content based facilities
+        that build on top of the BERDecoder.  These higher facilities provide 
+        their own callbacks to build on TLV events.  The SnickersDecoder which
+        transforms ASN.1 BER TLV Tuples into messages uses the BERDecoder in 
+        this way to give meaning to the arriving content.
+      </p>
+    </section>
+    
+    <section name="Object Reuse and Using Primitive Types">
+      <p>
+        The density of TLV tuples encountered during decoder operation will 
+        vary based on message characteristics.  One of the most involved aspects
+        to the decoder is to spit out TLV tuples when emitting events.
+      </p>
+      
+      <p>
+        We could just instantiate a new TLV tuple object for every tuple but
+        this would slow the decoder down and increase the memory footprint 
+        making it less efficient.  For this reason we decided to reuse the same
+        TLV tuple to deliver TLV data via notification events on the callback.
+        The callback implementation must copy the tuple's data if it intends to
+        save the TLV for use later.  Otherwise the decoder will overwrite the 
+        TLV members on the next event.  We leave the option to copy the TLV 
+        upto the higher level facility that way only those TLV tuples of 
+        interest, known only to the content specific handler, can be copied.
+        <b>Why waste space and time on events that will are not of interest?</b>
+      </p>
+       
+      <p>
+        The most complex part of the decoder deals with maintaining state while
+        decoding.  Data can arrive at any time to contain any part of a TLV or
+        multiple TLVs along with parts of others.  Often the fragmentation 
+        signature to the data along with its size will not be known.  
+        Furthermore the nesting of TLVs must be tracked while maintaining state.
+        A stack is used to track the nesting of TLV tuples within a TLV tree.
+      </p>
+      
+      <p>
+        We do not instantiate TLV tuple objects so pushing the one TLV instance 
+        we reuse is pointless.  We could use two approaches here to handle this
+        issue.  First we could just create a new instance only for those TLV
+        tuples that nest others and hence need to be pushed onto the stack.  Or
+        we can use multiple primitive stacks based on an int to store the set
+        of values contained in the tuple.  The second approach leads to greater
+        complexity while the first leads to some overhead in extra instantiation
+        time and memory which is negligable really.  Which approach is best 
+        depends on the number of members in the tuple or in otherwords the 
+        number of primitive int stacks used.
+      </p>
+      
+      <p>
+        We wrote a little test to figure out when one approach out performs the
+        other in the ObjectVersePrimitiveTest stress test.  From tinkering
+        with the parameters of the test case we found the use of primitives to
+        out perform tuple object instantiation when the number of member stacks
+        is less than or equal to 2.  If the number of stacks used is 3 or more
+        then instantiating a constructed TLV object and pushing it onto one 
+        stack is a better strategy.  In our case we have 3 peices of information
+        that need to be pushed and poped together so from this test data the 
+        choice is clear.  We clone the TLV tuple or instantiate a new one for 
+        constructed TLVs that are pushed onto a single stack.  This is faster 
+        and removes the need to manage multiple stacks making the code less 
+        complex.
+      </p>
+    </section>
+    
+    <section name=" To be continued ... ">
+      <p>
+        More to come soon ...
+      </p>
+    </section>
+
+<!-- Some extra material
+    <p>
+  The decoder must be aware of it's current state.  The following states are
+  possible in between TLV tuples:
+</p>  
+<ol>
+  <li>
+    Composing Tag - The decoder starts in this mode.  This is the mode for
+    collecting bytes for the Tag of the TLV in scope.  Bytes are copied into
+    a temporary Tag byte buffer until the Tag data is complete.  Once the 
+    bytes are complete and the Tag int is generated along with a couple of other
+    values (boolean isPrimitive and TypeClassEnum), the Tag byte buffer is
+    cleared and the state switches to the composing length state.  If the Tag
+    data is incomplete and not available until the next chunk of data arrives
+    via another decode call, we remain suspended in this state collecting the 
+    tag bytes.  There are no pushes onto any of the stacks during this state.
+  </li>
+  <li>
+    Composing Length - The decoder starts reading data into a Length byte 
+    buffer until the Length data is complete whether the length data is the 
+    short, long or indeterminate form.  Once the bytes for the length are arrive
+    and the Length int is generated, the Length byte buffer is cleared.  If
+    the Tag represents a primitive type the state switches to the composing 
+    value state.  If the TLV is constructed, the tag and the length are pushed 
+    onto their respective stacks.  The state then switches to the composing Tag
+    state.  If the length data is incomplete and not available until the next 
+    chunk of data arrives via another decode call, we remain suspended in this 
+    state.
+  </li>
+  <li>
+    Composing Value - The decoder is in this mode because the TLV is 
+    primitive.  In this state only two forms of length are valid: the short
+    and long lengths.  In either case we just read the number of bytes 
+    specified for the length during the length composing state.  If the value
+    data is incomplete and not avaiable until the next chunk of data arrives
+    via another decode call we remain suspended in this state.  We transit
+    to the composing Tag state from this state when the value data arrives.
+    Before transiting to the composing Tag state the stacks are popped and
+    the TLV is delivered as a completion event to the callback.
+  </li>
+</ol>
+  -->
+  </body>
+</document>
\ No newline at end of file