Posted to commits@directory.apache.org by ak...@apache.org on 2005/02/21 22:44:49 UTC
svn commit: r154725 [1/2] - in incubator/directory/asn1/branches/rewrite:
ber/xdocs/ codec/xdocs/ stub-compiler/xdocs/
Author: akarasulu
Date: Mon Feb 21 13:44:46 2005
New Revision: 154725
URL: http://svn.apache.org/viewcvs?view=rev&rev=154725
Log:
missed the moved files because I did not merge but ran a patch diff of trunk - my bad
Added:
incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/eureka.xml
incubator/directory/asn1/branches/rewrite/ber/xdocs/index.xml
incubator/directory/asn1/branches/rewrite/codec/xdocs/
incubator/directory/asn1/branches/rewrite/codec/xdocs/index.xml
incubator/directory/asn1/branches/rewrite/stub-compiler/xdocs/
incubator/directory/asn1/branches/rewrite/stub-compiler/xdocs/index.xml
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,552 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BER Decoder Design</title>
+ </properties>
+
+ <body>
+ <section name="Factors Driving the Design">
+ <p>
+ Several factors drove the design. Some were covered in advance in
+ the section on stateful codecs. While implementing these stateful
+ decoders we must still adhere to certain rules so as not to
+ undermine the purpose of building stateful decoders in the first
+ place. Other design decisions follow from the way the BER encoding
+ is devised. These driving factors are all covered here in this
+ document to present the current design while justifying the
+ decisions that forged it.
+ </p>
+
+ <subsection name="TLV Nesting">
+ <p>
+ A constructed TLV tuple contains other TLV tuples nested within the
+ Value field, and so on, recursively. The nesting depth is usually
+ indefinite and dependent on the data structures being transformed.
+ This fact drastically affects the way the decoder is designed and
+ hence operates.
+ </p>
+
+ <p>
+ It's always good form to tackle the base case(s) of any recursive
+ problem. The simplest base case is the processing of a single
+ primitive TLV tuple. So for the time being presume we are simply
+ implementing a primitive BER TLV decoder without any nesting.
+ </p>
+
+ <p>
+ While decoding a primitive tuple the decoder must maintain state.
+ State for this simple decoder is an indicator representing whether
+ the Tag, Length, or Value was being processed at the end of the last
+ chunk of substrate data. This is required along with any data
+ already captured for these fields to continue processing where the
+ last chunk left off. Value field data is not actually "accumulated"
+ to maintain state, but for the time being we'll ignore that fact.
+ </p>
+
+ <p>
+ The primitive TLV tuple decoder is easily designed. Build three
+ stateful sub-decoders, one each for the Tag, Length, and Value
+ fields. A top level decoder delegates the processing of each
+ field to its sub-decoder and switches sub-decoders when the
+ indicator changes from Tag, to Length, to Value then back to Tag
+ and so on. When a Value completes processing and before another Tag
+ is read the decoder triggers a callback event. Here's what the
+ hypothetical primitive TLV tuple decoder would look like:
+ </p>
+
+ <center>
+ <img src="../images/PrimitiveTupleDecoder-uml.gif"/>
+ </center>
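The scheme above can be sketched in a few lines. The following Python sketch is illustrative only (the real decoder is Java and its API differs); it assumes single-octet Tags and short-form Lengths just to show the field-switching indicator at work:

```python
# Hypothetical sketch of a primitive TLV tuple decoder: an indicator
# records which field -- Tag, Length, or Value -- was in progress when
# the last chunk of substrate ended, so decoding resumes where it left.
TAG, LENGTH, VALUE = "TAG", "LENGTH", "VALUE"

class PrimitiveTupleDecoder:
    def __init__(self, callback):
        self.callback = callback      # invoked when a Value completes
        self.state = TAG
        self.tag = bytearray()        # accumulated Tag octets
        self.length = None            # decoded Length (short form only)
        self.value = bytearray()      # accumulated here only for illustration

    def decode(self, chunk):
        for octet in chunk:
            if self.state == TAG:
                self.tag.append(octet)
                self.state = LENGTH   # assume single-octet tags in this sketch
            elif self.state == LENGTH:
                self.length = octet   # assume short-form lengths in this sketch
                if self.length == 0:
                    self.callback(bytes(self.tag), 0, b"")
                    self.tag.clear()
                    self.state = TAG
                else:
                    self.state = VALUE
            else:  # VALUE
                self.value.append(octet)
                if len(self.value) == self.length:
                    self.callback(bytes(self.tag), self.length, bytes(self.value))
                    self.tag.clear()
                    self.value.clear()
                    self.state = TAG  # back to Tag for the next tuple
```

How the substrate is chunked does not matter; the indicator plus the accumulated field bytes carry the state across chunk boundaries.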
+
+ <p>
+ Now let's try to figure out how to handle constructed TLV tuples
+ which recursively nest other tuples indefinitely. State now
+ is more than just where you left off in the current tuple being
+ processed. When processing an inner tuple the decoder must also know
+ where it left off in the outer tuple to resume processing. More
+ accurately the decoder must maintain the state of every parent and
+ ancestor tuple of the current tuple being processed in the same
+ manner the TLV tuple decoder did for primitive TLV tuples. Hence
+ the state of the decoder is a stack of all ancestor TLV tuple states
+ as well as the state of the tuple currently being processed.
+ </p>
+
+ <p>
+ While processing input for a nested TLV tuple the state of all
+ tuple ancestors must also be updated with the same data so the
+ decoder can determine when their processing is about to complete.
+ This way the decoder does not read TLV tuples adjacent to the
+ constructed TLV tuple, incorrectly presuming that they are part
+ of the constructed TLV tuple.
+ </p>
+
+ <p>
+ When the last inner tuple in a constructed TLV tuple completes, it
+ triggers a callback for itself, then the stack is popped and another
+ callback event is triggered for the completion of the constructed
+ TLV tuple.
+ </p>
+
+ <p>
+ In conclusion, the state of a BER decoder, used to process both
+ primitive and constructed TLV tuples, must take into account the
+ processing state of every tuple ancestor in the stack of nested
+ tuples. Otherwise state cannot be maintained. Just how this is
+ efficiently managed is the topic of the next few subsections.
+ </p>
+ </subsection>
+
+ <subsection name="Value Processing">
+ <p>
+ If the decoder accumulates encountered Value field octets to maintain
+ state then we have a problem. First off the size of the Value could
+ be massive and often varies. We want to maintain a fixed maximum
+ memory footprint to the decoder. This goes out the window if Value
+ field content is accumulated within buffers to maintain tuple
+ processing state. Furthermore, with nesting, every ancestor tuple in
+ the nesting stack would maintain a copy of the topmost tuple's Value
+ field when that tuple is about to complete processing. The number of
+ copies is a function of the nesting depth, so the deeper the nesting,
+ the more memory is wastefully consumed. This is totally unacceptable
+ and it undermines the reason for devising stateful codecs in the
+ first place.
+ </p>
+
+ <p>
+ To avoid this problem we must not accumulate Value field octets
+ (bytes) while maintaining state. Unlike the Value field, the other
+ Tag and Length fields are limited and often account for only a few
+ bytes within TLV tuples. To maintain state, however, the decoder
+ still has to perform some accounting to determine when outer tuples
+ in the nesting stack complete processing. The decoder maintains
+ state by counting the Value bytes processed rather than using an
+ accumulator to store them. This way the decoder can compare a
+ tuple's Length field with its Value byte counter to determine if
+ processing is complete.
+ </p>
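The counter-based accounting can be sketched as follows (Python, with assumed names; the real Tuple class keeps equivalent length and index fields). Note that a constructed tuple's Length covers the complete encodings of its nested tuples, so in the real decoder the Tag and Length octets of inner tuples are also credited to ancestors; the sketch below shows only the Value-byte case:

```python
class TupleState:
    """One stacked tuple: its declared Length and a count of bytes seen."""
    def __init__(self, length):
        self.length = length   # declared Length field
        self.index = 0         # octets consumed so far

def consume(stack, n):
    """Credit n octets to the tuple in scope and every ancestor, popping
    tuples whose byte counter now equals their declared Length."""
    for state in stack:
        state.index += n
    completed = []
    while stack and stack[-1].index == stack[-1].length:
        completed.append(stack.pop())   # innermost completes first
    return completed
```

For example, an outer constructed tuple of Length 7 wrapping a primitive of Length 5 (the outer tuple already credited 2 octets for the inner Tag and Length) completes in the same call that completes the inner tuple.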
+ </subsection>
+
+ <subsection name="Extending DecoderCallback">
+ <p>
+ The next question is how the decoder propagates TLV tuple Values to
+ the target receiving the TLV tuple stream? If the standard
+ <code>DecoderCallback.decodeOccurred()</code> method is designed to
+ be called upon TLV processing completion, how do we avoid collecting
+ the Value while still getting the Value bytes to a decoder user
+ via callbacks?
+ </p>
+
+ <p>
+ The answer is to use yet another callback. The DecoderCallback
+ interface is extended by actually adding three extra callback
+ methods: one for each field. The BERDecoderCallback interface
+ extends the DecoderCallback interface and adds the following
+ methods:
+ </p>
+
+ <ul>
+ <li>void tagDecoded( Tuple tlv )</li>
+ <li>void lengthDecoded( Tuple tlv )</li>
+ <li>void partialValueDecoded( Tuple tlv )</li>
+ </ul>
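Rendered as a Python sketch (the real interfaces are Java, and the adapter's name below is assumed rather than taken from the source), the callback hierarchy looks like this:

```python
class DecoderCallback:
    """Base callback: one notification per completed unit of decoding."""
    def decodeOccurred(self, tlv):
        pass

class BERDecoderCallback(DecoderCallback):
    """Adds the three field-level notifications listed above."""
    def tagDecoded(self, tlv):
        pass
    def lengthDecoded(self, tlv):
        pass
    def partialValueDecoded(self, tlv):
        pass

class BERDecoderCallbackAdapter(BERDecoderCallback):
    """Do-nothing adapter; subclass it and override only what you need."""
```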
+
+ <p>
+ The following diagram shows the decoder interfaces and a do nothing
+ adapter implementation for convenience:
+ </p>
+
+ <center>
+ <img src="../images/BERDecoderCallback.gif"/>
+ </center>
+
+ <p>
+ For a single TLV, all methods except the partialValueDecoded()
+ method are invoked at most once. As its name suggests, pieces of
+ the Value are delivered encapsulated by the Tuple argument rather
+ than the entire Value field. Hence the method can be called zero or more
+ times while processing a TLV.
+ </p>
+
+ <p>
+ The extended decoder callback interface allows the decoder to chunk
+ Value fields and hence maintain a fixed maximum footprint. The
+ partialValueDecoded callback is a bit misleading, however. It does
+ not actually decode any Value bytes based on the Tag of the TLV. It
+ simply hands off the raw value bytes to the callback; this part of
+ the decode is left to higher level decoders built on top of the
+ BERDecoder. However all primitive type decode operations are
+ provided by the BER codec.
+ </p>
+ </subsection>
+
+ <subsection name="Constructed Values">
+ <p>
+ The values of constructed TLV tuples are other tuples. Their Values
+ are already decoded by the BER decoder which triggers TLV events for
+ the nested TLV tuples. Calls to the partialValueDecoded() method
+ hence are never made for them. Furthermore the decoder transitions
+ from the Length processing state to the Tag state just after completing
+ the decode of a constructed tuple Length field. This is because the
+ next tuple to process is a nested tuple with its Tag following the
+ constructed tuple's Length field.
+ </p>
+
+ <p>
+ Constructed TLV tuples never have partialValueDecoded() called. Only
+ primitive TLV tuples have Value octets delivered to this callback.
+ This makes state handling within the decoder a bit tricky, but the
+ complexity is well worth the rewards it reaps.
+ </p>
+ </subsection>
+ </section>
+
+ <section name="State Management">
+ <p>
+ There are two parts to managing state for the BERDecoder: stack
+ based ancestor Value accounting state, and current tuple processing
+ state.
+ </p>
+
+ <subsection name="Current TLV State Management">
+ <p>
+ While processing the tuple in scope a type safe enumeration is used
+ to track the current tuple processing state, which can be any of
+ the Tag, Length, or Value processing states. Subordinate decoders
+ are used to decode the Tag and Length fields. The Value field,
+ unlike these fields, does not have a corresponding decoder: it does
+ not need one since primitive TLV Values are not decoded but returned raw.
+ The sub-decoders for Tag and Length manage the accumulation of field
+ bytes between chunked decode operations. The following diagram
+ displays the helper classes used to manage the current TLV processing
+ state:
+ </p>
+
+ <center>
+ <img src="../images/state-helper-classes.gif"/>
+ </center>
+
+ <table>
+ <tr><th>Color</th><th>Group</th></tr>
+ <tr><td>Red</td><td>Generic Helper Classes</td></tr>
+ <tr><td>Yellow</td><td>Tag Specific Classes</td></tr>
+ <tr><td>Purple</td><td>Length Specific Classes</td></tr>
+ </table>
+
+ <p>
+ The tag specific classes include the Tag and TagDecoder classes.
+ The Tag class handles the extraction of various Tag embedded fields
+ like the constructed bit and the tag's type class. It also collects
+ tag octets, up to four of them, using a special TagOctetCollector.
+ The TagDecoder is just a stateful decoder implementation wrapper
+ using Tag methods.
+ </p>
+
+ <p>
+ The TypeClass class is a type safe enumeration of the four type
+ classes of Tags:
+ </p>
+
+ <ul>
+ <li>UNIVERSAL</li>
+ <li>APPLICATION</li>
+ <li>CONTEXT_SPECIFIC</li>
+ <li>PRIVATE</li>
+ </ul>
+
+ <p>
+ Once the Tag accumulator collects all tag octets it determines and
+ sets the TypeClass corresponding to the tag.
+ </p>
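Per X.690, the first tag octet carries these fields in fixed bit positions: the top two bits select the type class, the next bit is the primitive/constructed flag, and the low five bits hold small ids (all ones there signals the long form). A sketch of the extraction (illustrative names, not the Tag class API):

```python
# The four type classes, in the order of their two-bit codes.
UNIVERSAL, APPLICATION, CONTEXT_SPECIFIC, PRIVATE = range(4)

def type_class(first_octet):
    return (first_octet & 0xC0) >> 6   # top two bits select the class

def is_constructed(first_octet):
    return bool(first_octet & 0x20)    # the primitive/constructed bit

def short_form_id(first_octet):
    return first_octet & 0x1F          # low five bits hold small ids
```

For instance, 0x30 (a SEQUENCE tag) is UNIVERSAL, constructed, id 16, while 0x04 (OCTET STRING) is UNIVERSAL, primitive, id 4.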
+
+ <p>
+ The TagEnum class is an abstract base class for type safe tag
+ enumerations. This special type safe enumeration associates a tag
+ label with two integers: the tag value and the tag id. The tag value
+ is an integer representation of the whole tag whereas the id is
+ just the id field of the tag. This, incidentally, is the main reason
+ why the TagOctetCollector only accepts four bytes for building the
+ tag: an integer is essentially used as the backing store for the tag
+ data. The reasons for this are explained within the tag handling
+ section to follow.
+ </p>
+
+ <p>
+ The Length and LengthDecoder operate in very much the same fashion
+ as do the Tag and TagDecoder. The same pattern is applied to both
+ pairs of classes. The primary difference is the use of a ByteBuffer
+ within the Length class, rather than a custom data structure like
+ the TagOctetCollector, to accumulate Length bytes (octets). The
+ main reason for this is that the four octet limit imposed on tags,
+ itself contrary to the BER specification, cannot be applied here:
+ Length values well above a 4 byte integer are surely possible for
+ TLV values, although improbable.
+ </p>
+
+ <p>
+ The BERDecoderState class is another type safe enumeration with the
+ following values: TAG, LENGTH and VALUE. It obviously represents the
+ processing state of the TLV tuple currently in scope.
+ </p>
+ </subsection>
+
+ <subsection name="Stack State Management">
+ <p>
+ The BERDecoder UML is displayed below to show some of the members
+ and operations available. Pay special attention to the tlvStack
+ member and the package friendly getTupleStack() Stack accessor used
+ for stack state management:
+ </p>
+
+ <center>
+ <img src="../images/BERDecoder.gif"/>
+ </center>
+
+ <p>
+ The tlvStack is a Stack of Tuple instances. The last subsection
+ contains a UML diagram with the Tuple class. Tuple objects are the
+ objects handed to the decodeOccurred() method of the callback.
+ They basically encapsulate all the information associated with a
+ TLV tuple in one object. This includes accounting information used
+ to determine the processing state of constructed TLVs. The Stack of
+ Tuples hence stores the state information associated with ancestor
+ Tuples currently out of scope.
+ </p>
+
+ <p>
+ With every chunk of substrate processed for the tuple currently in
+ scope, the accounting information in every Tuple of the Stack is
+ updated. Again, this tracks how much of each ancestor's Value field
+ has been processed. Specifically, the length and index fields of
+ Tuple objects are used to determine how much of the TLV has been
+ read.
+ </p>
+ </subsection>
+ </section>
+
+ <section name="Tuple Recycling">
+
+ <subsection name="TLV Tuple Density">
+ <p>
+ BER TLV streams will contain varying densities of TLV tuples. The
+ density of the tuples depends on the nature of the content. Streams
+ with many small primitive types crammed together will generate TLV
+ tuples very rapidly while processing the encoded substrate. Every
+ few bytes, sometimes every couple of bytes, might produce a new tuple.
+ </p>
+
+ <p>
+ If we instantiated a new Tuple instance and populated it for every
+ few bytes in the stream, then performance would degrade significantly
+ while processing streams with high TLV tuple densities. Furthermore,
+ rapid object creation rates would seriously tax the garbage
+ collector. To avoid these negative effects of instantiating new TLV
+ tuples we need to reuse the same Tuple, allowing interested parties
+ to either clone or copy the contained information while processing
+ the tuple. More often than not, most tuples will be ignored. It
+ would be wasteful to create a new Tuple object for every TLV tuple
+ encountered when some or most might potentially be ignored.
+ </p>
+ </subsection>
+
+ <subsection name="Problem With Recycling a Tuple">
+ <p>
+ If we avoid instantiating new TLV Tuples and reuse the same Tuple
+ object, we run into a problem: we will lose data when we
+ attempt to push the tuple onto the tlvStack as the next TLV is
+ processed.
+ </p>
+
+ <p>
+ One solution to this problem is to clone constructed Tuples before
+ pushing the tuple onto the tlvStack. Hence only primitives would
+ reuse the same Tuple. This works well because primitive tuple data
+ does not need to be maintained past its scope. If the data needs to
+ be copied, it can be copied by the application using the decoder.
+ This makes sense since the application determines which Tuple values
+ to store or ignore.
+ </p>
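The strategy can be sketched like so (illustrative Python; the library's Tuple class carries more fields than shown here):

```python
import copy

class Tuple:
    """Stripped-down stand-in for the library's Tuple class."""
    def __init__(self):
        self.tag = 0
        self.constructed = False
        self.length = 0

class TupleSource:
    """Hands out one reusable Tuple for primitives, but a fresh clone
    for constructed tuples that must outlive their scope on the tlvStack."""
    def __init__(self):
        self._reused = Tuple()

    def next(self, tag, constructed, length):
        t = self._reused
        t.tag, t.constructed, t.length = tag, constructed, length
        return copy.copy(t) if constructed else t  # clone constructed only
```

Primitive deliveries all reuse one instance, so a callback that wants to keep a primitive tuple's data must copy it before returning; constructed tuples are safe to hold since each is a fresh clone.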
+
+ <p>
+ This solves performance bottlenecks with substrates that are dense
+ with primitive tuples. However the problem will re-emerge if the
+ substrate is dense with deeply nested primitive tuples. If every
+ primitive is embedded deep within its own cavern of nested TLV
+ tuples then we'll be closer to instantiating a Tuple object for
+ almost every TLV encountered. The perfect substrate under this
+ scheme, of course, would be a single primitive element; beyond
+ that it would be flat nesting patterns where as many primitive TLV
+ tuples are stuffed into every constructed TLV tuple as possible.
+ </p>
+
+ <p>
+ A deeply embedded high density of constructed TLV tuples is highly
+ unlikely, although possible for recursive ASN.1 definitions. Even
+ in situations producing a high density of constructed TLV tuples,
+ the nesting structures will often share the same parents, so the
+ ratio of TLV tuples to Tuple object instantiations would rarely
+ approach 1:1.
+ </p>
+
+ <p>
+ Overall we cannot determine the ratio of constructed to primitive
+ TLV tuples encountered within a substrate. However one would like
+ to believe that complex structures do not predominate, and that
+ protocol designers opt for simpler structures whenever possible.
+ Under this sensible hypothesis, reuse of primitive TLV tuples and
+ cloning of constructed TLV tuples seems like a viable strategy for
+ managing excessive object instantiation.
+ </p>
+ </subsection>
+
+ </section>
+
+ <section name="Integer Representation For Tags">
+ <p>
+ According to the BER encoding specification, X.690, a Tag id can be
+ any arbitrary value: there is no limitation to the size of an id.
+ In practice ids are claimed incrementally by ASN.1 modules from the
+ CONTEXT_SPECIFIC and APPLICATION type classes. For any reasonable
+ protocol these amount to far fewer than 100 ids. Experts like
+ Larmouth claim they have never seen Tag ids larger than a thousand.
+ So we don't bother representing Tags within a buffer with the full
+ expressivity of the specification when we know of reasonable soft
+ limits on the Tag id.
+ </p>
+
+ <subsection name="Four Tag Octets Are Enough">
+ <p>
+ In most cases, one or two octets suffice for encoding a tag and its
+ identifier. In rare cases three bytes may be used. It's
+ highly improbable that we'll ever see four or more bytes used
+ to encode a tag: even the experts have never seen this before.
+ The only way I can conceive of this is if computers begin devising
+ or generating protocols :-).
+ </p>
+
+ <p>
+ According to the specification the long form can encode the
+ following maximum identifier sizes with various octet lengths:
+ </p>
+
+ <table>
+ <tr>
+ <th>Octets</th>
+ <th>Maximum Tag Id</th>
+ <th>Calculation</th>
+ </tr>
+
+ <tr>
+ <td>1</td>
+ <td>30</td>
+ <td>2^5-1</td>
+ </tr>
+
+ <tr>
+ <td>2</td>
+ <td>127</td>
+ <td>2^7-1</td>
+ </tr>
+
+ <tr>
+ <td>3</td>
+ <td>16,383</td>
+ <td>2^14-1</td>
+ </tr>
+
+ <tr>
+ <td>4</td>
+ <td>2,097,151</td>
+ <td>2^21-1</td>
+ </tr>
+
+ <tr>
+ <td>5</td>
+ <td>268,435,455</td>
+ <td>2^28-1</td>
+ </tr>
+ </table>
+
+ <p>
+ As we can see, 3-4 octets encode a maximum tag id we can live with.
+ One might expect the max tag id for, say, 4 octets to be 2^(4*8)-1,
+ but it's not. We lose some bits in order to encode a variable
+ length tag with the long form. In the long form none of the first
+ octet's bits contribute to the id, and a bit from each octet
+ thereafter is lost to terminate the tag field. Hence if we start
+ with 4 bytes or 32 bits, we're actually using 32-8-3, or 21, of
+ the original bits for storing the value of an id. This yields a max
+ id value of 2^21-1 for 32 bits or 4 octets.
+ </p>
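The table's figures follow from a simple formula: with n tag octets, the first octet contributes no id bits and each remaining octet contributes 7, while the single-octet short form tops out at 30 because the all-ones id flags the long form. A quick check:

```python
def max_tag_id(octets):
    """Maximum tag id encodable in the given number of tag octets."""
    if octets == 1:
        return 30                      # id 0x1F is reserved to flag long form
    return 2 ** (7 * (octets - 1)) - 1  # 7 usable id bits per extra octet
```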
+ </subsection>
+
+ <subsection name="Integer Encoding">
+ <p>
+ Tags are used to match TLV tuples. Nothing matches faster than an
+ integer in a switch statement. It makes sense to store and
+ manage raw Tag octets within the bytes of a primitive Java integer
+ rather than within a byte buffer. This way switch statements can be
+ used to quickly match TLV tuples based on the integer encoding of
+ the first four tag bytes. Furthermore the stub compiler can
+ prefabricate type safe enums whose values equal the integer encoding
+ of a tag's four octets. Matching TLV tuples by tag is then as
+ fast as it can get using this integer encoding. This, incidentally,
+ is the sole reason why we have the abstract class, TagEnum, which
+ extends ValuedEnum. It's a type safe enumeration for Tag octets
+ encoded as integers.
+ </p>
+
+ <p>
+ Encoding only the first four octets of the raw tag limits the
+ maximum value of the id that a TLV's tag field can represent to
+ 2^21-1. This was the reason for the discussion in the section above.
+ We simply will not need an id bigger than this. So we decided to
+ break with the specification and restrict the max value of the tag
+ id to 2^21-1 rather than leave it unbounded.
+ This limitation allows us to represent the first four octets of the
+ tag field as an integer thereby speeding up TLV pattern matching
+ considerably.
+ </p>
+
+ <p>
+ The TagOctetCollector is specifically designed to accumulate the four
+ octets of the Tag. It stores the first octet in the
+ most significant byte of the int, the second in the next most
+ significant and so on until the last of the four octets is stored
+ within the least significant byte of the integer. The diagram below
+ shows just how 4 bytes are assembled into the integer:
+ </p>
+
+ <center>
+ <img src="../images/tag-integer-encoding.png"/>
+ </center>
+
+ <p>
+ Note that if there were only 3 tag octets collected, then the bits
+ for Octet 4 would all be zero: bits 0-7 in the integer would be
+ zeros. Likewise if only one octet were used then bits 23-0 would
+ be zeroed out within the 32-bit integer.
+ </p>
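The packing is plain shifting: the first octet lands in the most significant byte, later octets in successively less significant bytes, and unused low bytes stay zero. A sketch (not the actual TagOctetCollector code):

```python
def pack_tag(octets):
    """Pack up to four tag octets into a 32-bit int, first octet in the
    most significant byte, unused trailing bytes left as zero."""
    assert 1 <= len(octets) <= 4, "the collector accepts at most four octets"
    value = 0
    for i, octet in enumerate(octets):
        value |= octet << (8 * (3 - i))  # octet 1 -> bits 31-24, and so on
    return value
```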
+
+ <p>
+ The integer encoding for tags is not leveraged here at the level of
+ the BERDecoder. At this low level the decoder does not care about
+ tags other than those in the UNIVERSAL type class reserved for
+ detecting TLV tuple termination sequences within the stream. Later,
+ within the BERDigester, where Tag pattern matching is used to make
+ sense of these TLV tuple streams, the integer encoding and the
+ TagEnum are used heavily. Rather than add more complexity to the
+ BERDecoder we stop here and build upon it by stacking another
+ decoder, the BERDigester, on top. The BERDecoder decodes encoded
+ substrate streams into TLV Tuples and announces their arrival by
+ callbacks which are received by the BERDigester. It is then up to
+ the BERDigester to process these TLV tuples using decoding rules
+ triggered by tag nesting patterns. How appropriate! Data encoded
+ using Basic Encoding Rules is decoded using rules that process a TLV
+ tuple stream. More information regarding the design of the
+ BERDigester can be found <a href="BERDigesterDesign.html">here</a>.
+ </p>
+ </subsection>
+ </section>
+ </body>
+</document>
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDecoderUserGuide.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BERDecoder Usage</title>
+ </properties>
+ <body>
+ <section name="Coming soon ... ">
+ </section>
+ </body>
+</document>
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BERDigesterDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,474 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BER Digester Design</title>
+ </properties>
+ <body>
+ <section name="Introduction">
+
+ <subsection name="What is it?">
+ <p>
+ The digester is a high level decoder that builds object
+ containment trees from a stream of events it receives via
+ callbacks. These callbacks notify of the start and end of BER tuples
+ composed of a Tag, a Length and a Value within the underlying stream.
+ Rules registered against Tag nesting patterns are used to build
+ containment trees using an object stack. A rule is triggered when
+ the current Tag nesting pattern matches a pattern registered with it.
+ </subsection>
+
+ <subsection name="Why is it called a digester?">
+ <p>
+ The idea for a higher level decoder that builds object containment
+ trees through the action of rules came when I was looking at the
+ Jakarta Commons (SAX event) Digester. SAX events are very similar
+ to the TLV callback events that are emitted by the BERDecoder.
+ The Commons Digester consumed these events. While processing these
+ events it tracked the nesting of XML elements and fired registered
+ rules when nesting patterns were matched. These rules performed
+ operations on the Digester's stack: objects were instantiated and
+ pushed, or populated and then popped to build the containment tree.
+ </p>
+
+ <p>
+ The BERDigester applies the same pattern as the Commons
+ Digester to a similar underlying event stream. Although BER Tuple
+ events carry completely different pieces of information, like SAX
+ event streams BER TLV event streams signal the nesting pattern of
+ underlying constructs within the stream. This reveals the
+ structure of an ASN.1 data type or, in the case of SAX, the structure
+ of an XML document. Both digesters consume events emanating from
+ some event source to build object containment trees based on the
+ nesting pattern revealed by those events. The containment trees are
+ built through the action of rules triggered by matching nesting
+ patterns. Just like the Commons Digester, the BERDigester uses
+ stacks to share objects across rule firings.
+ </p>
+
+ <p>
+ It's safe to say that there is such a thing as a digester pattern.
+ What better way exists to refer to the pattern and the pieces of
+ software implementing it than to just call them digesters? I could
+ not think of any and that's why we call it the BERDigester. The BER
+ simply distinguishes it from the Commons Digester so there is no
+ confusion. (Note: BERDigester may be renamed to TupleDigester)
+ </p>
+ </subsection>
+
+ <subsection name="Rules and Stacks Are Good">
+ <p>
+ Our ultimate goal is to build a stub compiler that generates
+ encoder/decoder pairs for the various ASN.1 encodings: BER, DER, XER,
+ and PER. The compiler also generates the classes used by a digester
+ to build the containment tree of the ASN.1 data structures
+ transmitted. Generation of the stubs is not that hard to do.
+ However building a decoder that assembles containment trees using
+ those stubs and an encoded stream is not easy to do.
+ </p>
+
+ <p>
+ Stacks help out by reducing the complexity of the problem. A stack
+ can be used first of all to share objects across rule firings.
+ Secondly, it's a perfect data structure for tracking nesting patterns
+ or managing operations on object containment trees. State and scope
+ are maintained easily using a stack.
+ </p>
+
+ <p>
+ If simple stack based instructions are devised to instantiate stub
+ objects, push them and pop them, along with primitives and other
+ classes, then the complexity involved in generating decoders
+ diminishes greatly. Furthermore, if those instructions were
+ declarative rules triggered by nesting patterns, then decoder
+ generation is, well, a cakewalk. All a decoder needs to do is
+ register rules, based on the tag nesting patterns associated with
+ the ASN.1 modules, with a BERDigester. A generic set of rules may be
+ created or rules can be generated to be registered. The generated
+ decoder would hence be the setup of the BERDigester by registering
+ rules with it for tag patterns. Code would also be added to funnel
+ data into the underlying BERDecoder as well as code to respond to
+ the callbacks generated by the top level PDU decoder.
+ </p>
+ </subsection>
+ </section>
+
+ <section name="BERDigester Design">
+ <subsection name="Nesting Stack">
+ <p>
+ The design documentation for the BERDecoder discusses an encoding
+ for Tags using the first four octets of a Tag field. These four
+ octets are stored within a primitive Java int. These ints represent
+ the raw Tag of a TLV tuple. We could push these raw Tag ints onto
+ a stack: let's for the time being call this primitive Java int stack
+ the tagStack. Every time a new TLV Tuple is delivered, the digester
+ pushes the Tag int onto the tagStack. When processing of the TLV
+ Tuple is finished, the tagStack is popped, removing the Tag int for
+ the Tuple. The sequence of Tag ints within the tagStack represents
+ the reverse nesting order of TLV Tuples. Rules are registered with
+ the digester using int[]s for patterns; these int[]s are compared
+ against the sequence of ints within the tagStack, hence the tag
+ nesting pattern. All rules registered with a matching pattern of
+ Tag ints are fired in the order they were registered.
+ </p>
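A minimal sketch of this bookkeeping (assumed names, Python for illustration, and a naive linear match over the registered patterns; the real library is Java):

```python
class TagStackDigester:
    """Push a tag int per opened tuple, pop on completion, and report
    rules whose registered pattern equals the current nesting sequence."""
    def __init__(self):
        self.tag_stack = []           # tag ints, outermost first
        self.rules = []               # (pattern, rule) pairs, in order

    def add_rule(self, pattern, rule):
        self.rules.append((list(pattern), rule))

    def tuple_opened(self, tag_int):
        self.tag_stack.append(tag_int)
        # fire every rule whose pattern matches, in registration order
        return [r for p, r in self.rules if p == self.tag_stack]

    def tuple_closed(self):
        self.tag_stack.pop()
```

The tag ints in the example below are hypothetical packed values, not taken from any real protocol module.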
+ <p>
+ This sounds pretty simple and straightforward at first glance, but
+ there are complications. The BER encoding provides two ways in
+ which string types can be encoded: as a primitive TLV or as a
+ constructed TLV. By string types, we refer to all character based
+ string types like IA5String as well as bit strings and octet strings.
+ String types can be encoded using a single TLV as a primitive type.
+ This way the entire string is kept within the value field. The
+ designers of BER were smart enough to realize that value chunking
+ would need to occur for large strings to enable efficient decoding
+ and encoding. Hence these string types can optionally be encoded as
+ constructed TLV tuples with smaller nested tuples carrying fragments
+ of the string. This way encoders would not need to calculate the
+ length of large strings in advance or keep the entire string in
+ memory. The value can be fragmented and chunked explicitly by the
+ encoding. This is great; however, the Tag int for a primitive TLV of
+ a string type will not equal the Tag int for the constructed TLV of
+ the same string type. Pattern matching cannot occur under these
+ circumstances.
+ </p>
+ <p>
+ This complication forced me to rethink the use of raw tag integers.
+ Instead of pushing the tag int as is onto the tagStack we could
+ always reduce the raw tag to some canonical form then push that int
+ instead. Furthermore, all patterns supplied to register rules must
+ also have their ints scrubbed within the int[] pattern, so pattern
+ matching will occur using only the canonical representation of Tag
+ ints rather than actual Tag integers.
+ </p>
+ <p>
+ The primitive/constructed flag exists on a Tag's first octet as a
+ single bit. We want to prevent variations in this bit from throwing
+ off our pattern matching. Choosing a canonical form is like picking
+ whether to lower case or upper case letters before
+ attempting a comparison. In our case :-), we have chosen the
+ canonical form to be the tag with this primitive/constructed bit
+ turned off. So before any comparison occurs all tag integers will
+ be guaranteed to have this bit flag turned off. In fact all TagEnum
+ int values prefabricated by the compiler will have the Tag int
+ scrubbed, meaning the primitive/constructed bit will be in the off
+ position and hence in the canonical form.
+ </p>
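The scrubbing step described above can be sketched in a few lines. This is an illustrative sketch rather than the codec's actual code: it assumes the raw tag int packs the first identifier octet into the most significant byte, so the BER primitive/constructed flag (0x20 in that octet) lands at bit position 0x20000000.

```java
public class TagScrub {
    // BER: the primitive/constructed flag is 0x20 in the first identifier
    // octet; with the assumed int layout it sits in the top byte.
    private static final int CONSTRUCTED_BIT = 0x20 << 24;

    /** Canonical form: the tag int with the P/C bit turned off. */
    public static int scrub(int rawTag) {
        return rawTag & ~CONSTRUCTED_BIT;
    }

    public static void main(String[] args) {
        int primitive   = 0x04000000;                // e.g. OCTET STRING, primitive
        int constructed = primitive | (0x20 << 24);  // same type, constructed
        // Both forms reduce to the same canonical int, so pattern
        // matching is unaffected by the encoding choice.
        assert scrub(primitive) == scrub(constructed);
    }
}
```

With this in place both the primitive and the constructed encoding of a string type reduce to the same canonical int before any pattern comparison occurs.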
+ </subsection>
+
+ <subsection name="Pattern Matching">
+ <p>
+ To trigger rules the digester checks every new tag nesting pattern
+ to see if it matches any of the patterns registered with a rule.
+ Multiple patterns may be registered with a rule to allow it to fire
+ under different contexts. With every incoming event which changes
+ the tag nesting pattern, the digester must see if the tag nesting
+ pattern matches any of the patterns for every registered rule. This
+ could be an expensive operation to have to perform with every event
+ if the rule and pattern base is large. Every registered rule pattern
+ would have to be tested for equality with the tag nesting pattern to
+ determine if the corresponding rule needs to fire.
+ </p>
+
+ <p>
+ To keep pattern matching overheads constant regardless of the number
+ of rules registered, a special data structure, a search tree called
+ the TagTree is used. The tree is composed of TagNodes and is built
+ as rules and their patterns are registered. The tree's tag nodes
+ simply correspond to a tag value, and have a list of rules. Tag
+ nodes maintain a hash of children keyed by their tag value. Here's
+ an example of a TagTree for the following rule pattern registrations:
+ </p>
+
+ <table>
+ <tr>
+ <th>Rule</th>
+ <th>Pattern</th>
+ </tr>
+
+ <tr>
+ <td>R0</td>
+ <td>[1,2,3,4,5]</td>
+ </tr>
+
+ <tr>
+ <td>R1</td>
+ <td>[1,2,5]</td>
+ </tr>
+
+ <tr>
+ <td>R2</td>
+ <td>[1,2,3]</td>
+ </tr>
+
+ <tr>
+ <td>R3</td>
+ <td>[4,4,5]</td>
+ </tr>
+
+ <tr>
+ <td>R0</td>
+ <td>[4,1,2,4]</td>
+ </tr>
+ </table>
+
+ <br></br>
+
+ <center>
+ <img src="../images/TagTree.png"/>
+ </center>
+
+ <p>
+ Whenever possible nodes in the tree are reused while adding new
+ patterns and rules to the tree. Each new pattern added to the tree
+ corresponds to a new unique path within the tree. Rule registration
+ is where the true cost of the tree's construction is incurred. In
+ the solid state the tree is only used to rapidly search for the set
+ of rules to fire based on the current nesting pattern. The search
+ occurs by walking the tree with the current nesting pattern. For
+ example a walk using the [1,2,3] nesting pattern takes us from the
+ root to tag node 1, then 2 and finally tag node 3. Once the pattern
+ is exhausted, meaning all tag values are consumed in the walk, the
+ rules at the final node are returned as the matching set that need to
+ be fired. If there are no rules at that node, then the set is empty.
+ If the pattern "walks off the tree", where no child with a tag value
+ is available to continue the walk on the tree, the search ends
+ returning the empty set.
+ </p>
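The walk described above can be sketched with a minimal node type and two operations. Class and method names here are illustrative, not the actual implementation:

```java
import java.util.*;

// Minimal sketch of the TagTree: nodes keyed by tag value, each
// holding the rules registered to fire at that nesting pattern.
public class TagTree {
    private static class TagNode {
        final Map<Integer, TagNode> children = new HashMap<>();
        final List<String> rules = new ArrayList<>();
    }

    private final TagNode root = new TagNode();

    /** Registration builds (or reuses) one path per pattern. */
    public void addRule(int[] pattern, String rule) {
        TagNode node = root;
        for (int tag : pattern) {
            node = node.children.computeIfAbsent(tag, t -> new TagNode());
        }
        node.rules.add(rule);
    }

    /** Walk the tree with the nesting pattern; O(m) in the pattern size. */
    public List<String> match(int[] nesting) {
        TagNode node = root;
        for (int tag : nesting) {
            node = node.children.get(tag);
            if (node == null) {
                return Collections.emptyList(); // walked off the tree
            }
        }
        return node.rules; // may be empty if no rule ends here
    }

    public static void main(String[] args) {
        TagTree tree = new TagTree();
        tree.addRule(new int[] {1, 2, 3, 4, 5}, "R0");
        tree.addRule(new int[] {1, 2, 3}, "R2");
        assert tree.match(new int[] {1, 2, 3}).equals(List.of("R2"));
        assert tree.match(new int[] {1, 2, 4}).isEmpty(); // walked off
    }
}
```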
+
+ <p>
+ Without the tag tree every pattern registered with a rule would have
+ to be compared. This is an O(nm) operation where n is the number of
+ pattern registrations and m is the size of the pattern. With the tag
+ tree the search time is only O(m), where only the size of the pattern
+ determines the search time. This is much more acceptable.
+ </p>
+
+ <p>
+ The only remaining problem at this point is how to implement wild
+ cards with pattern matching. First of all a special reserved tag
+ int value must be selected as the wild card ("*") value so it does
+ not conflict with any valid tags we might encounter. I think any of
+ the UNIVERSAL tags above 32 can be used for this but I would have to
+ check. Plus there are four kinds of wildcard use cases which are
+ possible:
+ </p>
+
+ <ol>
+ <li>wildcard at the front of the pattern</li>
+ <li>wildcard at the end of the pattern</li>
+ <li>wildcard in the middle of the pattern</li>
+ <li>two or more wildcards in a pattern</li>
+ </ol>
+
+ <p>
+ There is no way we will accommodate all these use cases. First off,
+ some of them will be rarely if ever used. Secondly, their
+ implementation would either be impossible or too difficult for the
+ return in expressivity. At this point I plan to
+ implement any of these use cases on an as needed basis. However
+ the two noteworthy use cases that stick out are use cases #1 and #2.
+ #2 is useful for rules used to build ASN.1 strings that are broken
+ apart using the constructed BER encoding. #1 will be useful as well.
+ Sometimes a pattern may occur multiple times in different contexts
+ and you want to detect them all using a single rule. In this case,
+ a pattern with a wildcard at the front (case #1) does the job.
+ Furthermore recursive ASN.1 definitions as those found with the
+ LDAP SearchRequest, will require wildcard use case #1.
+ Use cases #3 and #4 are very difficult to implement not to mention
+ expensive in terms of processing. They just don't seem worth it.
+ </p>
+ </subsection>
+
+ <subsection name="Pattern Matching with Wild Cards">
+ <p>
+ The LDAP SearchRequest requires the use case where the wild card is
+ in the middle of the rule matching pattern because of a recursive
+ ASN.1 definition for Filter expressions. This means the nesting could
+ be arbitrarily deep so long as the end sequence of tag nesting
+ matches the tail of the pattern and the start sequence matches the
+ front. The use of a wild card at the front of the pattern would
+ also satisfy this use case although with less specificity. However
+ it would be easier to implement so we're going to shoot for wild
+ cards in the front of a pattern first which corresponds to use case
+ #1 above.
+ </p>
+
+ <p>
+ Matching for patterns with wild cards in front requires the use of
+ another searchable TagTree similar to the one used for patterns
+ without wild cards. This new TagTree is built using the reverse
+ sequence of the wild card pattern's tail. So if a rule is added using
+ the wild card pattern, [*,6,3,9], a branch is created in an empty
+ TagTree in the order 9-3-6. Besides just adding the reverse
+ branch the tree must be searched for any existing branches that
+ end with 9, 3 and 6. So if there was a branch 9-3-6-4 based on the
+ pattern [*,4,6,3,9] we would add the rule to node 4 in this branch
+ as well as to node 6 in branch 9-3-6 created by pattern [*,6,3,9].
+ The fact that node 9-3-6-4 contains the rule can be inferred from
+ it being a child of 9-3-6 while conducting a search to return both
+ rules. The addition of the rule to both nodes 6 and 4 is a bit
+ redundant, however this is done during rule addition, so we do not
+ have to compute this and collect extra rules while walking the tree
+ to find all matching rules. We want to follow a strategy where the
+ amount of object creation and computation is minimal while searching.
+ Whatever needs to be calculated to avoid the penalty during a search
+ we do during rule pattern registration. This strategy is followed
+ to conduct searches for matching rules as fast as possible requiring
+ the minimum amount of work. Here's an example of what a tree with
+ wild cards might look like when registrations with the following
+ patterns and rules are applied:
+ </p>
+
+ <table>
+ <tr>
+ <th>Rule</th>
+ <th>Wild Carded Pattern</th>
+ </tr>
+
+ <tr>
+ <td>R0</td>
+ <td>[*,6,3,9]</td>
+ </tr>
+
+ <tr>
+ <td>R1</td>
+ <td>[*,4,3,9]</td>
+ </tr>
+
+ <tr>
+ <td>R2</td>
+ <td>[*,1,2,5]</td>
+ </tr>
+
+ <tr>
+ <td>R3</td>
+ <td>[*,1,2]</td>
+ </tr>
+
+ <tr>
+ <td>R4</td>
+ <td>[*,3,1,2]</td>
+ </tr>
+
+ </table>
+
+ <br></br>
+
+ <center>
+ <img src="../images/WildTagTree.png"/>
+ </center>
+
+ <p>
+ Furthermore rules added via wild cards are also added to all nodes
+ satisfying the pattern in the primary tag tree used for patterns
+ without wild cards. This is done again to avoid having to
+ gather rules while conducting the search. It also avoids having to
+ instantiate a List for every event if we are not gathering rules but
+ just returning an existing list from one of the trees. By adding the
+ rule with the wild card to the tree of patterns without wild cards
+ any node that is selected already accounts for firing rules
+ registered using patterns with wild cards. Hence the other tree with
+ wild cards does not need to be searched. Again this is somewhat
+ redundant to do but it allows the return of the List at a single node
+ without having to gather rules into a newly instantiated List. This
+ all makes more sense when we look at the modified matching algorithm
+ used:
+ </p>
+
+ <ol>
+ <li>walk TagTree without wild cards with nesting pattern</li>
+ <li>if we find a TagNode for the pattern return rules and be done</li>
+ <li>if no match, take the reverse pattern and walk wild tag tree</li>
+ <li>last node before falling off of tree is the matching node</li>
+ <li>if no node matched before walking off then return empty set</li>
+ <li>otherwise return the rules in this last node</li>
+ </ol>
+
+ <p>
+ The way we match is different for both TagTrees. In the first we're
+ walking the tree using the pattern and if we fall off then we return
+ the empty set. In the second tree with the wild cards, called the
+ WildTagTree, we walk the tree using the reversed pattern tail without
+ the wild card. We also consider ourselves matching if we can start
+ a walk at all into some node. Walking off of the WildTagTree selects
+ the last node we were on as the matching node. This node's rules need
+ to be fired. Furthermore in the WildTagTree search we return the
+ empty set only if we cannot even begin a search.
+ </p>
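The reversed-tail walk can be sketched as follows. This is a simplified illustration with assumed names; per the discussion above, the real implementation also pre-propagates rules into child nodes at registration time, which this sketch omits:

```java
import java.util.*;

// Sketch of the WildTagTree: patterns [*,t1,...,tn] are stored as the
// reversed tail tn-...-t1, and matching walks the reversed nesting pattern.
public class WildTagTree {
    private static class Node {
        final Map<Integer, Node> children = new HashMap<>();
        final List<String> rules = new ArrayList<>();
    }

    private final Node root = new Node();

    /** Register the tail of a [*,...] pattern by building its reverse path. */
    public void addWildRule(int[] tail, String rule) {
        Node node = root;
        for (int i = tail.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(tail[i], t -> new Node());
        }
        node.rules.add(rule);
    }

    /**
     * Walk with the reversed nesting pattern; the last node reached before
     * walking off is the match.  Empty only if even the first step fails.
     */
    public List<String> match(int[] nesting) {
        Node node = root;
        List<String> last = Collections.emptyList();
        for (int i = nesting.length - 1; i >= 0; i--) {
            Node next = node.children.get(nesting[i]);
            if (next == null) {
                return last; // fell off: last node's rules are the match
            }
            node = next;
            last = node.rules;
        }
        return last;
    }

    public static void main(String[] args) {
        WildTagTree tree = new WildTagTree();
        tree.addWildRule(new int[] {6, 3, 9}, "R0"); // pattern [*,6,3,9]
        // nesting [7,7,6,3,9] ends in 6,3,9 so R0 fires
        assert tree.match(new int[] {7, 7, 6, 3, 9}).contains("R0");
        // nesting [6,3,8] does not end with the tail; nothing fires
        assert tree.match(new int[] {6, 3, 8}).isEmpty();
    }
}
```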
+
+ <p>
+ Also note that since rules for wild card patterns are added to
+ the matching nodes of the regular TagTree, a match in the first
+ TagTree short-circuits the search of the second WildTagTree. Another
+ walk in the WildTagTree is not necessary. However, if the walk on the
+ first TagTree falls off, then a search on the WildTagTree is
+ required. By taking this approach we're either returning the list of
+ rules for a TagTree node or returning the list of rules for a
+ WildTagTree node. We never need to create a new list to collect
+ rules in it this way. The final result set equals the list of rules
+ for some node in any of the two trees.
+ </p>
+ </subsection>
+
+ <subsection name="Composition Stacks">
+ <p>
+ When rules fire they do something either to a stack or to an object
+ on a stack. Either way rules pop, push or peek onto digester stacks
+ to do their bidding. These stacks are used to build the final
+ product which is a containment tree and hence are referred to as
+ composition stacks.
+ </p>
+ <p>
+ The digester contains several stacks. It contains a primitive Java
+ type stack for every Java primitive type. There is a stack for
+ booleans, bytes, chars, shorts, ints, longs, floats, and doubles.
+ In addition to these stacks there is another Object stack. Multiple
+ primitive stacks were used to avoid having to wrap primitive types
+ in their object wrappers just to push them onto an Object stack.
+ Some might think this is excessive, and others may think it
+ increases complexity. If the use of a single Object stack is
+ thought through then it becomes apparent that neither is the case.
+ </p>
+ <p>
+ If a single Object stack were to be used then rules dealing with
+ primitive types would need to wrap them with their respective Object
+ wrappers before pushing them onto the stack. When popping Objects
+ from the stack rules must know what type the popped object is
+ expected to be. If multiple stacks were used it would be much
+ easier to just get the primitive from another stack rather than
+ popping the Object stack and casting the popped Object to the Object
+ wrapper just to extract the primitive value. Having a primitive
+ stack for each Java primitive type is a PITA for the developer, but
+ it simplifies things for the user. Plus there's the advantage of not
+ having to wrap and extract primitive types just to pass them around
+ in the stack.
+ </p>
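A primitive stack of the kind discussed is only a few lines of code. This is an illustrative sketch with an assumed class name, showing how push, pop and peek work on raw ints with no Integer boxing:

```java
import java.util.EmptyStackException;

// Illustrative per-primitive stack the digester could keep for ints,
// avoiding Integer wrapper allocation on every push and pop.
public class IntStack {
    private int[] buf = new int[16];
    private int size;

    public void push(int value) {
        if (size == buf.length) { // grow by doubling when full
            int[] grown = new int[buf.length * 2];
            System.arraycopy(buf, 0, grown, 0, size);
            buf = grown;
        }
        buf[size++] = value;
    }

    public int pop() {
        if (size == 0) throw new EmptyStackException();
        return buf[--size];
    }

    public int peek() {
        if (size == 0) throw new EmptyStackException();
        return buf[size - 1];
    }

    public static void main(String[] args) {
        IntStack tagStack = new IntStack();
        tagStack.push(0x10); // e.g. a scrubbed SEQUENCE tag
        tagStack.push(0x02); // nested INTEGER tag
        assert tagStack.pop() == 0x02;
        assert tagStack.peek() == 0x10;
    }
}
```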
+ </subsection>
+
+ <subsection name="Decoder Chaining/Stacking">
+ <p>
+ If one looks closely the digester contains a BERDecoder member. This
+ decoder is the source of all TLV Tuple events. The digester is
+ designed as a decoder itself that feeds incoming byte buffers to this
+ underlying BERDecoder. An inner class, DigesterCallback, is used to
+ implement a BERDecoderCallback. This inner class bridges BERDecoder
+ TLV Tuple events delivering them to the digester. The digester in
+ turn updates its tag nesting stack, and processes the new TLV
+ announced with the event. The digester detects the end of a
+ Protocol Data Unit (PDU) when it pops the last TLV tuple off of the
+ tagStack. When this happens the constructed object which should
+ have been popped off of the Object stack is returned using a
+ callback as the root of the containment tree.
+ </p>
+
+ <p>
+ The digester is a great example of how decoders can be chained or
+ stacked together in almost a linear combination. It's like stacking
+ filters on top of one another.
+ </p>
+ </subsection>
+
+ </section>
+ </body>
+</document>
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderDesign.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,183 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BEREncoder Design</title>
+ </properties>
+ <body>
+ <section name="Planning">
+ <p>
+ Trying to figure out how to keep encoder design simple, symmetric and
+ extensible. At this point this section is really just a brain dump.
+ </p>
+
+ <subsection name="Layering Encoders">
+ <p>
+ It might be a good idea to separate encoder functionality into
+ separate encoder layers. This way we can isolate operations to
+ divide the responsibilities of each encoder, keeping each encoder
+ simple. We are talking about stacking multiple encoders on top of
+ each other.
+ </p>
+
+ <p>
+ The most primitive tier could be an encoder concerned with writing
+ chunks out to a channel. On top of that may reside another encoder
+ that converts TLV Tuple Tag, Length and Value events into a stream
+ of chunked buffers. High level encoders can be designed to take
+ stub instances as input to produce a stream of TLV tuple events.
+ These events are then piped into the low level Tuple event encoder,
+ then written out as chunks to a channel. We really will not be
+ concerned with where the chunks go; the event producer-consumer
+ pipeline is the primary concern for us.
+ </p>
+
+ <p>
+ There are several benefits to this approach. Here's a few of them:
+ </p>
+
+ <ul>
+ <li>
+ low level tuple encoder is very simple
+ </li>
+ <li>
+ event streams generated by low level decoders can be piped
+ directly into low level encoders for round trip testing
+ </li>
+ <li>
+ low level tuple encoder can be very efficient and fast because of
+ its simplicity
+ </li>
+ <li>
+ the extra layer of encoding allows for the injection of other
+ facilities like tlv transformation or validation for DER
+ </li>
+ <li>
+ a lower primitive TLV event handling layer simplifies the design
+ and implementation of higher level encoders concerned with the
+ encoding of composite data types
+ </li>
+ <li>
+ the low level encoder isolates buffering and chunking related
+ aspects where other stacked encoders built upon it do not have to
+ be concerned with this
+ </li>
+ <li>
+ Stub specific code does not mix with generic tuple processing code.
+ </li>
+ </ul>
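The layering idea can be sketched as a tiny event pipeline. Everything here is hypothetical (names, the single-shot event, the single-octet tag and short form length): it only illustrates a producer firing TLV events into a low level encoder that serializes bytes toward a channel stand-in.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical event interface a higher level (stub specific) encoder
// would fire into; simplified to one event per complete primitive TLV.
interface TupleEvents {
    void tlv(int tag, byte[] value);
}

// Low level layer: turns events into bytes, knowing nothing about stubs.
class LowLevelEncoder implements TupleEvents {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();

    @Override
    public void tlv(int tag, byte[] value) {
        // Single-octet tag and short form length only, for illustration.
        out.write(tag);
        out.write(value.length);
        out.write(value, 0, value.length);
    }
}

public class EncoderStack {
    public static void main(String[] args) {
        LowLevelEncoder low = new LowLevelEncoder();
        // A stub specific encoder would sit here, translating a stub
        // instance into events; we fire one event by hand instead.
        low.tlv(0x02, new byte[] {0x05}); // INTEGER 5
        byte[] encoded = low.out.toByteArray();
        assert encoded.length == 3;
        assert encoded[0] == 0x02 && encoded[1] == 0x01 && encoded[2] == 0x05;
    }
}
```

Because the low level layer only consumes events, a decoder's event stream can be piped straight back into it for round trip testing, which is one of the benefits listed above.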
+
+ <p>
+ This low level TLV Tuple event stream encoder will need to receive
+ TLV events somehow. The encoder hence must implement some kind of
+ callback. The fact that it is a callback means it receives the
+ substrate this way rather than through the conventional encode()
+ pathway. What then happens to the idea of an encode() method that
+ maintains symmetry with the decoder's decode() method?
+ </p>
+
+ <p>
+ The encode() method may still need to be used to turn the substrate
+ (TLV tuples) into bytes in a buffer. This may be called externally
+ as well as by the callback methods implemented by the encoder. The
+ callback methods might use encode() after or before performing some
+ housekeeping operations. There are however holes left wide open
+ with this approach which could interfere with tuple nesting. Basically
+ we will have two pathways for TLV delivery: first through the general
+ encode() method, second through the TLV event delivery methods. Which
+ one is authoritative? Plus the TLV event methods are geared to handle
+ TLV nesting. Receiving a Tuple via the encode() method could
+ interfere with nesting and proper Tuple serialization into a buffer.
+ </p>
+
+ <p>
+ Perhaps to deal with these pitfalls we need to devise some standards
+ around usage. The encode() method could be implemented to only allow
+ the encoding of primitive Tuples. Still there is the possibility
+ of interfering with nesting and encoding order. An open question
+ remains: do we need all TLV events for the low level encoder?
+ </p>
+
+ <p>
+ For an exercise let's look at usage patterns from the top down. At
+ the topmost level an encoder will be devised to multiplex many
+ specific encoders. This top level encoder will map stub interfaces
+ to other stub specific encoders with knowledge of the ASN.1 data type
+ and the stub. When calls are made to encode(Object) the topmost
+ encoder will use the stub interface of the Object to lookup an
+ appropriate stub specific encoder. This topmost encoder hence
+ multiplexes other encoders based on the argument to encode. The stub
+ specific encoder receives the substrate as the argument to the
+ encode(Object) method. It generates a series of TLV tuple events
+ using a special callback interface to convey the Tuple nesting
+ pattern. The standard EncoderCallback is not sufficient for this.
+ The topmost multiplexing encoder must receive the callback events
+ of the stub specific encoders and call its own callback as if it
+ generated each event. The presence of the stub specific encoders is
+ transparent. Below this system the low level encoder receives the
+ TLV tuple events generated and serializes them within buffers emitting
+ buffers as encoded objects to its own callback. Here's how this all
+ looks:
+ </p>
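A rough sketch of the multiplexing arrangement, with all class and method names assumed for illustration: the topmost encoder maps a stub interface to its stub specific encoder and delegates each encode(Object) call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stub specific encoder contract; the real codec would
// fire TLV tuple events through a callback rather than return anything.
interface StubEncoder {
    void encode(Object stub);
}

public class MultiplexingEncoder {
    private final Map<Class<?>, StubEncoder> encoders = new HashMap<>();

    public void register(Class<?> stubInterface, StubEncoder encoder) {
        encoders.put(stubInterface, encoder);
    }

    /** Look up the encoder by the stub's interface and delegate to it. */
    public void encode(Object stub) {
        for (Class<?> iface : stub.getClass().getInterfaces()) {
            StubEncoder enc = encoders.get(iface);
            if (enc != null) {
                enc.encode(stub); // its TLV events appear as our own
                return;
            }
        }
        throw new IllegalArgumentException("no encoder for " + stub.getClass());
    }

    public static void main(String[] args) {
        MultiplexingEncoder mux = new MultiplexingEncoder();
        final boolean[] called = { false };
        // Runnable stands in for a generated stub interface here.
        mux.register(Runnable.class, stub -> called[0] = true);
        mux.encode((Runnable) () -> { });
        assert called[0];
    }
}
```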
+ </subsection>
+ </section>
+
+ <section name="Determinate Length Encoder Design">
+ <subsection name="Problems and possible solutions">
+ <p>
+ Creating a determinate length encoder without sacrificing efficiency
+ is not easy. Making the code easy to manage and read is yet another
+ more difficult matter. This is already very difficult to manage with
+ the state machine approach we have taken. Furthermore we have found
+ many clients and servers which reject the use of the indeterminate
+ form even though according to the spec BER allows for such variance in
+ encodings.
+ </p>
+
+ <p>
+ Efficiency is difficult to achieve because we need to know the lengths
+ of nested (TLV) nodes to build constructed nodes with a determinate
+ length encoding. Since the topmost constructed TLV is the first out
+ the door we cannot transmit it until all nested TLV nodes have been
+ generated with their lengths already calculated. A brute force
+ approach might produce a TLV tree first before serializing the output
+ to a stream. This way all nested TLVs on up can have their length
+ fields computed depth first. This means keeping the entire transfer
+ image in memory as well as the structures needed to manage a tree.
+ Although DoS attacks are not as much of a concern for the encoding
+ phase as they are for decoding in a server, the approach would still
+ result in very inefficient encode operations especially when large
+ PDUs are being transmitted either by a server or a client. Plus there
+ is the fact that a PDU stub already exists with the same copy of the
+ information making it highly likely that more than 2 times the
+ transfer image will be required.
+ </p>
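The depth first length computation at the heart of the brute force approach can be sketched as follows. This is only an illustration with assumed names, and it pretends every tag and length fits in one octet each so the arithmetic stays readable:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the brute force approach: build the whole TLV tree, then
// compute V lengths depth first before emitting a single byte.
public class TlvNode {
    final List<TlvNode> children = new ArrayList<>();
    byte[] value; // non-null for primitive nodes

    /** Length of this node's V field, computed depth first. */
    int valueLength() {
        if (value != null) {
            return value.length;
        }
        int total = 0;
        for (TlvNode child : children) {
            // 1 tag octet + 1 short form length octet + child's value
            total += 2 + child.valueLength();
        }
        return total;
    }

    public static void main(String[] args) {
        TlvNode seq = new TlvNode();
        TlvNode a = new TlvNode(); a.value = new byte[] {1};
        TlvNode b = new TlvNode(); b.value = new byte[] {2, 3};
        seq.children.add(a);
        seq.children.add(b);
        // (2 + 1) + (2 + 2) = 7 octets inside the sequence
        assert seq.valueLength() == 7;
    }
}
```

The sketch also makes the cost visible: the whole tree, and hence roughly the whole transfer image, sits in memory before the first octet goes out.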
+
+ <p>
+ We must ask ourselves if there is any way to avoid keeping the
+ entire transfer image in memory. Alan and Alex have discussed the
+ use of referrals to large binary data rather than keeping the data
+ in memory during codec operation. A referral would correspond to a
+ channel, or a stream to recall or store the data in question. This
+ way large binary values can be streamed from or to disk. Eventually
+ stubs will support these references although we do not have the
+ mechanism completely defined yet. If the same reference can be held
+ in the place of the V field of a TLV then we can avoid having more
+ than 2X the transfer image in memory. This however will not be the
+ case when PDUs are tiny with small fields well below the threshold
+ used to gauge when disk streaming is to occur. This is still however
+ one means to keep the in-memory footprint down when PDUs with large
+ fields are encoded.
+ </p>
+
+ </subsection>
+
+
+ </section>
+ </body>
+</document>
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/BEREncoderUserGuide.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>BEREncoder Usage</title>
+ </properties>
+ <body>
+ <section name="Coming soon ... ">
+ </section>
+ </body>
+</document>
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/asn1berinfo.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,89 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>ASN.1 and BER Information</title>
+ </properties>
+ <body>
+
+ <section name="ASN.1 and BER Information">
+ <subsection name="Background">
+ <p>
+ The BER encoding for ASN.1 was defined within ITU specification
+ X.690 along with the Canonical and Distinguished Encoding Rules.
+ A copy of this document along with other useful documents and books
+ on ASN.1 and its encodings can be obtained for free here:
+ </p>
+ </subsection>
+
+ <subsection>
+ <table>
+ <tr><th>Document</th><th>Description</th></tr>
+ <tr>
+ <td>
+ <a href="http://lesbeshy.notlong.com">X.690 (07/02)</a>
+ </td>
+ <td>
+ Information technology - ASN.1 encoding rules: Specification of
+ Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and
+ Distinguished Encoding Rules (DER)
+ </td>
+ </tr>
+
+ <tr>
+ <td>
+ <a href="http://offburie.notlong.com">X.680 (07/02)</a>
+ </td>
+ <td>
+ Information technology - Abstract Syntax Notation One (ASN.1):
+ Specification of basic notation
+ </td>
+ </tr>
+
+ <tr>
+ <td>
+ <a href="http://www.oss.com/asn1/bookreg.html">ASN.1 Complete</a>
+ </td>
+ <td>
+ A verbose yet truly complete book on ASN.1 and various encoding
+ mechanisms. Easy to read since the author takes an almost
+ conversational tone.
+ </td>
+ </tr>
+
+ <tr>
+ <td>
+ <a href="http://www.oss.com/asn1/bookreg2.html">
+ ASN.1 - Communication between heterogeneous systems</a>
+ </td>
+ <td>
+ Also a very complete book on ASN.1 and various encoding mechanisms.
+ A little more difficult to read but seems to be much better
+ organized and more exacting. I use both books in conjunction
+ often switching between the two based on my mood :-). Both are
+ most excellent - thanks to both authors for graciously providing
+ their books online.
+ </td>
+ </tr>
+ </table>
+ </subsection>
+
+ <subsection name="BER Tag, Value, Length Tuples">
+ <p>
+ BER stands for Basic Encoding Rules. These rules describe how to
+ encode and decode basic data types and composite data structures
+ to and from TLV streams. A TLV hence may be primitive (atomic) or
+ constructed (nested) where the value component contains other TLVs.
+ The T is for the Tag, a numeric type identifier; the L is for the
+ length of the data carried in the third component, V, the value.
+ Outside of this very trivial introduction there is very little to
+ the encoding. Readers should look at the relatively short
+ specification for reference regarding the exact encoding for
+ various data types using TLV tuples. The books above also have
+ appendices for the various encodings which are longer than the
+ actual specification yet more explanatory.
+ </p>
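A tiny worked example makes the T, L and V parts concrete. This sketch encodes an OCTET STRING in the primitive form only, restricted to a single tag octet and the short form length (so values must stay under 128 octets); class and method names are illustrative.

```java
public class TlvExample {
    /** Primitive, short-form encoding of a UNIVERSAL 4 OCTET STRING. */
    public static byte[] encodeOctetString(byte[] value) {
        byte[] tlv = new byte[2 + value.length];
        tlv[0] = 0x04;                // T: UNIVERSAL 4, primitive form
        tlv[1] = (byte) value.length; // L: short form, value < 128 octets
        System.arraycopy(value, 0, tlv, 2, value.length); // V: the data
        return tlv;
    }

    public static void main(String[] args) {
        byte[] tlv = encodeOctetString(new byte[] {0x68, 0x69}); // "hi"
        assert tlv.length == 4;
        assert tlv[0] == 0x04 && tlv[1] == 0x02;
    }
}
```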
+ </subsection>
+ </section>
+ </body>
+</document>
\ No newline at end of file
Added: incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml
URL: http://svn.apache.org/viewcvs/incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml?view=auto&rev=154725
==============================================================================
--- incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml (added)
+++ incubator/directory/asn1/branches/rewrite/ber/xdocs/design.xml Mon Feb 21 13:44:46 2005
@@ -0,0 +1,168 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+ <properties>
+ <author email="akarasulu@apache.org">Alex Karasulu</author>
+ <title>Design Documentation</title>
+ </properties>
+ <body>
+ <section name="Introduction">
+ <p>
+ We're designing the decoder piece using the stateful decoder interfaces
+ that were carved up for the commons-codec API within their stateful
+ package. The stateful decoder is designed to maintain state while
+ processing arriving chunks of TLV data.
+ </p>
+
+ <p>
+ There are several issues confronted by the decoder while decoding a
+ stream of TLV trees where TLV tuples are nested within each other. In
+ general this decoder is viewed as a simple line parser for TLV tuples
+ notifying of their arrival via callbacks.
+ </p>
+
+ <p>
+ The BER encoder and decoder (codec) will be designed to operate very
+ much like the Simple API for XML (SAX). It will generate events which
+ are calls on a handler as it encounters low level encoded structures
+ called Tag Length Value (TLV) tuples.
+ </p>
+ <p>
+ Rather than return values, which could be extremely large, in one piece,
+ the decoder returns pieces of a value until it completes
+ processing the entire V of the TLV. This makes the decoder highly
+ attractive for servers using non-blocking IO and SocketChannels. This
+ design gives it a small decoding footprint regardless of the size of
+ the Protocol Data Unit (PDU) being processed. It also makes it much
+ faster since the decoder deals with a small simple task without much
+ conditional logic for processing a PDU. We hope the combined benefits
+ of non-blocking IO and this sleek codec, make any BER based protocol
+ server extremely responsive under heavy loads with massive concurrency.
+ </p>
+ </section>
+
+ <section name="Requirements">
+ <p>
+ The decoder must be fast, have a fixed memory footprint, and be simple.
+ It should perform only one task: notifying content handlers via
+ callbacks of the arrival of TLV tuples. While doing so it must maintain
+ state in between calls to decode a chunk of arriving BER encoded data.
+ </p>
+
+ <p>
+ It should not try to interpret the content of the TLV tuples. These
+ aspects are left to be handled by higher level content based facilities
+ that build on top of the BERDecoder. These higher facilities provide
+ their own callbacks to build on TLV events. The SnickersDecoder which
+ transforms ASN.1 BER TLV Tuples into messages uses the BERDecoder in
+ this way to give meaning to the arriving content.
+ </p>
+ </section>
+
+ <section name="Object Reuse and Using Primitive Types">
+ <p>
+ The density of TLV tuples encountered during decoder operation will
+ vary based on message characteristics. One of the most involved aspects
+ to the decoder is to spit out TLV tuples when emitting events.
+ </p>
+
+ <p>
+ We could just instantiate a new TLV tuple object for every tuple but
+ this would slow the decoder down and increase the memory footprint
+ making it less efficient. For this reason we decided to reuse the same
+ TLV tuple to deliver TLV data via notification events on the callback.
+ The callback implementation must copy the tuple's data if it intends to
+ save the TLV for use later. Otherwise the decoder will overwrite the
+ TLV members on the next event. We leave the option to copy the TLV
+ up to the higher level facility; that way only those TLV tuples of
+ interest, known only to the content specific handler, can be copied.
+ <b>Why waste space and time on events that are not of interest?</b>
+ </p>
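The reuse contract can be shown in a few lines. This is a sketch with an assumed Tuple shape, not the actual class: the decoder overwrites one shared instance per event, so a callback that cares about a tuple must copy it before the next event arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the reused TLV tuple the decoder delivers via callbacks.
class Tuple implements Cloneable {
    int tag;
    byte[] value;

    @Override
    public Tuple clone() {
        Tuple copy = new Tuple();
        copy.tag = tag;
        copy.value = value == null ? null : value.clone();
        return copy;
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        Tuple shared = new Tuple();   // the decoder's single instance
        List<Tuple> kept = new ArrayList<>();

        shared.tag = 0x02; shared.value = new byte[] {1};
        kept.add(shared.clone());     // interested in this one: copy it

        shared.tag = 0x04; shared.value = new byte[] {2}; // next event overwrites
        assert kept.get(0).tag == 0x02; // the copy is unaffected
    }
}
```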
+
+ <p>
+ The most complex part of the decoder deals with maintaining state while
+ decoding. Data can arrive at any time to contain any part of a TLV or
+ multiple TLVs along with parts of others. Often the fragmentation
+ signature to the data along with its size will not be known.
+ Furthermore the nesting of TLVs must be tracked while maintaining state.
+ A stack is used to track the nesting of TLV tuples within a TLV tree.
+ </p>
+
+ <p>
+ We do not instantiate TLV tuple objects so pushing the one TLV instance
+ we reuse is pointless. We could take one of two approaches to handle
+ this issue. First we could just create a new instance only for those TLV
+ tuples that nest others and hence need to be pushed onto the stack. Or
+ we can use multiple primitive stacks based on an int to store the set
+ of values contained in the tuple. The second approach leads to greater
+ complexity while the first leads to some overhead in extra instantiation
+ time and memory, which is really negligible. Which approach is best
+ depends on the number of members in the tuple, or in other words the
+ number of primitive int stacks used.
+ </p>
+
+ <p>
+ We wrote a little test to figure out when one approach outperforms the
+ other in the ObjectVersePrimitiveTest stress test. From tinkering
+ with the parameters of the test case we found the use of primitives to
+ outperform tuple object instantiation when the number of member stacks
+ is less than or equal to 2. If the number of stacks used is 3 or more
+ then instantiating a constructed TLV object and pushing it onto one
+ stack is a better strategy. In our case we have 3 pieces of information
+ that need to be pushed and popped together, so from this test data the
+ choice is clear. We clone the TLV tuple or instantiate a new one for
+ constructed TLVs that are pushed onto a single stack. This is faster
+ and removes the need to manage multiple stacks, making the code less
+ complex.
+ </p>
+ </section>
+
+ <section name=" To be continued ... ">
+ <p>
+ More to come soon ...
+ </p>
+ </section>
+
+<!-- Some extra material
+ <p>
+ The decoder must be aware of its current state. The following states are
+ possible in between TLV tuples:
+</p>
+<ol>
+ <li>
+ Composing Tag - The decoder starts in this mode. This is the mode for
+ collecting bytes for the Tag of the TLV in scope. Bytes are copied into
+ a temporary Tag byte buffer until the Tag data is complete. Once the
+ bytes are complete and the Tag int is generated along with a couple of other
+ values (boolean isPrimitive and TypeClassEnum), the Tag byte buffer is
+ cleared and the state switches to the composing length state. If the Tag
+ data is incomplete and not available until the next chunk of data arrives
+ via another decode call, we remain suspended in this state collecting the
+ tag bytes. There are no pushes onto any of the stacks during this state.
+ </li>
+ <li>
+ Composing Length - The decoder starts reading data into a Length byte
+ buffer until the Length data is complete whether the length data is the
+ short, long or indeterminate form. Once the bytes for the length arrive
+ and the Length int is generated, the Length byte buffer is cleared. If
+ the Tag represents a primitive type the state switches to the composing
+ value state. If the TLV is constructed, the tag and the length are pushed
+ onto their respective stacks. The state then switches to the composing Tag
+ state. If the length data is incomplete and not available until the next
+ chunk of data arrives via another decode call, we remain suspended in this
+ state.
+ </li>
+ <li>
+ Composing Value - The decoder is in this mode because the TLV is
+ primitive. In this state only two forms of length are valid: the short
+ and long lengths. In either case we just read the number of bytes
+ specified for the length during the length composing state. If the value
+ data is incomplete and not available until the next chunk of data arrives
+ via another decode call, we remain suspended in this state. We transition
+ to the composing Tag state from this state when the value data arrives.
+ Before transitioning to the composing Tag state the stacks are popped and
+ the TLV is delivered as a completion event to the callback.
+ </li>
+</ol>
+ -->
+ </body>
+</document>
\ No newline at end of file