Posted to commits@avro.apache.org by cu...@apache.org on 2010/03/04 00:09:54 UTC
svn commit: r918758 - in /hadoop/avro/trunk: CHANGES.txt doc/src/content/xdocs/spec.xml
Author: cutting
Date: Wed Mar 3 23:09:54 2010
New Revision: 918758
URL: http://svn.apache.org/viewvc?rev=918758&view=rev
Log:
AVRO-438. Clarify spec. Contributed by Amichai Rothman.
Modified:
hadoop/avro/trunk/CHANGES.txt
hadoop/avro/trunk/doc/src/content/xdocs/spec.xml
Modified: hadoop/avro/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/avro/trunk/CHANGES.txt?rev=918758&r1=918757&r2=918758&view=diff
==============================================================================
--- hadoop/avro/trunk/CHANGES.txt (original)
+++ hadoop/avro/trunk/CHANGES.txt Wed Mar 3 23:09:54 2010
@@ -10,6 +10,8 @@
AVRO-439. Remove unused headers from being checked in configure.in
(Bruce Mitchener via massie)
+ AVRO-438. Clarify spec. (Amichai Rothman via cutting)
+
BUG FIXES
AVRO-424. Fix the specification of the deflate codec.
Modified: hadoop/avro/trunk/doc/src/content/xdocs/spec.xml
URL: http://svn.apache.org/viewvc/hadoop/avro/trunk/doc/src/content/xdocs/spec.xml?rev=918758&r1=918757&r2=918758&view=diff
==============================================================================
--- hadoop/avro/trunk/doc/src/content/xdocs/spec.xml (original)
+++ hadoop/avro/trunk/doc/src/content/xdocs/spec.xml Wed Mar 3 23:09:54 2010
@@ -58,14 +58,14 @@
<title>Primitive Types</title>
<p>The set of primitive type names is:</p>
<ul>
- <li><code>string</code>: unicode character sequence</li>
- <li><code>bytes</code>: sequence of 8-bit unsigned bytes</li>
+ <li><code>null</code>: no value</li>
+ <li><code>boolean</code>: a binary value</li>
<li><code>int</code>: 32-bit signed integer</li>
<li><code>long</code>: 64-bit signed integer</li>
<li><code>float</code>: single precision (32-bit) IEEE 754 floating-point number</li>
<li><code>double</code>: double precision (64-bit) IEEE 754 floating-point number</li>
- <li><code>boolean</code>: a binary value</li>
- <li><code>null</code>: no value</li>
+ <li><code>bytes</code>: sequence of 8-bit unsigned bytes</li>
+ <li><code>string</code>: unicode character sequence</li>
</ul>
<p>Primitive types have no specified attributes.</p>
@@ -115,12 +115,12 @@
<table class="right">
<caption>field default values</caption>
<tr><th>avro type</th><th>json type</th><th>example</th></tr>
- <tr><td>string</td><td>string</td><td>"foo"</td></tr>
- <tr><td>bytes</td><td>string</td><td>"\u00FF"</td></tr>
+ <tr><td>null</td><td>null</td><td>null</td></tr>
+ <tr><td>boolean</td><td>boolean</td><td>true</td></tr>
<tr><td>int,long</td><td>integer</td><td>1</td></tr>
<tr><td>float,double</td><td>number</td><td>1.1</td></tr>
- <tr><td>boolean</td><td>boolean</td><td>true</td></tr>
- <tr><td>null</td><td>null</td><td>null</td></tr>
+ <tr><td>bytes</td><td>string</td><td>"\u00FF"</td></tr>
+ <tr><td>string</td><td>string</td><td>"foo"</td></tr>
<tr><td>record</td><td>object</td><td>{"a": 1}</td></tr>
<tr><td>enum</td><td>string</td><td>"FOO"</td></tr>
<tr><td>array</td><td>array</td><td>[1]</td></tr>
@@ -308,20 +308,10 @@
<title>Primitive Types</title>
<p>Primitive types are encoded in binary as follows:</p>
<ul>
- <li>a <code>string</code> is encoded as
- a <code>long</code> followed by that many bytes of UTF-8
- encoded character data.
- <p>For example, the three-character string "foo" would
- be encoded as the long value 3 (encoded as
- hex <code>06</code>) followed by the UTF-8 encoding of
- 'f', 'o', and 'o' (the hex bytes <code>66 6f
- 6f</code>):
- </p>
- <source>06 66 6f 6f</source>
- </li>
- <li><code>bytes</code> are encoded as
- a <code>long</code> followed by that many bytes of data.
- </li>
+ <li><code>null</code> is written as zero bytes.</li>
+ <li>a <code>boolean</code> is written as a single byte whose
+ value is either <code>0</code> (false) or <code>1</code>
+ (true).</li>
<li><code>int</code> and <code>long</code> values are written
using <a href="ext:vint">variable-length</a>
<a href="ext:zigzag">zig-zag</a> coding. Some examples:
@@ -347,10 +337,20 @@
to <a href="http://java.sun.com/javase/6/docs/api/java/lang/Double.html#doubleToLongBits%28double%29">Java's
doubleToLongBits</a> and then encoded in little-endian
format.</li>
- <li>a <code>boolean</code> is written as a single byte whose
- value is either <code>0</code> (false) or <code>1</code>
- (true).</li>
- <li><code>null</code> is written as zero bytes.</li>
+ <li><code>bytes</code> are encoded as
+ a <code>long</code> followed by that many bytes of data.
+ </li>
+ <li>a <code>string</code> is encoded as
+ a <code>long</code> followed by that many bytes of UTF-8
+ encoded character data.
+ <p>For example, the three-character string "foo" would
+ be encoded as the long value 3 (encoded as
+ hex <code>06</code>) followed by the UTF-8 encoding of
+ 'f', 'o', and 'o' (the hex bytes <code>66 6f
+ 6f</code>):
+ </p>
+ <source>06 66 6f 6f</source>
+ </li>
</ul>
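The primitive encodings above — zig-zag variable-length longs and length-prefixed UTF-8 strings — can be sketched in a few lines of Python (an illustrative helper, not part of this commit or of any Avro implementation):

```python
def encode_long(n: int) -> bytes:
    """Zig-zag map a signed integer, then emit little-endian base-128 varint bytes."""
    n = (n << 1) ^ (n >> 63)          # zig-zag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

print(encode_string("foo").hex(" "))  # 06 66 6f 6f
```

The final print reproduces the spec's "foo" example: the length 3 zig-zags to hex 06, followed by the bytes 66 6f 6f.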
</section>
@@ -409,122 +409,113 @@
count zero indicates the end of the array. Each item is
encoded per the array's item schema.</p>
- <p>If a block's count is negative, then the count is
- followed immediately by a <code>long</code>
- block <em>size</em>, indicating the number of bytes in the
- block. The actual count in this case is the absolute value
- of the count written.</p>
-
- <p>For example, the array schema</p>
- <source>{"type": "array", "items": "long"}</source>
- <p>an array containing the items 3 and 27 could be encoded
- as the long value 2 (encoded as hex 04) followed by long
- values 3 and 27 (encoded as hex <code>06 36</code>)
- terminated by zero:</p>
- <source>04 06 36 00</source>
-
- <p>The blocked representation permits one to read and write
- arrays larger than can be buffered in memory, since one can
- start writing items without knowing the full length of the
- array. The optional block sizes permit fast skipping
- through data, e.g., when projecting a record to a subset of
- its fields.</p>
+ <p>If a block's count is negative, its absolute value is used,
+ and the count is followed immediately by a <code>long</code>
+ block <em>size</em> indicating the number of bytes in the
+ block. This block size permits fast skipping through data,
+ e.g., when projecting a record to a subset of its fields.</p>
+
+ <p>For example, given the array schema</p>
+ <source>{"type": "array", "items": "long"}</source>
+ <p>an array containing the items 3 and 27 could be encoded
+ as the long value 2 (encoded as hex 04) followed by long
+ values 3 and 27 (encoded as hex <code>06 36</code>)
+ terminated by zero:</p>
+ <source>04 06 36 00</source>
+
+ <p>The blocked representation permits one to read and write
+ arrays larger than can be buffered in memory, since one can
+ start writing items without knowing the full length of the
+ array.</p>
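A minimal single-block writer for the array example above might look like this in Python (a sketch for illustration only; it omits the negative-count/block-size form described in the preceding paragraph):

```python
def encode_long(n: int) -> bytes:
    # zig-zag, then base-128 varint
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def encode_long_array(items) -> bytes:
    # One block: item count, each item, then a zero count as terminator.
    return encode_long(len(items)) + b"".join(map(encode_long, items)) + encode_long(0)

print(encode_long_array([3, 27]).hex(" "))  # 04 06 36 00
```

The output matches the worked example: count 2 (hex 04), items 3 and 27 (hex 06 36), and the terminating zero.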
- </section>
+ </section>
- <section>
+ <section>
<title>Maps</title>
<p>Maps are encoded as a series of <em>blocks</em>. Each
block consists of a <code>long</code> <em>count</em>
value, followed by that many key/value pairs. A block
with count zero indicates the end of the map. Each item
is encoded per the map's value schema.</p>
-
- <p>If a block's count is negative, then the count is
- followed immediately by a <code>long</code>
- block <em>size</em>, indicating the number of bytes in the
- block. The actual count in this case is the absolute value
- of the count written.</p>
-
- <p>The blocked representation permits one to read and write
- maps larger than can be buffered in memory, since one can
- start writing items without knowing the full length of the
- map. The optional block sizes permit fast skipping through
- data, e.g., when projecting a record to a subset of its
- fields.</p>
-
- <p><em>NOTE: Blocking has not yet been fully implemented and
- may change. Arbitrarily large objects must be easily
- writable and readable but until we have proven this with an
- implementation and tests this part of the specification
- should be considered draft.</em></p>
+
+ <p>If a block's count is negative, its absolute value is used,
+ and the count is followed immediately by a <code>long</code>
+ block <em>size</em> indicating the number of bytes in the
+ block. This block size permits fast skipping through data,
+ e.g., when projecting a record to a subset of its fields.</p>
+
+ <p>The blocked representation permits one to read and write
+ maps larger than can be buffered in memory, since one can
+ start writing items without knowing the full length of the
+ map.</p>
+
</section>
<section>
<title>Unions</title>
- <p>A union is encoded by first writing a <code>long</code>
- value indicating the zero-based position within the
- union of the schema of its value. The value is then
- encoded per the indicated schema within the union.</p>
- <p>For example, the union
- schema <code>["string","null"]</code> would encode:</p>
+ <p>A union is encoded by first writing a <code>long</code>
+ value indicating the zero-based position within the
+ union of the schema of its value. The value is then
+ encoded per the indicated schema within the union.</p>
+ <p>For example, the union
+ schema <code>["string","null"]</code> would encode:</p>
<ul>
<li><code>null</code> as the integer 1 (the index of
- "null" in the union, encoded as
- hex <code>02</code>): <source>02</source></li>
+ "null" in the union, encoded as
+ hex <code>02</code>): <source>02</source></li>
<li>the string <code>"a"</code> as zero (the index of
- "string" in the union), followed by the serialized string:
- <source>00 02 61</source></li>
+ "string" in the union), followed by the serialized string:
+ <source>00 02 61</source></li>
</ul>
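The two union examples above can be reproduced with a small Python sketch (illustrative only, hard-coded to the `["string","null"]` schema):

```python
def encode_long(n: int) -> bytes:
    # zig-zag, then base-128 varint
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

UNION = ["string", "null"]

def encode_union(value) -> bytes:
    if value is None:
        return encode_long(UNION.index("null"))    # index 1, encoded as hex 02
    data = value.encode("utf-8")
    return encode_long(UNION.index("string")) + encode_long(len(data)) + data

print(encode_union(None).hex(" "))   # 02
print(encode_union("a").hex(" "))    # 00 02 61
```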
</section>
<section>
<title>Fixed</title>
- <p>Fixed instances are encoded using the number of bytes
- declared in the schema.</p>
+ <p>Fixed instances are encoded using the number of bytes
+ declared in the schema.</p>
</section>
- </section> <!-- end complex types -->
+ </section> <!-- end complex types -->
</section>
<section id="json_encoding">
<title>JSON Encoding</title>
-
- <p>Except for unions, the JSON encoding is the same as is used
- to encode <a href="#schema_record">field default
- values</a>.</p>
+
+ <p>Except for unions, the JSON encoding is the same as is used
+ to encode <a href="#schema_record">field default
+ values</a>.</p>
- <p>The value of a union is encoded in JSON as follows:</p>
+ <p>The value of a union is encoded in JSON as follows:</p>
- <ul>
- <li>if its type is <code>null</code>, then it is encoded as
- a JSON null;</li>
- <li>otherwise it is encoded as a JSON object with one
- name/value pair whose name is the type's name and whose
- value is the recursively encoded value. For Avro's named
- types (record, fixed or enum) the user-specified name is
- used, for other types the type name is used.</li>
- </ul>
-
- <p>For example, the union
- schema <code>["null","string","Foo"]</code>, where Foo is a
- record name, would encode:</p>
+ <ul>
+ <li>if its type is <code>null</code>, then it is encoded as
+ a JSON null;</li>
+ <li>otherwise it is encoded as a JSON object with one
+ name/value pair whose name is the type's name and whose
+ value is the recursively encoded value. For Avro's named
+ types (record, fixed or enum) the user-specified name is
+ used, for other types the type name is used.</li>
+ </ul>
+
+ <p>For example, the union
+ schema <code>["null","string","Foo"]</code>, where Foo is a
+ record name, would encode:</p>
<ul>
<li><code>null</code> as <code>null</code>;</li>
<li>the string <code>"a"</code> as
- <code>{"string": "a"}</code>; and</li>
+ <code>{"string": "a"}</code>; and</li>
<li>a Foo instance as <code>{"Foo": {...}}</code>,
where <code>{...}</code> indicates the JSON encoding of a
Foo instance.</li>
</ul>
- <p>Note that a schema is still required to correctly process
- JSON-encoded data. For example, the JSON encoding does not
- distinguish between <code>int</code>
- and <code>long</code>, <code>float</code>
- and <code>double</code>, records and maps, enums and strings,
- etc.</p>
+ <p>Note that a schema is still required to correctly process
+ JSON-encoded data. For example, the JSON encoding does not
+ distinguish between <code>int</code>
+ and <code>long</code>, <code>float</code>
+ and <code>double</code>, records and maps, enums and strings,
+ etc.</p>
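The JSON union wrapping described above amounts to a one-branch object keyed by the type name; a hypothetical Python helper (not from the Avro libraries) makes the rule concrete:

```python
import json

def json_encode_union(value, branch_type: str) -> str:
    # null encodes as plain JSON null; every other branch is wrapped
    # in a single name/value pair keyed by the branch's type name.
    if branch_type == "null":
        return "null"
    return json.dumps({branch_type: value})

print(json_encode_union(None, "null"))    # null
print(json_encode_union("a", "string"))   # {"string": "a"}
```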
</section>
@@ -534,59 +525,59 @@
<title>Sort Order</title>
<p>Avro defines a standard sort order for data. This permits
- data written by one system to be efficiently sorted by another
- system. This can be an important optimization, as sort order
- comparisons are sometimes the most frequent per-object
- operation. Note also that Avro binary-encoded data can be
- efficiently ordered without deserializing it to objects.</p>
+ data written by one system to be efficiently sorted by another
+ system. This can be an important optimization, as sort order
+ comparisons are sometimes the most frequent per-object
+ operation. Note also that Avro binary-encoded data can be
+ efficiently ordered without deserializing it to objects.</p>
<p>Data items may only be compared if they have identical
- schemas. Pairwise comparisons are implemented recursively
- with a depth-first, left-to-right traversal of the schema.
- The first mismatch encountered determines the order of the
- items.</p>
+ schemas. Pairwise comparisons are implemented recursively
+ with a depth-first, left-to-right traversal of the schema.
+ The first mismatch encountered determines the order of the
+ items.</p>
<p>Two items with the same schema are compared according to the
- following rules.</p>
+ following rules.</p>
<ul>
- <li><code>int</code>, <code>long</code>, <code>float</code>
- and <code>double</code> data is ordered by ascending numeric
- value.</li>
- <li><code>boolean</code> data is ordered with false before true.</li>
- <li><code>null</code> data is always equal.</li>
- <li><code>string</code> data is compared lexicographically by
- Unicode code point. Note that since UTF-8 is used as the
- binary encoding for strings, sorting of bytes and string
- binary data is identical.</li>
- <li><code>bytes</code> and <code>fixed</code> data are
- compared lexicographically by unsigned 8-bit values.</li>
- <li><code>array</code> data is compared lexicographically by
- element.</li>
- <li><code>enum</code> data is ordered by the symbol's position
- in the enum schema. For example, an enum whose symbols are
- <code>["z", "a"]</code> would sort <code>"z"</code> values
- before <code>"a"</code> values.</li>
- <li><code>union</code> data is first ordered by the branch
- within the union, and, within that, by the type of the
- branch. For example, an <code>["int", "string"]</code>
- union would order all int values before all string values,
- with the ints and strings themselves ordered as defined
- above.</li>
- <li><code>record</code> data is ordered lexicographically by
- field. If a field specifies that its order is:
- <ul>
- <li><code>"ascending"</code>, then the order of its values
- is unaltered.</li>
- <li><code>"descending"</code>, then the order of its values
- is reversed.</li>
- <li><code>"ignore"</code>, then its values are ignored
- when sorting.</li>
- </ul>
- </li>
- <li><code>map</code> data may not be compared. It is an error
- to attempt to compare data containing maps unless those maps
- are in an <code>"order":"ignore"</code> record field.
- </li>
+ <li><code>null</code> data is always equal.</li>
+ <li><code>boolean</code> data is ordered with false before true.</li>
+ <li><code>int</code>, <code>long</code>, <code>float</code>
+ and <code>double</code> data is ordered by ascending numeric
+ value.</li>
+ <li><code>bytes</code> and <code>fixed</code> data are
+ compared lexicographically by unsigned 8-bit values.</li>
+ <li><code>string</code> data is compared lexicographically by
+ Unicode code point. Note that since UTF-8 is used as the
+ binary encoding for strings, sorting of bytes and string
+ binary data is identical.</li>
+ <li><code>array</code> data is compared lexicographically by
+ element.</li>
+ <li><code>enum</code> data is ordered by the symbol's position
+ in the enum schema. For example, an enum whose symbols are
+ <code>["z", "a"]</code> would sort <code>"z"</code> values
+ before <code>"a"</code> values.</li>
+ <li><code>union</code> data is first ordered by the branch
+ within the union, and, within that, by the type of the
+ branch. For example, an <code>["int", "string"]</code>
+ union would order all int values before all string values,
+ with the ints and strings themselves ordered as defined
+ above.</li>
+ <li><code>record</code> data is ordered lexicographically by
+ field. If a field specifies that its order is:
+ <ul>
+ <li><code>"ascending"</code>, then the order of its values
+ is unaltered.</li>
+ <li><code>"descending"</code>, then the order of its values
+ is reversed.</li>
+ <li><code>"ignore"</code>, then its values are ignored
+ when sorting.</li>
+ </ul>
+ </li>
+ <li><code>map</code> data may not be compared. It is an error
+ to attempt to compare data containing maps unless those maps
+ are in an <code>"order":"ignore"</code> record field.
+ </li>
</ul>
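Two of the less obvious rules above — enums ordered by symbol position rather than alphabetically, and unions ordered by branch before value — can be sketched as three-way comparators in Python (an illustration of the rules, not an Avro comparator implementation):

```python
def compare(a, b) -> int:
    # generic three-way compare: -1, 0, or 1
    return (a > b) - (a < b)

def compare_enum(a: str, b: str, symbols) -> int:
    # enum data orders by symbol position in the schema
    return compare(symbols.index(a), symbols.index(b))

def compare_union(a, b) -> int:
    # a and b are (branch_index, value) pairs: branch first, then value
    return compare(a[0], b[0]) or compare(a[1], b[1])

print(compare_enum("z", "a", ["z", "a"]))   # -1: "z" sorts before "a"
print(compare_union((0, 99), (1, "a")))     # -1: all ints before all strings
```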
</section>
@@ -594,39 +585,39 @@
<title>Object Container Files</title>
<p>Avro includes a simple object container file format. A file
has a schema, and all objects stored in the file must be written
- according to that schema. Objects are stored in blocks that may
- be compressed. Syncronization markers are used between blocks
- to permit efficient splitting of files for MapReduce
- processing.</p>
+ according to that schema, using binary encoding. Objects are
+ stored in blocks that may be compressed. Synchronization markers
+ are used between blocks to permit efficient splitting of files
+ for MapReduce processing.</p>
<p>Files may include arbitrary user-specified metadata.</p>
<p>A file consists of:</p>
<ul>
- <li>A <em>file header</em>, followed by</li>
- <li>one or more <em>file data blocks</em>.</li>
+ <li>A <em>file header</em>, followed by</li>
+ <li>one or more <em>file data blocks</em>.</li>
</ul>
<p>A file header consists of:</p>
<ul>
- <li>Four bytes, ASCII 'O', 'b', 'j', followed by 1.</li>
- <li><em>file metadata</em>, including the schema.</li>
- <li>The 16-byte, randomly-generated sync marker for this file.</li>
+ <li>Four bytes, ASCII 'O', 'b', 'j', followed by 1.</li>
+ <li><em>file metadata</em>, including the schema.</li>
+ <li>The 16-byte, randomly-generated sync marker for this file.</li>
</ul>
<p>File metadata consists of:</p>
<ul>
- <li>A long indicating the number of metadata key/value pairs.</li>
- <li>For each pair, a string key and bytes value.</li>
+ <li>A long indicating the number of metadata key/value pairs.</li>
+ <li>For each pair, a string key and bytes value.</li>
</ul>
<p>All metadata properties that start with "avro." are reserved.
The following file metadata properties are currently used:</p>
<ul>
- <li><strong>avro.schema</strong> contains the schema of objects
- stored in the file, as JSON data (required).</li>
- <li><strong>avro.codec</strong> the name of the compression codec
- used to compress blocks, as a string. Implementations
+ <li><strong>avro.schema</strong> contains the schema of objects
+ stored in the file, as JSON data (required).</li>
+ <li><strong>avro.codec</strong> the name of the compression codec
+ used to compress blocks, as a string. Implementations
are required to support the following codecs: "null" and "deflate".
If codec is absent, it is assumed to be "null". The codecs
are described with more detail below.</li>
@@ -645,16 +636,16 @@
<p>A file data block consists of:</p>
<ul>
- <li>A long indicating the count of objects in this block.</li>
- <li>A long indicating the size in bytes of the serialized objects
- in the current block, after any codec is applied</li>
- <li>The serialized objects. If a codec is specified, this is
- compressed by that codec.</li>
- <li>The file's 16-byte sync marker.</li>
+ <li>A long indicating the count of objects in this block.</li>
+ <li>A long indicating the size in bytes of the serialized objects
+ in the current block, after any codec is applied.</li>
+ <li>The serialized objects. If a codec is specified, this is
+ compressed by that codec.</li>
+ <li>The file's 16-byte sync marker.</li>
</ul>
- <p>Thus, each block's binary data can be efficiently extracted or skipped without
- deserializing the contents. The combination of block size, object counts, and
- sync markers enable detection of corrupt blocks and help ensure data integrity.</p>
+ <p>Thus, each block's binary data can be efficiently extracted or skipped without
+ deserializing the contents. The combination of block size, object counts, and
+ sync markers enables detection of corrupt blocks and helps ensure data integrity.</p>
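As a rough sketch of the layout just described (illustrative only — it uses a fixed all-zero sync marker and no codec, where real files use a random marker):

```python
def encode_long(n: int) -> bytes:
    # zig-zag, then base-128 varint
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

MAGIC = b"Obj" + bytes([1])  # header starts: ASCII 'O', 'b', 'j', then the byte 1

def file_data_block(objects: bytes, count: int, sync: bytes) -> bytes:
    """Object count, byte size after any codec, the data, then the sync marker."""
    assert len(sync) == 16
    return encode_long(count) + encode_long(len(objects)) + objects + sync

sync = bytes(16)
block = file_data_block(b"\x06\x36", count=2, sync=sync)  # two longs: 3 and 27
print(block[:4].hex(" "))  # 04 04 06 36
```

Because the count and size come first, a reader can skip the whole block by reading two longs and seeking past `size + 16` bytes.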
<section>
<title>Required Codecs</title>
<section>
@@ -683,42 +674,44 @@
<p>A protocol is a JSON object with the following attributes:</p>
<ul>
- <li><em>protocol</em>, a string, the name of the protocol
- (required);</li>
- <li><em>namespace</em>, an optional string that qualifies the name;</li>
- <li><em>doc</em>, an optional string describing this protocol;</li>
- <li><em>types</em>, an optional list of definitions of named types
- (records, enums, fixed and errors). An error definition is
- just like a record definition except it uses "error" instead
- of "record". Note that forward references to named types
- are not permitted.</li>
- <li><em>messages</em>, an optional JSON object whose keys are
- message names and whose values are objects whose attributes
- are described below. No two messages may have the same
- name.</li>
+ <li><em>protocol</em>, a string, the name of the protocol
+ (required);</li>
+ <li><em>namespace</em>, an optional string that qualifies the name;</li>
+ <li><em>doc</em>, an optional string describing this protocol;</li>
+ <li><em>types</em>, an optional list of definitions of named types
+ (records, enums, fixed and errors). An error definition is
+ just like a record definition except it uses "error" instead
+ of "record". Note that forward references to named types
+ are not permitted.</li>
+ <li><em>messages</em>, an optional JSON object whose keys are
+ message names and whose values are objects whose attributes
+ are described below. No two messages may have the same
+ name.</li>
</ul>
+ <p>The name and namespace qualification rules defined for schema objects
+ apply to protocols as well.</p>
<section>
- <title>Messages</title>
- <p>A message has attributes:</p>
- <ul>
+ <title>Messages</title>
+ <p>A message has attributes:</p>
+ <ul>
<li>a <em>doc</em>, an optional description of the message,</li>
- <li>a <em>request</em>, a list of named,
- typed <em>parameter</em> schemas (this has the same form
- as the fields of a record declaration);</li>
- <li>a <em>response</em> schema; and</li>
- <li>an optional union of <em>error</em> schemas.</li>
- </ul>
- <p>A request parameter list is processed equivalently to an
- anonymous record. Since record field lists may vary between
- reader and writer, request parameters may also differ
- between the caller and responder, and such differences are
- resolved in the same manner as record field differences.</p>
+ <li>a <em>request</em>, a list of named,
+ typed <em>parameter</em> schemas (this has the same form
+ as the fields of a record declaration);</li>
+ <li>a <em>response</em> schema; and</li>
+ <li>an optional union of <em>error</em> schemas.</li>
+ </ul>
+ <p>A request parameter list is processed equivalently to an
+ anonymous record. Since record field lists may vary between
+ reader and writer, request parameters may also differ
+ between the caller and responder, and such differences are
+ resolved in the same manner as record field differences.</p>
</section>
<section>
- <title>Sample Protocol</title>
- <p>For example, one may define a simple HelloWorld protocol with:</p>
- <source>
+ <title>Sample Protocol</title>
+ <p>For example, one may define a simple HelloWorld protocol with:</p>
+ <source>
{
"namespace": "com.acme",
"protocol": "HelloWorld",
@@ -740,7 +733,7 @@
}
}
}
- </source>
+ </source>
</section>
</section>
@@ -748,101 +741,101 @@
<title>Protocol Wire Format</title>
<section>
- <title>Message Transport</title>
- <p>Messages may be transmitted via
- different <em>transport</em> mechanisms.</p>
+ <title>Message Transport</title>
+ <p>Messages may be transmitted via
+ different <em>transport</em> mechanisms.</p>
- <p>To the transport, a <em>message</em> is an opaque byte sequence.</p>
+ <p>To the transport, a <em>message</em> is an opaque byte sequence.</p>
- <p>A transport is a system that supports:</p>
- <ul>
- <li><strong>transmission of request messages</strong>
- </li>
- <li><strong>receipt of corresponding response messages</strong>
- <p>Servers will send a response message back to the client
- corresponding to each request message. The mechanism of
- that correspondance is transport-specific. For example,
- in HTTP it might be implicit, since HTTP directly supports
- requests and responses. But a transport that multiplexes
- many client threads over a single socket would need to tag
- messages with unique identifiers.</p>
- </li>
- </ul>
+ <p>A transport is a system that supports:</p>
+ <ul>
+ <li><strong>transmission of request messages</strong>
+ </li>
+ <li><strong>receipt of corresponding response messages</strong>
+ <p>Servers will send a response message back to the client
+ corresponding to each request message. The mechanism of
+ that correspondence is transport-specific. For example,
+ in HTTP it might be implicit, since HTTP directly supports
+ requests and responses. But a transport that multiplexes
+ many client threads over a single socket would need to tag
+ messages with unique identifiers.</p>
+ </li>
+ </ul>
- <section>
- <title>HTTP as Transport</title>
- <p>When
- <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">HTTP</a>
- is used as a transport, each Avro message exchange is an
- HTTP request/response pair. All messages of an Avro
- protocol should share a single URL at an HTTP server.
- Other protocols may also use that URL. Both normal and
- error Avro response messages should use the 200 (OK)
- response code. The chunked encoding may be used for
- requests and responses, but, regardless the Avro request
- and response are the entire content of an HTTP request and
- response. The HTTP Content-Type of requests and responses
- should be specified as "avro/binary". Requests should be
- made using the POST method.</p>
- </section>
+ <section>
+ <title>HTTP as Transport</title>
+ <p>When
+ <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">HTTP</a>
+ is used as a transport, each Avro message exchange is an
+ HTTP request/response pair. All messages of an Avro
+ protocol should share a single URL at an HTTP server.
+ Other protocols may also use that URL. Both normal and
+ error Avro response messages should use the 200 (OK)
+ response code. The chunked encoding may be used for
+ requests and responses, but, regardless, the Avro request
+ and response are the entire content of an HTTP request and
+ response. The HTTP Content-Type of requests and responses
+ should be specified as "avro/binary". Requests should be
+ made using the POST method.</p>
+ </section>
</section>
<section>
- <title>Message Framing</title>
- <p>Avro messages are <em>framed</em> as a list of buffers.</p>
- <p>Framing is a layer between messages and the transport.
- It exists to optimize certain operations.</p>
+ <title>Message Framing</title>
+ <p>Avro messages are <em>framed</em> as a list of buffers.</p>
+ <p>Framing is a layer between messages and the transport.
+ It exists to optimize certain operations.</p>
- <p>The format of framed message data is:</p>
- <ul>
- <li>a series of <em>buffers</em>, where each buffer consists of:
- <ul>
- <li>a four-byte, big-endian <em>buffer length</em>, followed by</li>
- <li>that many bytes of <em>buffer data</em>.</li>
- </ul>
- </li>
- <li>A message is always terminated by a zero-lenghted buffer.</li>
- </ul>
+ <p>The format of framed message data is:</p>
+ <ul>
+ <li>a series of <em>buffers</em>, where each buffer consists of:
+ <ul>
+ <li>a four-byte, big-endian <em>buffer length</em>, followed by</li>
+ <li>that many bytes of <em>buffer data</em>.</li>
+ </ul>
+ </li>
+ <li>A message is always terminated by a zero-length buffer.</li>
+ </ul>
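The framing format above — length-prefixed buffers ending in an empty buffer — can be sketched as follows (an illustrative writer; the 8 KB default buffer size is an assumption, not taken from the spec):

```python
import struct

def frame_message(message: bytes, buffer_size: int = 8192) -> bytes:
    """Split a message into buffers, each prefixed by a four-byte
    big-endian length, terminated by a zero-length buffer."""
    out = bytearray()
    for i in range(0, len(message), buffer_size):
        chunk = message[i:i + buffer_size]
        out += struct.pack(">I", len(chunk)) + chunk
    out += struct.pack(">I", 0)  # zero-length terminating buffer
    return bytes(out)

print(frame_message(b"hi").hex(" "))  # 00 00 00 02 68 69 00 00 00 00
```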
- <p>Framing is transparent to request and response message
- formats (described below). Any message may be presented as a
- single or multiple buffers.</p>
-
- <p>Framing can permit readers to more efficiently get
- different buffers from different sources and for writers to
- more efficiently store different buffers to different
- destinations. In particular, it can reduce the number of
- times large binary objects are copied. For example, if an RPC
- parameter consists of a megabyte of file data, that data can
- be copied directly to a socket from a file descriptor, and, on
- the other end, it could be written directly to a file
- descriptor, never entering user space.</p>
-
- <p>A simple, recommended, framing policy is for writers to
- create a new segment whenever a single binary object is
- written that is larger than a normal output buffer. Small
- objects are then appended in buffers, while larger objects are
- written as their own buffers. When a reader then tries to
- read a large object the runtime can hand it an entire buffer
- directly, without having to copy it.</p>
+ <p>Framing is transparent to request and response message
+ formats (described below). Any message may be presented as a
+ single or multiple buffers.</p>
+
+ <p>Framing can permit readers to more efficiently get
+ different buffers from different sources and for writers to
+ more efficiently store different buffers to different
+ destinations. In particular, it can reduce the number of
+ times large binary objects are copied. For example, if an RPC
+ parameter consists of a megabyte of file data, that data can
+ be copied directly to a socket from a file descriptor, and, on
+ the other end, it could be written directly to a file
+ descriptor, never entering user space.</p>
+
+ <p>A simple, recommended, framing policy is for writers to
+ create a new segment whenever a single binary object is
+ written that is larger than a normal output buffer. Small
+ objects are then appended in buffers, while larger objects are
+ written as their own buffers. When a reader then tries to
+ read a large object the runtime can hand it an entire buffer
+ directly, without having to copy it.</p>
</section>
<section>
- <title>Handshake</title>
+ <title>Handshake</title>
- <p>RPC requests and responses are prefixed by handshakes. The
- purpose of the handshake is to ensure that the client and the
- server have each other's protocol definition, so that the
- client can correctly deserialize responses, and the server can
- correctly deserialize requests. Both clients and servers
- should maintain a cache of recently seen protocols, so that,
- in most cases, a handshake will be completed without extra
- round-trip network exchanges or the transmission of full
- protocol text.</p>
+ <p>RPC requests and responses are prefixed by handshakes. The
+ purpose of the handshake is to ensure that the client and the
+ server have each other's protocol definition, so that the
+ client can correctly deserialize responses, and the server can
+ correctly deserialize requests. Both clients and servers
+ should maintain a cache of recently seen protocols, so that,
+ in most cases, a handshake will be completed without extra
+ round-trip network exchanges or the transmission of full
+ protocol text.</p>
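Computing the 128-bit MD5 hash exchanged in the handshake is straightforward; a sketch in Python (the sample protocol text here is a hypothetical stand-in):

```python
import hashlib
import json

def protocol_hash(protocol_json_text: str) -> bytes:
    # clientHash/serverHash are 128-bit MD5 digests of the JSON protocol text
    return hashlib.md5(protocol_json_text.encode("utf-8")).digest()

protocol = json.dumps({"namespace": "com.acme", "protocol": "HelloWorld"})
print(len(protocol_hash(protocol)))  # 16
```

Caching these 16-byte digests on both sides is what lets most exchanges skip sending the full protocol text.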
- <p>The handshake process uses the following record schemas:</p>
+ <p>The handshake process uses the following record schemas:</p>
- <source>
+ <source>
{
"type": "record",
"name": "HandshakeRequest", "namespace":"org.apache.avro.ipc",
@@ -860,7 +853,7 @@
"fields": [
{"name": "match",
"type": {"type": "enum", "name": "HandshakeMatch",
- "symbols": ["BOTH", "CLIENT", "NONE"]}},
+ "symbols": ["BOTH", "CLIENT", "NONE"]}},
{"name": "serverProtocol",
"type": ["null", "string"]},
{"name": "serverHash",
@@ -869,92 +862,92 @@
"type": ["null", {"type": "map", "values": "bytes"}]}
]
}
- </source>
+ </source>
<ul>
- <li>A client first prefixes each request with
- a <code>HandshakeRequest</code> containing just the hash of
- its protocol and of the server's protocol
- (<code>clientHash!=null, clientProtocol=null,
- serverHash!=null</code>), where the hashes are 128-bit MD5
- hashes of the JSON protocol text. If a client has never
- connected to a given server, it sends its hash as a guess of
- the server's hash, otherwise it sends the hash that it
- previously obtained from this server.</li>
-
- <li>The server responds with
- a <code>HandshakeResponse</code> containing one of:
- <ul>
- <li><code>match=BOTH, serverProtocol=null,
- serverHash=null</code> if the client sent the valid hash
- of the server's protocol and the server knows what
- protocol corresponds to the client's hash. In this case,
- the request is complete and the response data
- immediately follows the HandshakeResponse.</li>
-
- <li><code>match=CLIENT, serverProtocol!=null,
- serverHash!=null</code> if the server has previously
- seen the client's protocol, but the client sent an
- incorrect hash of the server's protocol. The request is
- complete and the response data immediately follows the
- HandshakeResponse. The client must use the returned
- protocol to process the response and should also cache
- that protocol and its hash for future interactions with
- this server.</li>
+ <li>A client first prefixes each request with
+ a <code>HandshakeRequest</code> containing just the hash of
+ its protocol and of the server's protocol
+ (<code>clientHash!=null, clientProtocol=null,
+ serverHash!=null</code>), where the hashes are 128-bit MD5
+ hashes of the JSON protocol text. If a client has never
+ connected to a given server, it sends its hash as a guess of
+ the server's hash, otherwise it sends the hash that it
+ previously obtained from this server.</li>
+
+ <li>The server responds with
+ a <code>HandshakeResponse</code> containing one of:
+ <ul>
+ <li><code>match=BOTH, serverProtocol=null,
+ serverHash=null</code> if the client sent the valid hash
+ of the server's protocol and the server knows what
+ protocol corresponds to the client's hash. In this case,
+ the request is complete and the response data
+ immediately follows the HandshakeResponse.</li>
+
+ <li><code>match=CLIENT, serverProtocol!=null,
+ serverHash!=null</code> if the server has previously
+ seen the client's protocol, but the client sent an
+ incorrect hash of the server's protocol. The request is
+ complete and the response data immediately follows the
+ HandshakeResponse. The client must use the returned
+ protocol to process the response and should also cache
+ that protocol and its hash for future interactions with
+ this server.</li>
<li><code>match=NONE</code> if the server has not
- previously seen the client's protocol.
- The <code>serverHash</code>
- and <code>serverProtocol</code> may also be non-null if
- the server's protocol hash was incorrect.
-
- <p>In this case the client must then re-submit its request
- with its protocol text (<code>clientHash!=null,
- clientProtocol!=null, serverHash!=null</code>) and the
- server should respond with a successful match
- (<code>match=BOTH, serverProtocol=null,
- serverHash=null</code>) as above.</p>
- </li>
- </ul>
- </li>
- </ul>
+ previously seen the client's protocol.
+ The <code>serverHash</code>
+ and <code>serverProtocol</code> may also be non-null if
+ the server's protocol hash was incorrect.
+
+ <p>In this case the client must then re-submit its request
+ with its protocol text (<code>clientHash!=null,
+ clientProtocol!=null, serverHash!=null</code>) and the
+ server should respond with a successful match
+ (<code>match=BOTH, serverProtocol=null,
+ serverHash=null</code>) as above.</p>
+ </li>
+ </ul>
+ </li>
+ </ul>
- <p>The <code>meta</code> field is reserved for future
- handshake enhancements.</p>
+ <p>The <code>meta</code> field is reserved for future
+ handshake enhancements.</p>
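The server's side of the three handshake outcomes above can be condensed into one decision function. This is a sketch under stated assumptions: `known_protocols`, `my_protocol`, and `my_hash` are hypothetical names for the server's protocol cache and its own protocol text and MD5 hash; real implementations also track transport state.

```python
def handshake_response(client_hash, client_protocol, server_hash,
                       known_protocols, my_protocol, my_hash):
    """Return a dict shaped like HandshakeResponse (names illustrative).

    known_protocols: cache mapping a protocol's MD5 hash -> protocol text.
    """
    if client_protocol is not None:
        known_protocols[client_hash] = client_protocol  # learn the protocol
    client_known = client_hash in known_protocols
    hash_correct = server_hash == my_hash
    if client_known and hash_correct:
        return {"match": "BOTH", "serverProtocol": None, "serverHash": None}
    if client_known:
        # client guessed our hash wrongly: return our protocol and hash
        return {"match": "CLIENT", "serverProtocol": my_protocol,
                "serverHash": my_hash}
    # client's protocol is unknown; correct its view of ours if needed
    return {"match": "NONE",
            "serverProtocol": None if hash_correct else my_protocol,
            "serverHash": None if hash_correct else my_hash}
```

After a `NONE` response the client re-sends with `clientProtocol` filled in, which the first branch then caches, yielding `BOTH` on the retry.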
</section>
<section>
- <title>Call Format</title>
- <p>A <em>call</em> consists of a request message paired with
- its resulting response or error message. Requests and
- responses contain extensible metadata, and both kinds of
- messages are framed as described above.</p>
+ <title>Call Format</title>
+ <p>A <em>call</em> consists of a request message paired with
+ its resulting response or error message. Requests and
+ responses contain extensible metadata, and both kinds of
+ messages are framed as described above.</p>
- <p>The format of a call request is:</p>
- <ul>
- <li><em>request metadata</em>, a map with values of
- type <code>bytes</code></li>
- <li>the <em>message name</em>, an Avro string,
- followed by</li>
- <li>the message <em>parameters</em>. Parameters are
- serialized according to the message's request
- declaration.</li>
- </ul>
+ <p>The format of a call request is:</p>
+ <ul>
+ <li><em>request metadata</em>, a map with values of
+ type <code>bytes</code></li>
+ <li>the <em>message name</em>, an Avro string,
+ followed by</li>
+ <li>the message <em>parameters</em>. Parameters are
+ serialized according to the message's request
+ declaration.</li>
+ </ul>
- <p>The format of a call response is:</p>
- <ul>
- <li><em>response metadata</em>, a map with values of
- type <code>bytes</code></li>
- <li>a one-byte <em>error flag</em> boolean, followed by either:
- <ul>
- <li>if the error flag is false, the message <em>response</em>,
- serialized per the message's response schema.</li>
- <li>if the error flag is true, the <em>error</em>,
- serialized per the message's error union schema.</li>
- </ul>
- </li>
- </ul>
+ <p>The format of a call response is:</p>
+ <ul>
+ <li><em>response metadata</em>, a map with values of
+ type <code>bytes</code></li>
+ <li>a one-byte <em>error flag</em> boolean, followed by either:
+ <ul>
+ <li>if the error flag is false, the message <em>response</em>,
+ serialized per the message's response schema.</li>
+ <li>if the error flag is true, the <em>error</em>,
+ serialized per the message's error union schema.</li>
+ </ul>
+ </li>
+ </ul>
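The error-flag branch in the call response layout can be sketched as below. The helper and parameter names are hypothetical; `decode_response` and `decode_error` stand in for Avro deserializers driven by the message's response schema and error union schema, respectively.

```python
def parse_response_body(flag_byte, payload, decode_response, decode_error):
    """Dispatch on the one-byte error flag that follows the metadata.

    An Avro boolean is a single byte: 0 for false, 1 for true. A false
    flag means the normal response follows; a true flag means an error
    from the message's error union follows.
    """
    if flag_byte == 0:
        return ("response", decode_response(payload))
    return ("error", decode_error(payload))
```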
</section>
</section>
@@ -963,96 +956,96 @@
<title>Schema Resolution</title>
<p>A reader of Avro data, whether from an RPC or a file, can
- always parse that data because its schema is provided. But
- that schema may not be exactly the schema that was expected.
- For example, if the data was written with a different version
- of the software than it is read, then records may have had
- fields added or removed. This section specifies how such
- schema differences should be resolved.</p>
+ always parse that data because its schema is provided. But
+ that schema may not be exactly the schema that was expected.
+ For example, if the data was written with a different version
+ of the software than is used to read it, then records may have had
+ fields added or removed. This section specifies how such
+ schema differences should be resolved.</p>
<p>We call the schema used to write the data
- the <em>writer's</em> schema, and the schema that the
- application expects the <em>reader's</em> schema. Differences
- between these should be resolved as follows:</p>
+ the <em>writer's</em> schema, and the schema that the
+ application expects the <em>reader's</em> schema. Differences
+ between these should be resolved as follows:</p>
<ul>
- <li><p>It is an error if the two schemas do not <em>match</em>.</p>
- <p>To match, one of the following must hold:</p>
- <ul>
- <li>both schemas are arrays whose item types match</li>
- <li>both schemas are maps whose value types match</li>
- <li>both schemas are enums whose names match</li>
- <li>both schemas are fixed whose sizes and names match</li>
- <li>both schemas are records with the same name</li>
- <li>either schema is a union</li>
- <li>both schemas have same primitive type</li>
- <li>the writer's schema may be <em>promoted</em> to the
- reader's as follows:
- <ul>
- <li>int is promotable to long, float, or double</li>
- <li>long is promotable to float or double</li>
- <li>float is promotable to double</li>
- </ul>
- </li>
- </ul>
- </li>
+ <li><p>It is an error if the two schemas do not <em>match</em>.</p>
+ <p>To match, one of the following must hold:</p>
+ <ul>
+ <li>both schemas are arrays whose item types match</li>
+ <li>both schemas are maps whose value types match</li>
+ <li>both schemas are enums whose names match</li>
+ <li>both schemas are fixed whose sizes and names match</li>
+ <li>both schemas are records with the same name</li>
+ <li>either schema is a union</li>
+ <li>both schemas have the same primitive type</li>
+ <li>the writer's schema may be <em>promoted</em> to the
+ reader's as follows:
+ <ul>
+ <li>int is promotable to long, float, or double</li>
+ <li>long is promotable to float or double</li>
+ <li>float is promotable to double</li>
+ </ul>
+ </li>
+ </ul>
+ </li>
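The numeric promotion rules just listed are small enough to write down as a table. This sketch covers only the primitive match/promotion case; the `PROMOTIONS` table and `primitive_match` name are illustrative, not part of the spec.

```python
# Types a writer's primitive may be promoted to, per the list above.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def primitive_match(writer, reader):
    """True if the writer's primitive type equals or promotes to the reader's."""
    return writer == reader or reader in PROMOTIONS.get(writer, set())
```

Note the promotions are one-way: a `double` writer never matches a `float` reader.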
- <li><strong>if both are records:</strong>
- <ul>
- <li>the ordering of fields may be different: fields are
+ <li><strong>if both are records:</strong>
+ <ul>
+ <li>the ordering of fields may be different: fields are
matched by name.</li>
-
- <li>schemas for fields with the same name in both records
- are resolved recursively.</li>
-
- <li>if the writer's record contains a field with a name
- not present in the reader's record, the writer's value
- for that field is ignored.</li>
-
- <li>if the reader's record schema has a field that
+
+ <li>schemas for fields with the same name in both records
+ are resolved recursively.</li>
+
+ <li>if the writer's record contains a field with a name
+ not present in the reader's record, the writer's value
+ for that field is ignored.</li>
+
+ <li>if the reader's record schema has a field that
contains a default value, and the writer's schema does not
have a field with the same name, then the reader should
use the default value from its field.</li>
- <li>if the reader's record schema has a field with no
+ <li>if the reader's record schema has a field with no
default value, and the writer's schema does not have a field
with the same name, an error is signalled.</li>
- </ul>
- </li>
+ </ul>
+ </li>
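The record rules above (match by name, ignore writer-only fields, fill reader-only fields from defaults, error on a missing default) can be sketched as one loop. The names `NO_DEFAULT`, `resolve_record`, and the dict shapes are hypothetical; recursive resolution of matching fields' schemas is elided.

```python
NO_DEFAULT = object()  # marker: the reader's field declares no default


def resolve_record(writer_values, reader_defaults):
    """Resolve one record per the rules above (names illustrative).

    writer_values:   dict of field name -> value from the writer's data.
    reader_defaults: dict of reader field name -> default, or NO_DEFAULT.
    """
    result = {}
    for name, default in reader_defaults.items():
        if name in writer_values:
            # fields matched by name; per-field schemas would be
            # resolved recursively here
            result[name] = writer_values[name]
        elif default is not NO_DEFAULT:
            result[name] = default          # reader-only field: use default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result
    # writer-only fields are simply never consulted, i.e. ignored
```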
- <li><strong>if both are enums:</strong>
- <p>if the writer's symbol is not present in the reader's
- enum, then an error is signalled.</p>
- </li>
-
- <li><strong>if both are arrays:</strong>
- <p>This resolution algorithm is applied recursively to the reader's and
- writer's array item schemas.</p>
- </li>
-
- <li><strong>if both are maps:</strong>
- <p>This resolution algorithm is applied recursively to the reader's and
- writer's value schemas.</p>
- </li>
-
- <li><strong>if both are unions:</strong>
- <p>The first schema in the reader's union that matches the
- selected writer's union schema is recursively resolved
- against it. if none match, an error is signalled.</p>
- </li>
-
- <li><strong>if reader's is a union, but writer's is not</strong>
- <p>The first schema in the reader's union that matches the
- writer's schema is recursively resolved against it. If none
- match, an error is signalled.</p>
- </li>
-
- <li><strong>if writer's is a union, but reader's is not</strong>
- <p>If the reader's schema matches the selected writer's schema,
- it is recursively resolved against it. If they do not
- match, an error is signalled.</p>
- </li>
-
+ <li><strong>if both are enums:</strong>
+ <p>If the writer's symbol is not present in the reader's
+ enum, then an error is signalled.</p>
+ </li>
+
+ <li><strong>if both are arrays:</strong>
+ <p>This resolution algorithm is applied recursively to the reader's and
+ writer's array item schemas.</p>
+ </li>
+
+ <li><strong>if both are maps:</strong>
+ <p>This resolution algorithm is applied recursively to the reader's and
+ writer's value schemas.</p>
+ </li>
+
+ <li><strong>if both are unions:</strong>
+ <p>The first schema in the reader's union that matches the
+ selected writer's union schema is recursively resolved
+ against it. If none match, an error is signalled.</p>
+ </li>
+
+ <li><strong>if reader's is a union, but writer's is not:</strong>
+ <p>The first schema in the reader's union that matches the
+ writer's schema is recursively resolved against it. If none
+ match, an error is signalled.</p>
+ </li>
+
+ <li><strong>if writer's is a union, but reader's is not:</strong>
+ <p>If the reader's schema matches the selected writer's schema,
+ it is recursively resolved against it. If they do not
+ match, an error is signalled.</p>
+ </li>
+
</ul>
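The three union cases above share one step: scan the reader's branches in order and take the first that matches. A minimal sketch, where `matches` stands in for the recursive match test defined earlier in this section:

```python
def resolve_union(reader_branches, writer_schema, matches):
    """Pick the first branch of the reader's union that matches the
    writer's (selected) schema; raise if none does.

    matches(writer, reader) is a placeholder for the recursive
    schema-match predicate from the rules above.
    """
    for branch in reader_branches:
        if matches(writer_schema, branch):
            return branch   # this branch is then resolved recursively
    raise ValueError("no branch of the reader's union matches")
```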
<p>A schema's "doc" fields are ignored for the purposes of schema resolution. Hence,