Posted to commits@avro.apache.org by cu...@apache.org on 2009/08/28 18:31:20 UTC
svn commit: r808946 - in /hadoop/avro/trunk: CHANGES.txt
src/doc/content/xdocs/spec.xml
Author: cutting
Date: Fri Aug 28 16:31:19 2009
New Revision: 808946
URL: http://svn.apache.org/viewvc?rev=808946&view=rev
Log:
AVRO-92. Describe JSON data encoding in specification.
Modified:
hadoop/avro/trunk/CHANGES.txt
hadoop/avro/trunk/src/doc/content/xdocs/spec.xml
Modified: hadoop/avro/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/avro/trunk/CHANGES.txt?rev=808946&r1=808945&r2=808946&view=diff
==============================================================================
--- hadoop/avro/trunk/CHANGES.txt (original)
+++ hadoop/avro/trunk/CHANGES.txt Fri Aug 28 16:31:19 2009
@@ -14,6 +14,9 @@
AVRO-104. Permit null fields in Java reflection.
(Eelco Hillenius via cutting)
+ AVRO-92. Describe JSON data encoding in specification
+ document. (cutting)
+
IMPROVEMENTS
AVRO-71. C++: make deserializer more generic. (Scott Banachowski
Modified: hadoop/avro/trunk/src/doc/content/xdocs/spec.xml
URL: http://svn.apache.org/viewvc/hadoop/avro/trunk/src/doc/content/xdocs/spec.xml?rev=808946&r1=808945&r2=808946&view=diff
==============================================================================
--- hadoop/avro/trunk/src/doc/content/xdocs/spec.xml (original)
+++ hadoop/avro/trunk/src/doc/content/xdocs/spec.xml Fri Aug 28 16:31:19 2009
@@ -79,7 +79,7 @@
<p>Avro supports six kinds of complex types: records, enums,
arrays, maps, unions and fixed.</p>
- <section>
+ <section id="schema_record">
<title>Records</title>
<p>Records use the type name "record" and support two attributes:</p>
@@ -236,185 +236,241 @@
a depth-first, left-to-right traversal of the schema,
serializing primitive types as they are encountered.</p>
- <section id="serialize_primitive">
- <title>Primitive Type Serialization</title>
- <p>Primitive types are serialized as follows:</p>
- <ul>
- <li>a <code>string</code> is serialized as
- a <code>long</code> followed by that many bytes of UTF-8
- encoded character data.
- <p>For example, the three-character
- string "foo" would be serialized as 3 (encoded as
- hex <code>06</code>) followed by the UTF-8 encoding of
- 'f', 'o', and 'o' (the hex bytes <code>66 6f 6f</code>):
- </p>
- <source>06 66 6f 6f</source>
- </li>
- <li><code>bytes</code> are serialized as
- a <code>long</code> followed by that many bytes of data.
- </li>
- <li><code>int</code> and <code>long</code> values are written
- using <a href="ext:vint">variable-length</a>
- <a href="ext:zigzag">zig-zag</a> coding. Some examples:
- <table class="right">
- <tr><th>value</th><th>hex</th></tr>
- <tr><td><code> 0</code></td><td><code>00</code></td></tr>
- <tr><td><code>-1</code></td><td><code>01</code></td></tr>
- <tr><td><code> 1</code></td><td><code>02</code></td></tr>
- <tr><td><code>-2</code></td><td><code>03</code></td></tr>
- <tr><td><code> 2</code></td><td><code>04</code></td></tr>
- <tr><td colspan="2"><code>...</code></td></tr>
- <tr><td><code>-64</code></td><td><code>7f</code></td></tr>
- <tr><td><code> 64</code></td><td><code> 80 01</code></td></tr>
- <tr><td colspan="2"><code>...</code></td></tr>
- </table>
- </li>
- <li>a <code>float</code> is written as 4 bytes. The float is
- converted into a 32-bit integer using a method equivalent
- to <a href="http://java.sun.com/javase/6/docs/api/java/lang/Float.html#floatToIntBits%28float%29">Java's floatToIntBits</a> and then encoded
- in little-endian format.</li>
- <li>a <code>double</code> is written as 8 bytes. The double
- is converted into a 64-bit integer using a method equivalent
- to <a href="http://java.sun.com/javase/6/docs/api/java/lang/Double.html#doubleToLongBits%28double%29">Java's
- doubleToLongBits</a> and then encoded in little-endian
- format.</li>
- <li>a <code>boolean</code> is written as a single byte whose
- value is either <code>0</code> (false) or <code>1</code>
- (true).</li>
- <li><code>null</code> is written as zero bytes.</li>
- </ul>
-
+ <section>
+ <title>Encodings</title>
+ <p>Avro specifies two serialization encodings: binary and
+ JSON. Most applications will use the binary encoding, as it
+ is smaller and faster. But, for debugging and web-based
+ applications, the JSON encoding may sometimes be
+ appropriate.</p>
</section>
+ <section id="binary_encoding">
+ <title>Binary Encoding</title>
- <section id="serialize_complex">
- <title>Complex Type Serialization</title>
- <p>Complex types are serialized as follows:</p>
+ <section id="binary_encode_primitive">
+ <title>Primitive Types</title>
+ <p>Primitive types are encoded in binary as follows:</p>
+ <ul>
+ <li>a <code>string</code> is encoded as
+ a <code>long</code> followed by that many bytes of UTF-8
+ encoded character data.
+ <p>For example, the three-character string "foo" would
+ be encoded as the long value 3 (encoded as
+ hex <code>06</code>) followed by the UTF-8 encoding of
+ 'f', 'o', and 'o' (the hex bytes <code>66 6f
+ 6f</code>):
+ </p>
+ <source>06 66 6f 6f</source>
+ </li>
+ <li><code>bytes</code> are encoded as
+ a <code>long</code> followed by that many bytes of data.
+ </li>
+ <li><code>int</code> and <code>long</code> values are written
+ using <a href="ext:vint">variable-length</a>
+ <a href="ext:zigzag">zig-zag</a> coding. Some examples:
+ <table class="right">
+ <tr><th>value</th><th>hex</th></tr>
+ <tr><td><code> 0</code></td><td><code>00</code></td></tr>
+ <tr><td><code>-1</code></td><td><code>01</code></td></tr>
+ <tr><td><code> 1</code></td><td><code>02</code></td></tr>
+ <tr><td><code>-2</code></td><td><code>03</code></td></tr>
+ <tr><td><code> 2</code></td><td><code>04</code></td></tr>
+ <tr><td colspan="2"><code>...</code></td></tr>
+ <tr><td><code>-64</code></td><td><code>7f</code></td></tr>
+ <tr><td><code> 64</code></td><td><code> 80 01</code></td></tr>
+ <tr><td colspan="2"><code>...</code></td></tr>
+ </table>
+ </li>
+ <li>a <code>float</code> is written as 4 bytes. The float is
+ converted into a 32-bit integer using a method equivalent
+ to <a href="http://java.sun.com/javase/6/docs/api/java/lang/Float.html#floatToIntBits%28float%29">Java's floatToIntBits</a> and then encoded
+ in little-endian format.</li>
+ <li>a <code>double</code> is written as 8 bytes. The double
+ is converted into a 64-bit integer using a method equivalent
+ to <a href="http://java.sun.com/javase/6/docs/api/java/lang/Double.html#doubleToLongBits%28double%29">Java's
+ doubleToLongBits</a> and then encoded in little-endian
+ format.</li>
+ <li>a <code>boolean</code> is written as a single byte whose
+ value is either <code>0</code> (false) or <code>1</code>
+ (true).</li>
+ <li><code>null</code> is written as zero bytes.</li>
+ </ul>
- <section>
- <title>Records</title>
- <p>A record is serialized by serializing the values of its
- fields in the order that they are declared. In other words,
- a record is serialized as just the concatenation of its
- field's serializations. Field values are serialized per
- their schema.</p>
- <p>For example, the record schema</p>
- <source>
-{
- "type": "record",
- "name": "test",
- "fields" : [
- {"name": "a", "type": "long"},
- {"name": "b", "type": "string"}
- ]
-}
- </source>
- <p>An instance of this record whose <code>a</code> field has
- value 27 (encoded as hex <code>36</code>) and
- whose <code>b</code> field has value "foo" (encoded as hex
- bytes <code>OC 66 6f 6f</code>), would be serialized simply
- as the concatenation of these, namely the hex byte
- sequence:</p>
- <source>36 0C 66 6f 6f</source>
- </section>
-
- <section>
- <title>Enums</title>
- <p>An enum is serialized by a <code>int</code>, representing
- the zero-based position of the symbol in the schema.</p>
- <p>For example, consider the enum:</p>
- <source>
-{"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"] }
- </source>
- <p>This would be serialized by an <code>int</code> between
- zero and three, with zero indicating "A", and 3 indicating
- "D".</p>
</section>
- <section>
- <title>Arrays</title>
- <p>Arrays are serialized as a series of <em>blocks</em>.
- Each block consists of a <code>long</code> <em>count</em>
- value, followed by that many array items. A block with
- count zero indicates the end of the array. Each item is
- serialized per the array's item schema.</p>
-
- <p>If a block's count is negative, then the count is
- followed immediately by a <code>long</code>
- block <em>size</em>, indicating the number of bytes in the
- block. The actual count in this case is the absolute value
- of the count written.</p>
-
- <p>For example, the array schema</p>
- <source>{"type": "array", "items": "long"}</source>
- <p>serializing an array containing the items 3 and 27 could be
- serialized as 2 (encoded as hex 04) followed by 3 and 27
- (encoded as hex <code>06 36</code>) terminated by zero:</p>
- <source>04 06 36 00</source>
-
- <p>The blocked representation permits one to read and write
- arrays larger than can be buffered in memory, since one can
- start writing items without knowing the full length of the
- array. The optional block sizes permit fast skipping
- through data, e.g., when projecting a record to a subset of
- its fields.</p>
+ <section id="binary_encode_complex">
+ <title>Complex Types</title>
+ <p>Complex types are encoded in binary as follows:</p>
+
+ <section>
+ <title>Records</title>
+ <p>A record is encoded by encoding the values of its
+ fields in the order that they are declared. In other
+ words, a record is encoded as just the concatenation of
+ its fields' encodings. Field values are encoded per
+ their schema.</p>
+ <p>For example, the record schema</p>
+ <source>
+ {
+ "type": "record",
+ "name": "test",
+ "fields" : [
+ {"name": "a", "type": "long"},
+ {"name": "b", "type": "string"}
+ ]
+ }
+ </source>
+ <p>An instance of this record whose <code>a</code> field has
+ value 27 (encoded as hex <code>36</code>) and
+ whose <code>b</code> field has value "foo" (encoded as hex
+ bytes <code>06 66 6f 6f</code>), would be encoded simply
+ as the concatenation of these, namely the hex byte
+ sequence:</p>
+ <source>36 06 66 6f 6f</source>
+ </section>
+
+ <section>
+ <title>Enums</title>
+ <p>An enum is encoded by an <code>int</code>, representing
+ the zero-based position of the symbol in the schema.</p>
+ <p>For example, consider the enum:</p>
+ <source>
+ {"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"] }
+ </source>
+ <p>This would be encoded by an <code>int</code> between
+ zero and three, with zero indicating "A", and 3 indicating
+ "D".</p>
+ </section>
+
+
+ <section>
+ <title>Arrays</title>
+ <p>Arrays are encoded as a series of <em>blocks</em>.
+ Each block consists of a <code>long</code> <em>count</em>
+ value, followed by that many array items. A block with
+ count zero indicates the end of the array. Each item is
+ encoded per the array's item schema.</p>
+
+ <p>If a block's count is negative, then the count is
+ followed immediately by a <code>long</code>
+ block <em>size</em>, indicating the number of bytes in the
+ block. The actual count in this case is the absolute value
+ of the count written.</p>
+
+ <p>For example, given the array schema</p>
+ <source>{"type": "array", "items": "long"}</source>
+ <p>an array containing the items 3 and 27 could be encoded
+ as the long value 2 (encoded as hex 04) followed by long
+ values 3 and 27 (encoded as hex <code>06 36</code>)
+ terminated by zero:</p>
+ <source>04 06 36 00</source>
+
+ <p>The blocked representation permits one to read and write
+ arrays larger than can be buffered in memory, since one can
+ start writing items without knowing the full length of the
+ array. The optional block sizes permit fast skipping
+ through data, e.g., when projecting a record to a subset of
+ its fields.</p>
+
+ </section>
+
+ <section>
+ <title>Maps</title>
+ <p>Maps are encoded as a series of <em>blocks</em>. Each
+ block consists of a <code>long</code> <em>count</em>
+ value, followed by that many key/value pairs. A block
+ with count zero indicates the end of the map. Each item
+ is encoded per the map's value schema.</p>
+
+ <p>If a block's count is negative, then the count is
+ followed immediately by a <code>long</code>
+ block <em>size</em>, indicating the number of bytes in the
+ block. The actual count in this case is the absolute value
+ of the count written.</p>
+
+ <p>The blocked representation permits one to read and write
+ maps larger than can be buffered in memory, since one can
+ start writing items without knowing the full length of the
+ map. The optional block sizes permit fast skipping through
+ data, e.g., when projecting a record to a subset of its
+ fields.</p>
+
+ <p><em>NOTE: Blocking has not yet been fully implemented and
+ may change. Arbitrarily large objects must be easily
+ writable and readable, but until we have proven this with an
+ implementation and tests, this part of the specification
+ should be considered draft.</em></p>
+ </section>
+
+ <section>
+ <title>Unions</title>
+ <p>A union is encoded by first writing a <code>long</code>
+ value indicating the zero-based position within the
+ union of the schema of its value. The value is then
+ encoded per the indicated schema within the union.</p>
+ <p>For example, the union
+ schema <code>["string","null"]</code> would encode:</p>
+ <ul>
+ <li><code>null</code> as the integer 1 (the index of
+ "null" in the union, encoded as
+ hex <code>02</code>): <source>02</source></li>
+ <li>the string <code>"a"</code> as zero (the index of
+ "string" in the union), followed by the serialized string:
+ <source>00 02 61</source></li>
+ </ul>
+ </section>
+
+ <section>
+ <title>Fixed</title>
+ <p>Fixed instances are encoded using the number of bytes
+ declared in the schema.</p>
+ </section>
- </section>
+ </section> <!-- end complex types -->
- <section>
- <title>Maps</title>
- <p>Maps are serialized as a series of <em>blocks</em>. Each
- block consists of a <code>long</code> <em>count</em> value,
- followed by that many key/value pairs. A block with count
- zero indicates the end of the map. Each item is serialized
- per the map's value schema.</p>
-
- <p>If a block's count is negative, then the count is
- followed immediately by a <code>long</code>
- block <em>size</em>, indicating the number of bytes in the
- block. The actual count in this case is the absolute value
- of the count written.</p>
-
- <p>The blocked representation permits one to read and write
- maps larger than can be buffered in memory, since one can
- start writing items without knowing the full length of the
- map. The optional block sizes permit fast skipping through
- data, e.g., when projecting a record to a subset of its
- fields.</p>
-
- <p><em>NOTE: Blocking has not yet been fully implemented and
- may change. Arbitrarily large objects must be easily
- writable and readable but until we have proven this with an
- implementation and tests this part of the specification
- should be considered draft.</em></p>
- </section>
+ </section>
- <section>
- <title>Unions</title>
- <p>A union is serialized by first writing
- a <code>long</code> value indicating the zero-based
- position within the union of the schema of its value. The
- value is then serialized per the indicated schema within
- the union.</p>
- <p>For example, the union
- schema <code>["string","null"]</code> would serialize:</p>
- <ul>
- <li><code>null</code> as 1 (the index of "null" in the
- union, encoded as hex <code>02</code>): <source>02</source></li>
- <li>the string <code>"a"</code> as zero (the index of
- "string" in the union), followed by the serialized string:
- <source>00 02 61</source></li>
- </ul>
- </section>
+ <section id="json_encoding">
+ <title>JSON Encoding</title>
+
+ <p>Except for unions, the JSON encoding is the same as is used
+ to encode <a href="#schema_record">field default
+ values</a>.</p>
- <section>
- <title>Fixed</title>
- <p>Fixed instances are serialized using the number of bytes
- declared in the schema.</p>
- </section>
+ <p>The value of a union is encoded in JSON as follows:</p>
- </section> <!-- end complex types -->
+ <ul>
+ <li>if its type is <code>null</code>, then it is encoded as
+ a JSON null;</li>
+ <li>otherwise it is encoded as a JSON object with one
+ name/value pair whose name is the type's name and whose
+ value is the recursively encoded value. For Avro's named
+ types (record, fixed or enum) the user-specified name is
+ used; for other types the type name is used.</li>
+ </ul>
+
+ <p>For example, the union
+ schema <code>["null","string","Foo"]</code>, where Foo is a
+ record name, would encode:</p>
+ <ul>
+ <li><code>null</code> as <code>null</code>;</li>
+ <li>the string <code>"a"</code> as
+ <code>{"string": "a"}</code>; and</li>
+ <li>a Foo instance as <code>{"Foo": {...}}</code>,
+ where <code>{...}</code> indicates the JSON encoding of a
+ Foo instance.</li>
+ </ul>
+
+ <p>Note that a schema is still required to correctly process
+ JSON-encoded data. For example, the JSON encoding does not
+ distinguish between <code>int</code>
+ and <code>long</code>, <code>float</code>
+ and <code>double</code>, records and maps, enums and strings,
+ etc.</p>
+
+ </section>
</section>
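
The encoding rules described in the patch above can be sketched in a few lines of Python. This is only an illustration written for this message, not code from the Avro source tree: the function names (encode_long, encode_string, encode_long_array, encode_union_binary, encode_union_json) are hypothetical, the array encoder assumes a single non-negative-count block with no block sizes, and the JSON helper assumes the caller has already JSON-encoded the branch value.

```python
import json


def encode_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit value, then write it as a variable-length integer."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value out of range for a 64-bit long")
    # Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is a long byte count followed by that many UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_long_array(items) -> bytes:
    """Encode an array of longs as one block (assumption: no block sizes), ended by a zero count."""
    items = list(items)
    out = bytearray()
    if items:
        out += encode_long(len(items))
        for item in items:
            out += encode_long(item)
    out += encode_long(0)  # a block with count zero ends the array
    return bytes(out)


def encode_union_binary(branch_index: int, encoded_value: bytes) -> bytes:
    """A union is the zero-based branch position followed by the value encoded per that branch."""
    return encode_long(branch_index) + encoded_value


def encode_union_json(type_name: str, json_value):
    """JSON form of a union value: null stays null, anything else is wrapped as {type name: value}."""
    if type_name == "null":
        return None
    return {type_name: json_value}


if __name__ == "__main__":
    print(encode_string("foo").hex(" "))                          # 06 66 6f 6f
    print(encode_long_array([3, 27]).hex(" "))                    # 04 06 36 00
    print(encode_union_binary(1, b"").hex(" "))                   # 02
    print(encode_union_binary(0, encode_string("a")).hex(" "))    # 00 02 61
    print(json.dumps(encode_union_json("string", "a")))           # {"string": "a"}
    print(json.dumps(encode_union_json("null", None)))            # null
```

The printed values match the hex examples given for strings, arrays, and unions in the patch. Block sizes (negative counts) and maps are left out to keep the sketch short; bytes.hex with a separator needs Python 3.8 or later.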