Posted to commits@avro.apache.org by dk...@apache.org on 2018/11/02 19:31:26 UTC

[avro] branch master updated: Clarify importance of writer's schema in documentation (master) (#91)

This is an automated email from the ASF dual-hosted git repository.

dkulp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/avro.git


The following commit(s) were added to refs/heads/master by this push:
     new 9855529  Clarify importance of writer's schema in documentation (master) (#91)
9855529 is described below

commit 9855529f1af979ade479d5658bbe90917ae2734b
Author: Shannon Carey <re...@gmail.com>
AuthorDate: Fri Nov 2 14:31:19 2018 -0500

    Clarify importance of writer's schema in documentation (master) (#91)
    
    * Clarify importance of writer's schema in documentation
    
    * Rewording according to code review comments
    
    * Clarify that it's not necessary for the schema used to read data to have an identical Parsing Canonical Form to the schema used to serialize the data.
    However, it is a recommended and generally accepted practice.
---
 doc/src/content/xdocs/gettingstartedjava.xml | 24 ++++++++-
 doc/src/content/xdocs/spec.xml               | 75 ++++++++++++++++++----------
 2 files changed, 72 insertions(+), 27 deletions(-)

diff --git a/doc/src/content/xdocs/gettingstartedjava.xml b/doc/src/content/xdocs/gettingstartedjava.xml
index 7f331e3..900e1eb 100644
--- a/doc/src/content/xdocs/gettingstartedjava.xml
+++ b/doc/src/content/xdocs/gettingstartedjava.xml
@@ -288,7 +288,17 @@ System.out.println(user);
           class, in this case <code>User</code>.  We pass the
           <code>DatumReader</code> and the previously created <code>File</code>
           to a <code>DataFileReader</code>, analogous to the
-          <code>DataFileWriter</code>, which reads the data file on disk.
+          <code>DataFileWriter</code>, which reads both the schema used by the
+          writer and the data from the file on disk. The data will be
+          read using the writer's schema included in the file and the
+          schema provided by the reader, in this case the <code>User</code>
+          class.  The writer's schema is needed to know the order in which
+          fields were written, while the reader's schema is needed to know what
+          fields are expected and how to fill in default values for fields
+          added since the file was written.  If there are differences between
+          the two schemas, they are resolved according to the
+          <a href="spec.html#Schema+Resolution">Schema Resolution</a>
+          specification.
         </p>
         <p>
           Next we use the <code>DataFileReader</code> to iterate through the
@@ -477,7 +487,17 @@ System.out.println(user);
           converts in-memory serialized items into <code>GenericRecords</code>.
           We pass the <code>DatumReader</code> and the previously created
           <code>File</code> to a <code>DataFileReader</code>, analogous to the
-          <code>DataFileWriter</code>, which reads the data file on disk.
+          <code>DataFileWriter</code>, which reads both the schema used by the
+          writer and the data from the file on disk. The data will be
+          read using the writer's schema included in the file, and the reader's
+          schema provided to the <code>GenericDatumReader</code>.  The writer's
+          schema is needed to know the order in which fields were written,
+          while the reader's schema is needed to know what fields are expected
+          and how to fill in default values for fields added since the file
+          was written.  If there are differences between the two schemas, they
+          are resolved according to the
+          <a href="spec.html#Schema+Resolution">Schema Resolution</a>
+          specification.
         </p>
         <p>
           Next, we use the <code>DataFileReader</code> to iterate through the
diff --git a/doc/src/content/xdocs/spec.xml b/doc/src/content/xdocs/spec.xml
index fd780f9..0c3ff0b 100644
--- a/doc/src/content/xdocs/spec.xml
+++ b/doc/src/content/xdocs/spec.xml
@@ -332,21 +332,41 @@
     </section> <!-- end schemas -->
 
     <section>
-      <title>Data Serialization</title>
-
-      <p>Avro data is always serialized with its schema.  Files that
-	store Avro data should always also include the schema for that
-	data in the same file.  Avro-based remote procedure call (RPC)
-	systems must also guarantee that remote recipients of data
-	have a copy of the schema used to write that data.</p>
-
-      <p>Because the schema used to write data is always available
-	when the data is read, Avro data itself is not tagged with
-	type information.  The schema is required to parse data.</p>
-
-      <p>In general, both serialization and deserialization proceed as
-      a depth-first, left-to-right traversal of the schema,
-      serializing primitive types as they are encountered.</p>
+      <title>Data Serialization and Deserialization</title>
+
+      <p>Binary encoded Avro data does not include type information or
+      field names.  The benefit is that the serialized data is small, but
+      as a result a schema must always be used in order to read Avro data
+      correctly.  The best way to ensure that the schema is structurally
+      identical to the one used to write the data is to use the exact same
+      schema.</p>
+
+      <p>Therefore, files or systems that store Avro data should always
+      include the writer's schema for that data.  Avro-based remote procedure
+      call (RPC) systems must also guarantee that remote recipients of data
+      have a copy of the schema used to write that data.  In general, it is
+      advisable that any reader of Avro data should use a schema that is
+      the same (as defined more fully in
+      <a href="#Parsing+Canonical+Form+for+Schemas">Parsing Canonical Form for
+      Schemas</a>) as the schema that was used to write the data in order to
+      deserialize it correctly. Deserializing data into a newer schema is
+      accomplished by specifying an additional schema, the results of which are
+      described in <a href="#Schema+Resolution">Schema Resolution</a>.</p>
+
+      <p>In general, both serialization and deserialization proceed as a
+      depth-first, left-to-right traversal of the schema, serializing or
+      deserializing primitive types as they are encountered. Therefore, it is
+      possible, though not advisable, to read Avro data with a schema that
+      does not have the same Parsing Canonical Form as the schema with which
+      the data was written. In order for this to work, the serialized primitive
+      values must be compatible, in order value by value, with the items in the
+      deserialization schema. For example, int and long are always serialized
+      the same way, so an int could be deserialized as a long.  Since the
+      compatibility of two schemas depends on both the data and the
+      serialization format (e.g. binary is more permissive than JSON because
+      JSON includes field names, and a long that is too large will overflow an
+      int), it is simpler and more reliable to use schemas with identical
+      Parsing Canonical Form.</p>
 
       <section>
 	<title>Encodings</title>
@@ -359,6 +379,10 @@
 
       <section id="binary_encoding">
         <title>Binary Encoding</title>
+        <p>Binary encoding does not include field names, self-contained
+          information about the types of individual bytes, nor field or
+          record separators. Therefore readers are wholly reliant on
+          the schema used when the data was encoded.</p>
 
 	<section id="binary_encode_primitive">
           <title>Primitive Types</title>
@@ -566,8 +590,8 @@
           Foo instance.</li>
         </ul>
 
-        <p>Note that a schema is still required to correctly process
-        JSON-encoded data.  For example, the JSON encoding does not
+        <p>Note that the original schema is still required to correctly
+        process JSON-encoded data.  For example, the JSON encoding does not
         distinguish between <code>int</code>
         and <code>long</code>, <code>float</code>
         and <code>double</code>, records and maps, enums and strings,
@@ -1086,14 +1110,15 @@
       <title>Schema Resolution</title>
 
       <p>A reader of Avro data, whether from an RPC or a file, can
-        always parse that data because its schema is provided.  But
-        that schema may not be exactly the schema that was expected.
+        always parse that data because the original schema must be
+        provided along with the data.  However, the reader may be
+        programmed to read data into a different schema.
         For example, if the data was written with a different version
-        of the software than it is read, then records may have had
-        fields added or removed.  This section specifies how such
+        of the software than it is read, then fields may have been
+        added or removed from records.  This section specifies how such
         schema differences should be resolved.</p>
 
-      <p>We call the schema used to write the data as
+      <p>We refer to the schema used to write the data as
         the <em>writer's</em> schema, and the schema that the
         application expects the <em>reader's</em> schema.  Differences
         between these should be resolved as follows:</p>
@@ -1190,8 +1215,8 @@
       <title>Parsing Canonical Form for Schemas</title>
 
       <p>One of the defining characteristics of Avro is that a reader
-      is assumed to have the "same" schema used by the writer of the
-      data the reader is reading.  This assumption leads to a data
+      must use the schema used by the writer of the data in
+      order to know how to read the data.  This assumption results in a data
       format that's compact and also amenable to many forms of schema
       evolution.  However, the specification so far has not defined
       what it means for the reader to have the "same" schema as the
@@ -1205,7 +1230,7 @@
       <p><em>Parsing Canonical Form</em> is a transformation of a
       writer's schema that let's us define what it means for two
       schemas to be "the same" for the purpose of reading data written
-      agains the schema.  It is called <em>Parsing</em> Canonical Form
+      against the schema.  It is called <em>Parsing</em> Canonical Form
       because the transformations strip away parts of the schema, like
       "doc" attributes, that are irrelevant to readers trying to parse
       incoming data.  It is called <em>Canonical Form</em> because the
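
Illustration (not part of the commit): the new spec.xml text states that int
and long are always serialized the same way, so an int could be deserialized
as a long. A minimal Java sketch of that claim follows, assuming the
variable-length zig-zag coding that the Avro specification defines for both
types. The class and method names (ZigZagDemo, encodeInt, encodeLong,
decodeLong) are hypothetical and are not Avro's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical demo class; it does not use or reproduce Avro's own encoder classes.
public class ZigZagDemo {

    // Encode a 32-bit value: zig-zag map to unsigned, then emit 7 bits per byte.
    static byte[] encodeInt(int n) {
        int z = (n << 1) ^ (n >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write(z);
        return out.toByteArray();
    }

    // Encode a 64-bit value with the same scheme.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Decode bytes as a long; bytes written by encodeInt decode without change.
    static long decodeLong(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return (z >>> 1) ^ -(z & 1); // undo the zig-zag mapping
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 64, -64, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            boolean sameBytes = Arrays.equals(encodeInt(n), encodeLong((long) n));
            System.out.println(n + " same bytes as long: " + sameBytes
                + ", decoded as long: " + decodeLong(encodeInt(n)));
        }
    }
}
```

The reverse direction is the risky one: a value written as a long decodes to
the same bytes, but narrowing it to an int can overflow, which is one reason
the commit recommends schemas with identical Parsing Canonical Form.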