Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/12 14:19:46 UTC

[arrow] branch master updated: ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new c2da956  ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation
c2da956 is described below

commit c2da956ac061fe8128edf8ce7527a563e791e170
Author: Wes McKinney <we...@apache.org>
AuthorDate: Wed Jun 12 09:19:33 2019 -0500

    ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation
    
    This documentation is not being properly maintained and it duplicates the point-of-truth information in format/Schema.fbs.
    
    I also split out the integration-testing related JSON stuff to a separate section and opened ARROW-5563 about sprucing that up.
    
    Author: Wes McKinney <we...@apache.org>
    
    Closes #4523 from wesm/remove-outdated-metadata-stuff and squashes the following commits:
    
    276412604 <Wes McKinney> Remove duplicated / out-of-date logical type information from documentation and direct readers to Schema.fbs
---
 docs/source/format/Metadata.rst | 245 ++++++++++++----------------------------
 1 file changed, 70 insertions(+), 175 deletions(-)

diff --git a/docs/source/format/Metadata.rst b/docs/source/format/Metadata.rst
index 293d011..b6c2a5f 100644
--- a/docs/source/format/Metadata.rst
+++ b/docs/source/format/Metadata.rst
@@ -65,96 +65,6 @@ the columns. The Flatbuffers IDL for a field is: ::
 The ``type`` is the logical type of the field. Nested types, such as List,
 Struct, and Union, have a sequence of child fields.
 
-A JSON representation of the schema is also provided:
-
-Field: ::
-
-    {
-      "name" : "name_of_the_field",
-      "nullable" : false,
-      "type" : /* Type */,
-      "children" : [ /* Field */ ],
-    }
-
-Type: ::
-
-    {
-      "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
-      // fields as defined in the Flatbuffer depending on the type name
-    }
-
-Union: ::
-
-    {
-      "name" : "union",
-      "mode" : "Sparse|Dense",
-      "typeIds" : [ /* integer */ ]
-    }
-
-The ``typeIds`` field in the Union are the codes used to denote each type, which
-may be different from the index of the child array. This is so that the union
-type ids do not have to be enumerated from 0.
-
-Int: ::
-
-    {
-      "name" : "int",
-      "bitWidth" : /* integer */,
-      "isSigned" : /* boolean */
-    }
-
-FloatingPoint: ::
-
-    {
-      "name" : "floatingpoint",
-      "precision" : "HALF|SINGLE|DOUBLE"
-    }
-
-Decimal: ::
-
-    {
-      "name" : "decimal",
-      "precision" : /* integer */,
-      "scale" : /* integer */
-    }
-
-Timestamp: ::
-
-    {
-      "name" : "timestamp",
-      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
-    }
-
-Date: ::
-
-    {
-      "name" : "date",
-      "unit" : "DAY|MILLISECOND"
-    }
-
-Time: ::
-
-    {
-      "name" : "time",
-      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
-      "bitWidth": /* integer: 32 or 64 */
-    }
-
-Interval: ::
-
-    {
-      "name" : "interval",
-      "unit" : "YEAR_MONTH|DAY_TIME"
-    }
-
-Schema: ::
-
-    {
-      "fields" : [
-        /* Field */
-      ]
-    }
-
 Record data headers
 -------------------
 
@@ -280,117 +190,102 @@ categories:
 * Types having equivalent memory layout to a physical nested type (e.g. strings
   use the list representation, but logically are not nested types)
 
-Integers
-~~~~~~~~
+Refer to `Schema.fbs`_ for up-to-date descriptions of each built-in
+logical type.
 
-In the first version of Arrow we provide the standard 8-bit through 64-bit size
-standard C integer types, both signed and unsigned:
+Integration Testing
+-------------------
 
-* Signed types: Int8, Int16, Int32, Int64
-* Unsigned types: UInt8, UInt16, UInt32, UInt64
+A JSON representation of the schema is provided for cross-language
+integration testing purposes.
 
-The IDL looks like: ::
+Field: ::
 
-    table Int {
-      bitWidth: int;
-      is_signed: bool;
+    {
+      "name" : "name_of_the_field",
+      "nullable" : false,
+      "type" : /* Type */,
+      "children" : [ /* Field */ ],
     }
 
-The integer endianness is currently set globally at the schema level. If a
-schema is set to be little-endian, then all integer types occurring within must
-be little-endian. Integers that are part of other data representations, such as
-list offsets and union types, must have the same endianness as the entire
-record batch.
-
-Floating point numbers
-~~~~~~~~~~~~~~~~~~~~~~
-
-We provide 3 types of floating point numbers as fixed bit-width primitive array
-
-- Half precision, 16-bit width
-- Single precision, 32-bit width
-- Double precision, 64-bit width
-
-The IDL looks like: ::
-
-    enum Precision:int {HALF, SINGLE, DOUBLE}
+Type: ::
 
-    table FloatingPoint {
-      precision: Precision;
+    {
+      "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+      // fields as defined in the Flatbuffer depending on the type name
     }
 
-Boolean
-~~~~~~~
-
-The Boolean logical type is represented as a 1-bit wide primitive physical
-type. The bits are numbered using least-significant bit (LSB) ordering.
-
-Like other fixed bit-width primitive types, boolean data appears as 2 buffers
-in the data header (one bitmap for the validity vector and one for the values).
-
-List
-~~~~
-
-The ``List`` logical type is the logical (and identically-named) counterpart to
-the List physical type.
-
-In data header form, the list field node contains 2 buffers:
+Union: ::
 
-* Validity bitmap
-* List offsets
+    {
+      "name" : "union",
+      "mode" : "Sparse|Dense",
+      "typeIds" : [ /* integer */ ]
+    }
 
-The buffers associated with a list's child field are handled recursively
-according to the child logical type (e.g. ``List<Utf8>`` vs. ``List<Boolean>``).
+The ``typeIds`` field in the Union are the codes used to denote each type, which
+may be different from the index of the child array. This is so that the union
+type ids do not have to be enumerated from 0.
 
-Utf8 and Binary
-~~~~~~~~~~~~~~~
+Int: ::
 
-We specify two logical types for variable length bytes:
+    {
+      "name" : "int",
+      "bitWidth" : /* integer */,
+      "isSigned" : /* boolean */
+    }
 
-* ``Utf8`` data is Unicode values with UTF-8 encoding
-* ``Binary`` is any other variable length bytes
+FloatingPoint: ::
 
-These types both have the same memory layout as the nested type ``List<UInt8>``,
-with the constraint that the inner bytes can contain no null values. From a
-logical type perspective they are primitive, not nested types.
+    {
+      "name" : "floatingpoint",
+      "precision" : "HALF|SINGLE|DOUBLE"
+    }
 
-In data header form, while ``List<UInt8>`` would appear as 2 field nodes (``List``
-and ``UInt8``) and 4 buffers (2 for each of the nodes, as per above), these types
-have a simplified representation single field node (of ``Utf8`` or ``Binary``
-logical type, which have no children) and 3 buffers:
+Decimal: ::
 
-* Validity bitmap
-* List offsets
-* Byte data
+    {
+      "name" : "decimal",
+      "precision" : /* integer */,
+      "scale" : /* integer */
+    }
 
-Decimal
-~~~~~~~
+Timestamp: ::
 
-Decimals are represented as a 2's complement 128-bit (16 byte) signed integer
-in little-endian byte order.
+    {
+      "name" : "timestamp",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
+    }
 
-Timestamp
-~~~~~~~~~
+Date: ::
 
-All timestamps are stored as a 64-bit integer, with one of four unit
-resolutions: second, millisecond, microsecond, and nanosecond.
+    {
+      "name" : "date",
+      "unit" : "DAY|MILLISECOND"
+    }
 
-Date
-~~~~
+Time: ::
 
-We support two different date types:
+    {
+      "name" : "time",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+      "bitWidth": /* integer: 32 or 64 */
+    }
 
-* Days since the UNIX epoch as a 32-bit integer
-* Milliseconds since the UNIX epoch as a 64-bit integer
+Interval: ::
 
-Time
-~~~~
+    {
+      "name" : "interval",
+      "unit" : "YEAR_MONTH|DAY_TIME"
+    }
 
-Time supports the same unit resolutions: second, millisecond, microsecond, and
-nanosecond. We represent time as the smallest integer accommodating the
-indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
+Schema: ::
 
-Dictionary encoding
--------------------
+    {
+      "fields" : [
+        /* Field */
+      ]
+    }
 
 .. _Flatbuffers: http://github.com/google/flatbuffers
+.. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs
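
The JSON layout that this patch moves into the "Integration Testing" section can be sketched in Python as follows. This is only an illustration and is not part of the commit. The helper names (`int_type`, `union_type`, `make_field`, `make_schema`) are hypothetical, the key names are taken from the Field, Int, FloatingPoint, Union, and Schema snippets in the diff, and the real integration-test files may carry keys beyond those listed here.

```python
import json


def int_type(bit_width, is_signed):
    # Mirrors the "Int" snippet: name, bitWidth, isSigned.
    return {"name": "int", "bitWidth": bit_width, "isSigned": is_signed}


def union_type(mode, type_ids):
    # Mirrors the "Union" snippet. typeIds are the codes for each child type
    # and need not match the child indices or start at 0.
    return {"name": "union", "mode": mode, "typeIds": list(type_ids)}


def make_field(name, type_, nullable=False, children=None):
    # Mirrors the "Field" snippet: name, nullable, type, children.
    return {
        "name": name,
        "nullable": nullable,
        "type": type_,
        "children": list(children or []),
    }


def make_schema(fields):
    # Mirrors the "Schema" snippet: a single "fields" list.
    return {"fields": list(fields)}


example = make_schema([
    make_field("id", int_type(64, True)),
    make_field(
        "score",
        {"name": "floatingpoint", "precision": "DOUBLE"},
        nullable=True,
    ),
    make_field(
        "value",
        union_type("Sparse", [5, 7]),  # arbitrary codes, chosen for illustration
        children=[
            make_field("as_int", int_type(32, True)),
            make_field("as_text", {"name": "utf8"}),
        ],
    ),
])

print(json.dumps(example, indent=2))
```

In the union field above, the codes 5 and 7 refer to the first and second child respectively, which is the case the ``typeIds`` paragraph describes: the codes may differ from the child array index.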