Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/12 14:19:46 UTC
[arrow] branch master updated: ARROW-4194: [Format][Docs] Remove
duplicated / out-of-date logical type information from documentation
This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new c2da956 ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation
c2da956 is described below
commit c2da956ac061fe8128edf8ce7527a563e791e170
Author: Wes McKinney <we...@apache.org>
AuthorDate: Wed Jun 12 09:19:33 2019 -0500
ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation
This documentation is not being properly maintained and it duplicates the point-of-truth information in format/Schema.fbs.
I also split out the integration-testing related JSON stuff to a separate section and opened ARROW-5563 about sprucing that up.
Author: Wes McKinney <we...@apache.org>
Closes #4523 from wesm/remove-outdated-metadata-stuff and squashes the following commits:
276412604 <Wes McKinney> Remove duplicated / out-of-date logical type information from documentation and direct readers to Schema.fbs
---
docs/source/format/Metadata.rst | 245 ++++++++++++----------------------------
1 file changed, 70 insertions(+), 175 deletions(-)
diff --git a/docs/source/format/Metadata.rst b/docs/source/format/Metadata.rst
index 293d011..b6c2a5f 100644
--- a/docs/source/format/Metadata.rst
+++ b/docs/source/format/Metadata.rst
@@ -65,96 +65,6 @@ the columns. The Flatbuffers IDL for a field is: ::
The ``type`` is the logical type of the field. Nested types, such as List,
Struct, and Union, have a sequence of child fields.
-A JSON representation of the schema is also provided:
-
-Field: ::
-
- {
- "name" : "name_of_the_field",
- "nullable" : false,
- "type" : /* Type */,
- "children" : [ /* Field */ ],
- }
-
-Type: ::
-
- {
- "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
- // fields as defined in the Flatbuffer depending on the type name
- }
-
-Union: ::
-
- {
- "name" : "union",
- "mode" : "Sparse|Dense",
- "typeIds" : [ /* integer */ ]
- }
-
-The ``typeIds`` field in the Union are the codes used to denote each type, which
-may be different from the index of the child array. This is so that the union
-type ids do not have to be enumerated from 0.
-
-Int: ::
-
- {
- "name" : "int",
- "bitWidth" : /* integer */,
- "isSigned" : /* boolean */
- }
-
-FloatingPoint: ::
-
- {
- "name" : "floatingpoint",
- "precision" : "HALF|SINGLE|DOUBLE"
- }
-
-Decimal: ::
-
- {
- "name" : "decimal",
- "precision" : /* integer */,
- "scale" : /* integer */
- }
-
-Timestamp: ::
-
- {
- "name" : "timestamp",
- "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
- }
-
-Date: ::
-
- {
- "name" : "date",
- "unit" : "DAY|MILLISECOND"
- }
-
-Time: ::
-
- {
- "name" : "time",
- "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
- "bitWidth": /* integer: 32 or 64 */
- }
-
-Interval: ::
-
- {
- "name" : "interval",
- "unit" : "YEAR_MONTH|DAY_TIME"
- }
-
-Schema: ::
-
- {
- "fields" : [
- /* Field */
- ]
- }
-
Record data headers
-------------------
@@ -280,117 +190,102 @@ categories:
* Types having equivalent memory layout to a physical nested type (e.g. strings
use the list representation, but logically are not nested types)
-Integers
-~~~~~~~~
+Refer to `Schema.fbs`_ for up-to-date descriptions of each built-in
+logical type.
-In the first version of Arrow we provide the standard 8-bit through 64-bit size
-standard C integer types, both signed and unsigned:
+Integration Testing
+-------------------
-* Signed types: Int8, Int16, Int32, Int64
-* Unsigned types: UInt8, UInt16, UInt32, UInt64
+A JSON representation of the schema is provided for cross-language
+integration testing purposes.
-The IDL looks like: ::
+Field: ::
- table Int {
- bitWidth: int;
- is_signed: bool;
+ {
+ "name" : "name_of_the_field",
+ "nullable" : false,
+ "type" : /* Type */,
+ "children" : [ /* Field */ ],
}
-The integer endianness is currently set globally at the schema level. If a
-schema is set to be little-endian, then all integer types occurring within must
-be little-endian. Integers that are part of other data representations, such as
-list offsets and union types, must have the same endianness as the entire
-record batch.
-
-Floating point numbers
-~~~~~~~~~~~~~~~~~~~~~~
-
-We provide 3 types of floating point numbers as fixed bit-width primitive array
-
-- Half precision, 16-bit width
-- Single precision, 32-bit width
-- Double precision, 64-bit width
-
-The IDL looks like: ::
-
- enum Precision:int {HALF, SINGLE, DOUBLE}
+Type: ::
- table FloatingPoint {
- precision: Precision;
+ {
+ "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+ // fields as defined in the Flatbuffer depending on the type name
}
-Boolean
-~~~~~~~
-
-The Boolean logical type is represented as a 1-bit wide primitive physical
-type. The bits are numbered using least-significant bit (LSB) ordering.
-
-Like other fixed bit-width primitive types, boolean data appears as 2 buffers
-in the data header (one bitmap for the validity vector and one for the values).
-
-List
-~~~~
-
-The ``List`` logical type is the logical (and identically-named) counterpart to
-the List physical type.
-
-In data header form, the list field node contains 2 buffers:
+Union: ::
-* Validity bitmap
-* List offsets
+ {
+ "name" : "union",
+ "mode" : "Sparse|Dense",
+ "typeIds" : [ /* integer */ ]
+ }
-The buffers associated with a list's child field are handled recursively
-according to the child logical type (e.g. ``List<Utf8>`` vs. ``List<Boolean>``).
+The ``typeIds`` field in the Union are the codes used to denote each type, which
+may be different from the index of the child array. This is so that the union
+type ids do not have to be enumerated from 0.
-Utf8 and Binary
-~~~~~~~~~~~~~~~
+Int: ::
-We specify two logical types for variable length bytes:
+ {
+ "name" : "int",
+ "bitWidth" : /* integer */,
+ "isSigned" : /* boolean */
+ }
-* ``Utf8`` data is Unicode values with UTF-8 encoding
-* ``Binary`` is any other variable length bytes
+FloatingPoint: ::
-These types both have the same memory layout as the nested type ``List<UInt8>``,
-with the constraint that the inner bytes can contain no null values. From a
-logical type perspective they are primitive, not nested types.
+ {
+ "name" : "floatingpoint",
+ "precision" : "HALF|SINGLE|DOUBLE"
+ }
-In data header form, while ``List<UInt8>`` would appear as 2 field nodes (``List``
-and ``UInt8``) and 4 buffers (2 for each of the nodes, as per above), these types
-have a simplified representation single field node (of ``Utf8`` or ``Binary``
-logical type, which have no children) and 3 buffers:
+Decimal: ::
-* Validity bitmap
-* List offsets
-* Byte data
+ {
+ "name" : "decimal",
+ "precision" : /* integer */,
+ "scale" : /* integer */
+ }
-Decimal
-~~~~~~~
+Timestamp: ::
-Decimals are represented as a 2's complement 128-bit (16 byte) signed integer
-in little-endian byte order.
+ {
+ "name" : "timestamp",
+ "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
+ }
-Timestamp
-~~~~~~~~~
+Date: ::
-All timestamps are stored as a 64-bit integer, with one of four unit
-resolutions: second, millisecond, microsecond, and nanosecond.
+ {
+ "name" : "date",
+ "unit" : "DAY|MILLISECOND"
+ }
-Date
-~~~~
+Time: ::
-We support two different date types:
+ {
+ "name" : "time",
+ "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+ "bitWidth": /* integer: 32 or 64 */
+ }
-* Days since the UNIX epoch as a 32-bit integer
-* Milliseconds since the UNIX epoch as a 64-bit integer
+Interval: ::
-Time
-~~~~
+ {
+ "name" : "interval",
+ "unit" : "YEAR_MONTH|DAY_TIME"
+ }
-Time supports the same unit resolutions: second, millisecond, microsecond, and
-nanosecond. We represent time as the smallest integer accommodating the
-indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
+Schema: ::
-Dictionary encoding
--------------------
+ {
+ "fields" : [
+ /* Field */
+ ]
+ }
.. _Flatbuffers: http://github.com/google/flatbuffers
+.. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs
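
The JSON shapes that the diff moves under "Integration Testing" (Field, Type, Union, Int, Timestamp, Schema) can be illustrated with a minimal Python sketch. The helper names below (`int_type`, `timestamp_type`, `union_type`, `field`) are hypothetical and exist only for this illustration; they are not part of Arrow or its integration-testing tooling, and the sketch assumes nothing beyond the key names shown in the diff.

```python
import json


def int_type(bit_width, is_signed):
    # "Int" shape from the diff: name, bitWidth, isSigned.
    return {"name": "int", "bitWidth": bit_width, "isSigned": is_signed}


def timestamp_type(unit):
    # "Timestamp" shape: unit is one of SECOND|MILLISECOND|MICROSECOND|NANOSECOND.
    allowed = {"SECOND", "MILLISECOND", "MICROSECOND", "NANOSECOND"}
    if unit not in allowed:
        raise ValueError(f"unsupported timestamp unit: {unit}")
    return {"name": "timestamp", "unit": unit}


def union_type(mode, type_ids):
    # "Union" shape: typeIds are the codes for each child type and
    # need not match the child index or start at 0.
    if mode not in {"Sparse", "Dense"}:
        raise ValueError(f"unsupported union mode: {mode}")
    return {"name": "union", "mode": mode, "typeIds": list(type_ids)}


def field(name, type_, nullable=False, children=None):
    # "Field" shape: name, nullable, type, children (a list of Fields).
    return {
        "name": name,
        "nullable": nullable,
        "type": type_,
        "children": children or [],
    }


# "Schema" shape: a single "fields" list of Field objects.
schema = {
    "fields": [
        field("id", int_type(64, True)),
        field("created", timestamp_type("MILLISECOND"), nullable=True),
        field(
            "value",
            union_type("Dense", [5, 7]),
            children=[
                field("as_int", int_type(32, True)),
                field("as_ts", timestamp_type("SECOND")),
            ],
        ),
    ]
}

text = json.dumps(schema, indent=2)
```

Here `text` holds the serialized schema. The union uses type ids 5 and 7 to show that the ids are independent of the child positions 0 and 1.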