You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by we...@apache.org on 2016/08/28 17:43:11 UTC
arrow git commit: ARROW-262: Start metadata specification document

Repository: arrow
Updated Branches:
  refs/heads/master 803afeb50 -> 907cc5a12


ARROW-262: Start metadata specification document

The purpose of this patch is to:

* Provide exposition and a place to clarify / provide examples illustrating the canonical metadata
* Begin providing definitions of logical types
* Where relevant, the data header metadata generated by a particular logical type (for example: strings produce one fewer buffer compared with List<UInt8> even though the effective memory layout is the same as a the nested type without any nulls in its child array)

This is not a complete specification and will require follow-up JIRAs to address more logical types and fill other gaps.

Author: Wes McKinney <we...@apache.org>

Closes #121 from wesm/ARROW-262 and squashes the following commits:

bba5e82 [Wes McKinney] int->short
8cc52fd [Wes McKinney] Drafting Metadata specification document


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/907cc5a1
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/907cc5a1
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/907cc5a1

Branch: refs/heads/master
Commit: 907cc5a1295c4e9227ac50abf5babbe497f1edd1
Parents: 803afeb
Author: Wes McKinney <we...@apache.org>
Authored: Sun Aug 28 13:43:01 2016 -0400
Committer: Wes McKinney <we...@apache.org>
Committed: Sun Aug 28 13:43:01 2016 -0400

----------------------------------------------------------------------
 format/Message.fbs |   3 +-
 format/Metadata.md | 258 ++++++++++++++++++++++++++++++++++++++++++++++++
 format/README.md   |   1 +
 3 files changed, 261 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/907cc5a1/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index b02f3fa..71428b5 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -28,12 +28,13 @@ table Int {
   is_signed: bool;
 }
 
-enum Precision:short {SINGLE, DOUBLE}
+enum Precision:short {HALF, SINGLE, DOUBLE}
 
 table FloatingPoint {
   precision: Precision;
 }
 
+/// Unicode with UTF-8 encoding
 table Utf8 {
 }
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/907cc5a1/format/Metadata.md
----------------------------------------------------------------------
diff --git a/format/Metadata.md b/format/Metadata.md
new file mode 100644
index 0000000..e227b8d
--- /dev/null
+++ b/format/Metadata.md
@@ -0,0 +1,258 @@
+# Metadata: Logical types, schemas, data headers
+
+This is documentation for the Arrow metadata specification, which enables
+systems to communicate the
+
+* Logical array types (which are implemented using the physical memory layouts
+  specified in [Layout.md][1])
+
+* Schemas for table-like collections of Arrow data structures
+
+* "Data headers" indicating the physical locations of memory buffers sufficient
+  to reconstruct a Arrow data structures without copying memory.
+
+## Canonical implementation
+
+We are using [Flatbuffers][2] for low-overhead reading and writing of the Arrow
+metadata. See [Message.fbs][3].
+
+## Schemas
+
+The `Schema` type describes a table-like structure consisting of any number of
+Arrow arrays, each of which can be interpreted as a column in the table. A
+schema by itself does not describe the physical structure of any particular set
+of data.
+
+A schema consists of a sequence of **fields**, which are metadata describing
+the columns. The Flatbuffers IDL for a field is:
+
+```
+table Field {
+  // Name is not required, in i.e. a List
+  name: string;
+  nullable: bool;
+  type: Type;
+  children: [Field];
+}
+```
+
+The `type` is the logical type of the field. Nested types, such as List,
+Struct, and Union, have a sequence of child fields.
+
+## Record data headers
+
+A record batch is a collection of top-level named, equal length Arrow arrays
+(or vectors). If one of the arrays contains nested data, its child arrays are
+not required to be the same length as the top-level arrays.
+
+One can be thought of as a realization of a particular schema. The metadata
+describing a particular record batch is called a "data header". Here is the
+Flatbuffers IDL for a record batch data header
+
+```
+table RecordBatch {
+  length: int;
+  nodes: [FieldNode];
+  buffers: [Buffer];
+}
+```
+
+The `nodes` and `buffers` fields are produced by a depth-first traversal /
+flattening of a schema (possibly containing nested types) for a given in-memory
+data set.
+
+### Buffers
+
+A buffer is metadata describing a contiguous memory region relative to some
+virtual address space. This may include:
+
+* Shared memory, e.g. a memory-mapped file
+* An RPC message received in-memory
+* Data in a file
+
+The key form of the Buffer type is:
+
+```
+struct Buffer {
+  offset: long;
+  length: long;
+}
+```
+
+In the context of a record batch, each field has some number of buffers
+associated with it, which are derived from their physical memory layout.
+
+Each logical type (separate from its children, if it is a nested type) has a
+deterministic number of buffers associated with it. These will be specified in
+the logical types section.
+
+### Field metadata
+
+The `FieldNode` values contain metadata about each level in a nested type
+hierarchy.
+
+```
+struct FieldNode {
+  /// The number of value slots in the Arrow array at this level of a nested
+  /// tree
+  length: int;
+
+  /// The number of observed nulls.
+  null_count: int;
+}
+```
+
+## Flattening of nested data
+
+Nested types are flattened in the record batch in depth-first order. When
+visiting each field in the nested type tree, the metadata is appended to the
+top-level `fields` array and the buffers associated with that field (but not
+its children) are appended to the `buffers` array.
+
+For example, let's consider the schema
+
+```
+col1: Struct<a: Int32, b: List<Int64>, c: Float64>
+col2: Utf8
+```
+
+The flattened version of this is:
+
+```
+FieldNode 0: Struct name='col1'
+FieldNode 1: Int32 name=a'
+FieldNode 2: List name='b'
+FieldNode 3: Int64 name='item'  # arbitrary
+FieldNode 4: Float64 name='c'
+FieldNode 5: Utf8 name='col2'
+```
+
+For the buffers produced, we would have the following (as described in more
+detail for each type below):
+
+```
+buffer 0: field 0 validity bitmap
+
+buffer 1: field 1 validity bitmap
+buffer 2: field 1 values <int32_t*>
+
+buffer 3: field 2 validity bitmap
+buffer 4: field 2 list offsets <int32_t*>
+
+buffer 5: field 3 validity bitmap
+buffer 6: field 3 values <int64_t*>
+
+buffer 7: field 4 validity bitmap
+buffer 8: field 4 values <double*>
+
+buffer 9: field 5 validity bitmap
+buffer 10: field 5 offsets <int32_t*>
+buffer 11: field 5 data <uint8_t*>
+```
+
+## Logical types
+
+A logical type consists of a type name and metadata along with an explicit
+mapping to a physical memory representation. These may fall into some different
+categories:
+
+* Types represented as fixed-width primitive arrays (for example: C-style
+  integers and floating point numbers)
+* Types having equivalent memory layout to a physical nested type (e.g. strings
+  use the list representation, but logically are not nested types)
+
+### Integers
+
+In the first version of Arrow we provide the standard 8-bit through 64-bit size
+standard C integer types, both signed and unsigned:
+
+* Signed types: Int8, Int16, Int32, Int64
+* Unsigned types: UInt8, UInt16, UInt32, UInt64
+
+The IDL looks like:
+
+```
+table Int {
+  bitWidth: int;
+  is_signed: bool;
+}
+```
+
+The integer endianness is currently set globally at the schema level. If a
+schema is set to be little-endian, then all integer types occurring within must
+be little-endian. Integers that are part of other data representations, such as
+list offsets and union types, must have the same endianness as the entire
+record batch.
+
+### Floating point numbers
+
+We provide 3 types of floating point numbers as fixed bit-width primitive array
+
+- Half precision, 16-bit width
+- Single precision, 32-bit width
+- Double precision, 64-bit width
+
+The IDL looks like:
+
+```
+enum Precision:int {HALF, SINGLE, DOUBLE}
+
+table FloatingPoint {
+  precision: Precision;
+}
+```
+
+### Boolean
+
+The Boolean logical type is represented as a 1-bit wide primitive physical
+type. The bits are numbered using least-significant bit (LSB) ordering.
+
+Like other fixed bit-width primitive types, boolean data appears as 2 buffers
+in the data header (one bitmap for the validity vector and one for the values).
+
+### List
+
+The `List` logical type is the logical (and identically-named) counterpart to
+the List physical type.
+
+In data header form, the list field node contains 2 buffers:
+
+* Validity bitmap
+* List offsets
+
+The buffers associated with a list's child field are handled recursively
+according to the child logical type (e.g. `List<Utf8>` vs. `List<Boolean>`).
+
+### Utf8 and Binary
+
+We specify two logical types for variable length bytes:
+
+* `Utf8` data is unicode values with UTF-8 encoding
+* `Binary` is any other variable length bytes
+
+These types both have the same memory layout as the nested type `List<UInt8>`,
+with the constraint that the inner bytes can contain no null values. From a
+logical type perspective they are primitive, not nested types.
+
+In data header form, while `List<UInt8>` would appear as 2 field nodes (`List`
+and `UInt8`) and 4 buffers (2 for each of the nodes, as per above), these types
+have a simplified representation single field node (of `Utf8` or `Binary`
+logical type, which have no children) and 3 buffers:
+
+* Validity bitmap
+* List offsets
+* Byte data
+
+### Decimal
+
+TBD
+
+### Timestamp
+
+TBD
+
+## Dictionary encoding
+
+[1]: https://github.com/apache/arrow/blob/master/format/Layout.md
+[2]: http://github.com/google/flatbuffers
+[3]: https://github.com/apache/arrow/blob/master/format/Message.fbs

http://git-wip-us.apache.org/repos/asf/arrow/blob/907cc5a1/format/README.md
----------------------------------------------------------------------
diff --git a/format/README.md b/format/README.md
index c84e007..3b0e503 100644
--- a/format/README.md
+++ b/format/README.md
@@ -6,6 +6,7 @@
 
 Currently, the Arrow specification consists of these pieces:
 
+- Metadata specification (see Metadata.md)
 - Physical memory layout specification (see Layout.md)
 - Metadata serialized representation (see Message.fbs)