Posted to commits@arrow.apache.org by we...@apache.org on 2017/04/10 12:30:32 UTC

arrow git commit: ARROW-526: [Format] Revise Format documents for evolution in IPC stream / file / tensor formats

Repository: arrow
Updated Branches:
  refs/heads/master acbda1893 -> ddda3039e


ARROW-526: [Format] Revise Format documents for evolution in IPC stream / file / tensor formats

Author: Wes McKinney <we...@twosigma.com>

Closes #515 from wesm/ARROW-526 and squashes the following commits:

6a38432 [Wes McKinney] Typo
5d564a6 [Wes McKinney] Revise Format documents for evolution in IPC stream / file / tensor formats


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/ddda3039
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/ddda3039
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/ddda3039

Branch: refs/heads/master
Commit: ddda3039e6fb6a9d4f2c5b1189369204bfe1ea93
Parents: acbda18
Author: Wes McKinney <we...@twosigma.com>
Authored: Mon Apr 10 08:30:27 2017 -0400
Committer: Wes McKinney <we...@twosigma.com>
Committed: Mon Apr 10 08:30:27 2017 -0400

----------------------------------------------------------------------
 format/IPC.md      | 131 +++++++++++++++++++++++++++++++++++-------------
 format/Metadata.md |  57 ++++++++++++++++++---
 2 files changed, 146 insertions(+), 42 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/ddda3039/format/IPC.md
----------------------------------------------------------------------
diff --git a/format/IPC.md b/format/IPC.md
index d386e60..f0a67e2 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -14,65 +14,106 @@
 
 # Interprocess messaging / communication (IPC)
 
-## File format
+## Encapsulated message format
+
+Data components in the stream and file formats are represented as encapsulated
+*messages* consisting of:
 
-We define a self-contained "file format" containing an Arrow schema along with
-one or more record batches defining a dataset. See [format/File.fbs][1] for the
-precise details of the file metadata.
+* A length prefix indicating the metadata size
+* The message metadata as a [Flatbuffer][3]
+* Padding bytes to an 8-byte boundary
+* The message body
 
-In general, the file looks like:
+Schematically, we have:
 
 ```
-<magic number "ARROW1">
-<empty padding bytes [to 8 byte boundary]>
+<metadata_size: int32>
+<metadata_flatbuffer: bytes>
+<padding>
+<message body>
+```
+
+The `metadata_size` includes the size of the flatbuffer plus padding. The
+`Message` flatbuffer includes a version number, the particular message (as a
+flatbuffer union), and the size of the message body:
+
+```
+table Message {
+  version: org.apache.arrow.flatbuf.MetadataVersion;
+  header: MessageHeader;
+  bodyLength: long;
+}
+```
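The framing above can be sketched in Python (a minimal illustration, not part of any Arrow library; the Message flatbuffer is stood in for by opaque bytes):

```python
import struct

def write_encapsulated(metadata: bytes, body: bytes) -> bytes:
    """Frame a message: int32 metadata size (flatbuffer plus padding),
    the flatbuffer bytes padded to an 8-byte boundary, then the body."""
    padded = (len(metadata) + 7) & ~7  # round up to the 8-byte boundary
    return struct.pack('<i', padded) + metadata.ljust(padded, b'\x00') + body

def read_encapsulated(buf: bytes, offset: int = 0):
    """Return (metadata bytes, offset of the message body)."""
    (metadata_size,) = struct.unpack_from('<i', buf, offset)
    metadata = buf[offset + 4 : offset + 4 + metadata_size]
    return metadata, offset + 4 + metadata_size
```

Note that the body's length is not part of the frame itself; it comes from the `bodyLength` field of the decoded `Message` flatbuffer.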
+
+Currently, we support four message types:
+
+* Schema
+* RecordBatch
+* DictionaryBatch
+* Tensor
+
+## Streaming format
+
+We provide a streaming format for record batches. It is presented as a sequence
+of encapsulated messages, each of which follows the format above. The schema
+comes first in the stream, and it is the same for all of the record batches
+that follow. If any fields in the schema are dictionary-encoded, one or more
+`DictionaryBatch` messages will follow the schema.
+
+```
+<SCHEMA>
 <DICTIONARY 0>
 ...
 <DICTIONARY k - 1>
 <RECORD BATCH 0>
 ...
 <RECORD BATCH n - 1>
-<METADATA org.apache.arrow.flatbuf.Footer>
-<metadata_size: int32>
-<magic number "ARROW1">
+<EOS [optional]: int32>
 ```
 
-See the File.fbs document for details about the Flatbuffers metadata. The
-record batches have a particular structure, defined next.
+When reading a stream, after each message the reader reads the next 4 bytes to
+determine the size of the message metadata that follows. Once the metadata
+flatbuffer has been read, the reader can then read the message body.
+
+The stream writer can signal end-of-stream (EOS) either by writing 0 as the
+`int32` length prefix or by simply closing the stream interface.
+
+## File format
 
-### Record batches
+We define a "file format" supporting random access in a very similar format to
+the streaming format. The file starts and ends with a magic string `ARROW1`
+(plus padding). What follows in the file is identical to the stream format. At
+the end of the file, we write a *footer* including offsets and sizes for each
+of the data blocks in the file, so that random access is possible. See
+[format/File.fbs][1] for the precise details of the file footer.
 
-The record batch metadata is written as a flatbuffer (see
-[format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size,
-followed by each of the memory buffers in the batch written end to end (with
-appropriate alignment and padding):
+Schematically we have:
 
 ```
-<int32: metadata flatbuffer size>
-<metadata: org.apache.arrow.flatbuf.RecordBatch>
-<padding bytes [to 8-byte boundary]>
-<body: buffers end to end>
+<magic number "ARROW1">
+<empty padding bytes [to 8 byte boundary]>
+<STREAMING FORMAT>
+<FOOTER>
+<FOOTER SIZE: int32>
+<magic number "ARROW1">
 ```
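Locating the footer per the layout above can be sketched as (Python; operating on the whole file as bytes for brevity):

```python
import struct

MAGIC = b'ARROW1'

def read_footer(buf: bytes) -> bytes:
    """Return the footer flatbuffer bytes from an Arrow file."""
    assert buf[:len(MAGIC)] == MAGIC and buf[-len(MAGIC):] == MAGIC
    # the int32 footer size sits immediately before the trailing magic
    footer_end = len(buf) - len(MAGIC) - 4
    (footer_size,) = struct.unpack_from('<i', buf, footer_end)
    return buf[footer_end - footer_size : footer_end]
```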
 
+### RecordBatch body structure
+
 The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
 field metadata and physical memory buffers (some comments from [Message.fbs][2]
 have been shortened / removed):
 
 ```
 table RecordBatch {
-  length: int;
+  length: long;
   nodes: [FieldNode];
   buffers: [Buffer];
 }
 
 struct FieldNode {
-  /// The number of value slots in the Arrow array at this level of a nested
-  /// tree
-  length: int;
-
-  /// The number of observed nulls. Fields with null_count == 0 may choose not
-  /// to write their physical validity bitmap out as a materialized buffer,
-  /// instead setting the length of the bitmap buffer to 0.
-  null_count: int;
+  length: long;
+  null_count: long;
 }
 
 struct Buffer {
@@ -91,9 +132,9 @@ struct Buffer {
 ```
 
 In the context of a file, the `page` is not used, and the `Buffer` offsets use
-as a frame of reference the start of the segment where they are written in the
-file. So, while in a general IPC setting these offsets may be anyplace in one
-or more shared memory regions, in the file format the offsets start from 0.
+as a frame of reference the start of the message body. So, while in a general
+IPC setting these offsets may be anyplace in one or more shared memory regions,
+in the file format the offsets start from 0.
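For example, with hypothetical `(offset, length)` buffer entries, resolving file-format offsets is just adding the absolute position of the message body:

```python
body_start = 512                 # absolute position of the message body in the file
buffers = [(0, 64), (64, 128)]   # Buffer (offset, length) pairs, body-relative
# resolved entries point at absolute positions in the file
resolved = [(body_start + off, length) for off, length in buffers]
```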
 
 The location of a record batch and the size of the metadata block as well as
 the body of buffers is stored in the file footer:
@@ -112,12 +153,30 @@ Some notes about this
 * The metadata length includes the flatbuffer size, the record batch metadata
   flatbuffer, and any padding bytes
 
-
-### Dictionary batches
+### Dictionary Batches
 
 Dictionary batches have not yet been implemented, while they are provided for
 in the metadata. For the time being, the `DICTIONARY` segments shown above in
 the file do not appear in any of the file implementations.
 
+### Tensor (Multi-dimensional Array) Message Format
+
+The `Tensor` message type provides a way to write a multidimensional array of
+fixed-size values (such as a NumPy ndarray) using Arrow's shared memory
+tools. Arrow implementations in general are not required to implement this data
+format, though we provide a reference implementation in C++.
+
+When writing a standalone encapsulated tensor message, we use the format as
+indicated above, but additionally align the starting offset (if writing to a
+shared memory region) to be a multiple of 8:
+
+```
+<PADDING>
+<metadata size: int32>
+<metadata>
+<tensor body>
+```
+
 [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
-[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
\ No newline at end of file
+[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs
+[3]: https://github.com/google/flatbuffers

http://git-wip-us.apache.org/repos/asf/arrow/blob/ddda3039/format/Metadata.md
----------------------------------------------------------------------
diff --git a/format/Metadata.md b/format/Metadata.md
index a4878f3..18fac52 100644
--- a/format/Metadata.md
+++ b/format/Metadata.md
@@ -86,8 +86,8 @@ VectorLayout:
 Type:
 ```
 {
-  "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|bool|decimal|date|time|timestamp|interval"
-  // fields as defined in the flatbuff depending on the type name
+  "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+  // fields as defined in the Flatbuffer depending on the type name
 }
 ```
 Union:
@@ -126,14 +126,37 @@ Decimal:
   "scale" : /* integer */
 }
 ```
+
 Timestamp:
+
 ```
 {
   "name" : "timestamp",
   "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
 }
 ```
+
+Date:
+
+```
+{
+  "name" : "date",
+  "unit" : "DAY|MILLISECOND"
+}
+```
+
+Time:
+
+```
+{
+  "name" : "time",
+  "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+  "bitWidth": /* integer: 32 or 64 */
+}
+```
+
 Interval:
+
 ```
 {
   "name" : "interval",
@@ -161,12 +184,16 @@ Flatbuffers IDL for a record batch data header
 
 ```
 table RecordBatch {
-  length: int;
+  length: long;
   nodes: [FieldNode];
   buffers: [Buffer];
 }
 ```
 
+The `RecordBatch` metadata provides for record batches with length exceeding
+2^31 - 1, but Arrow implementations are not required to implement support
+beyond this size.
+
 The `nodes` and `buffers` fields are produced by a depth-first traversal /
 flattening of a schema (possibly containing nested types) for a given in-memory
 data set.
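The traversal can be sketched as (Python; field metadata modeled as hypothetical dicts with `node`, `buffers`, and `children` keys):

```python
def flatten(field):
    """Depth-first (pre-order) flattening: each field contributes its
    FieldNode and buffers before those of its children."""
    nodes = [field["node"]]
    buffers = list(field["buffers"])
    for child in field.get("children", []):
        child_nodes, child_buffers = flatten(child)
        nodes += child_nodes
        buffers += child_buffers
    return nodes, buffers
```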
@@ -205,13 +232,17 @@ hierarchy.
 struct FieldNode {
   /// The number of value slots in the Arrow array at this level of a nested
   /// tree
-  length: int;
+  length: long;
 
   /// The number of observed nulls.
-  null_count: int;
+  null_count: long;
 }
 ```
 
+The `FieldNode` metadata provides for fields with length exceeding 2^31 - 1,
+but Arrow implementations are not required to implement support for large
+arrays.
+
 ## Flattening of nested data
 
 Nested types are flattened in the record batch in depth-first order. When
@@ -359,7 +390,21 @@ TBD
 
 ### Timestamp
 
-TBD
+All timestamps are stored as a 64-bit integer, with one of four unit
+resolutions: second, millisecond, microsecond, and nanosecond.
+
+### Date
+
+We support two different date types:
+
+* Days since the UNIX epoch as a 32-bit integer
+* Milliseconds since the UNIX epoch as a 64-bit integer
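As a worked example of the two representations (Python, standard library only):

```python
from datetime import date

EPOCH = date(1970, 1, 1)

# DAY unit: days since the UNIX epoch as a 32-bit integer
days = (date(2017, 4, 10) - EPOCH).days

# MILLISECOND unit: the same date at midnight, as a 64-bit integer
millis = days * 86_400_000
```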
+
+### Time
+
+Time supports the same four unit resolutions: second, millisecond, microsecond,
+and nanosecond. We represent time as the smallest integer type that
+accommodates the indicated unit: 32-bit for second and millisecond units, and
+64-bit for microsecond and nanosecond units.
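That rule can be captured in a small lookup table (a sketch; unit names follow the JSON integration format above):

```python
# Smallest integer width per time unit, per the rule above
TIME_BIT_WIDTH = {
    "SECOND": 32,
    "MILLISECOND": 32,
    "MICROSECOND": 64,
    "NANOSECOND": 64,
}
```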
 
 ## Dictionary encoding