Posted to commits@arrow.apache.org by we...@apache.org on 2017/11/04 00:55:39 UTC

[arrow] branch master updated: ARROW-1727: [Format] Expand Arrow streaming format to permit deltas / additions to existing dictionaries

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 5d66576  ARROW-1727: [Format] Expand Arrow streaming format to permit deltas / additions to existing dictionaries
5d66576 is described below

commit 5d665762cd8c6ebbe94ce39b435a63ca4cf15967
Author: Brian Hulette <br...@ccri.com>
AuthorDate: Fri Nov 3 20:55:27 2017 -0400

    ARROW-1727: [Format] Expand Arrow streaming format to permit deltas / additions to existing dictionaries
    
    Add an `isDelta` flag to the `DictionaryBatch` to allow for dictionary modifications mid-stream, update documentation.
    
    Author: Brian Hulette <br...@ccri.com>
    
    Closes #1257 from TheNeuralBit/ARROW-1727 and squashes the following commits:
    
    c69a5539 [Brian Hulette] Documentation tweaks
    3dff0a9c [Brian Hulette] Add isDelta flag to DictionaryBatch, update documentation
---
 format/IPC.md      | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 format/Layout.md   |  6 +++---
 format/Message.fbs | 10 +++++++---
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/format/IPC.md b/format/IPC.md
index 2f79031..f3b4885 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -67,7 +67,9 @@ We provide a streaming format for record batches. It is presented as a sequence
 of encapsulated messages, each of which follows the format above. The schema
 comes first in the stream, and it is the same for all of the record batches
 that follow. If any fields in the schema are dictionary-encoded, one or more
-`DictionaryBatch` messages will follow the schema.
+`DictionaryBatch` messages will be included. `DictionaryBatch` and
+`RecordBatch` messages may be interleaved, but before any dictionary key is used
+in a `RecordBatch` it should be defined in a `DictionaryBatch`.
 
 ```
 <SCHEMA>
@@ -76,6 +78,10 @@ that follow. If any fields in the schema are dictionary-encoded, one or more
 <DICTIONARY k - 1>
 <RECORD BATCH 0>
 ...
+<DICTIONARY x DELTA>
+...
+<DICTIONARY y DELTA>
+...
 <RECORD BATCH n - 1>
 <EOS [optional]: int32>
 ```
@@ -109,6 +115,10 @@ Schematically we have:
 <magic number "ARROW1">
 ```
 
+In the file format, there is no requirement that dictionary keys should be
+defined in a `DictionaryBatch` before they are used in a `RecordBatch`, as long
+as the keys are defined somewhere in the file.
+
 ### RecordBatch body structure
 
 The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
@@ -181,6 +191,7 @@ the dictionaries can be properly interpreted.
 table DictionaryBatch {
   id: long;
   data: RecordBatch;
+  isDelta: bool = false;
 }
 ```
 
@@ -189,6 +200,38 @@ in the schema, so that dictionaries can even be used for multiple fields. See
 the [Physical Layout][4] document for more about the semantics of
 dictionary-encoded data.
 
+The dictionary `isDelta` flag allows dictionary batches to be modified
+mid-stream.  A dictionary batch with `isDelta` set indicates that its vector
+should be concatenated with those of any previous batches with the same `id`. A
+stream which encodes one column, the list of strings
+`["A", "B", "C", "B", "D", "C", "E", "A"]`, with a delta dictionary batch could
+take the form:
+
+```
+<SCHEMA>
+<DICTIONARY 0>
+(0) "A"
+(1) "B"
+(2) "C"
+
+<RECORD BATCH 0>
+0
+1
+2
+1
+
+<DICTIONARY 0 DELTA>
+(3) "D"
+(4) "E"
+
+<RECORD BATCH 1>
+3
+2
+4
+0
+EOS
+```
+
 ### Tensor (Multi-dimensional Array) Message Format
 
 The `Tensor` message type provides a way to write a multidimensional array of
diff --git a/format/Layout.md b/format/Layout.md
index ebf9382..963202f 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -615,9 +615,9 @@ the types array indicates that a slot contains a different type at the index
 ## Dictionary encoding
 
 When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
-When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.
+The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
+The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
+When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.
 
 As an example, you could have the following data:
 ```
diff --git a/format/Message.fbs b/format/Message.fbs
index f4a9571..8307181 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -61,16 +61,20 @@ table RecordBatch {
   buffers: [Buffer];
 }
 
-/// ----------------------------------------------------------------------
 /// For sending dictionary encoding information. Any Field can be
 /// dictionary-encoded, but in this case none of its children may be
 /// dictionary-encoded.
-/// There is one vector / column per dictionary
-///
+/// There is one vector / column per dictionary, but that vector / column
+/// may be spread across multiple dictionary batches by using the isDelta
+/// flag
 
 table DictionaryBatch {
   id: long;
   data: RecordBatch;
+
+  /// If isDelta is true the values in the dictionary are to be appended to a
+  /// dictionary with the indicated id
+  isDelta: bool = false;
 }
 
 /// ----------------------------------------------------------------------
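
Illustrative sketch (not part of the commit): the delta semantics documented in `format/IPC.md` above can be modeled in a few lines of Python. Everything here is hypothetical and for illustration only: the tuple-based message shapes and the `decode_stream` helper are assumptions, not Arrow's API or wire format. The sketch also assumes that a non-delta batch for an already-seen `id` replaces the stored dictionary, which this commit does not specify.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical in-memory stand-ins for stream messages (not Arrow's API):
#   ("dictionary", id, values, is_delta)
#   ("record", dictionary id, indices)
DictionaryMessage = Tuple[str, int, List[str], bool]
RecordMessage = Tuple[str, int, List[int]]
Message = Union[DictionaryMessage, RecordMessage]


def decode_stream(messages: List[Message]) -> List[str]:
    """Decode one dictionary-encoded string column from a message sequence."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for message in messages:
        if message[0] == "dictionary":
            _, dict_id, values, is_delta = message
            if is_delta:
                # A delta batch is concatenated onto the dictionary
                # already received for the same id.
                if dict_id not in dictionaries:
                    raise ValueError(f"delta for unknown dictionary {dict_id}")
                dictionaries[dict_id].extend(values)
            else:
                # Assumption: a non-delta batch (re)defines the dictionary.
                dictionaries[dict_id] = list(values)
        elif message[0] == "record":
            _, dict_id, indices = message
            dictionary = dictionaries[dict_id]
            for index in indices:
                # In the streaming format a key must be defined before use.
                if not 0 <= index < len(dictionary):
                    raise ValueError(f"key {index} used before it was defined")
                decoded.append(dictionary[index])
        else:
            raise ValueError(f"unknown message kind {message[0]!r}")
    return decoded


# The example stream from the IPC.md change above.
example_stream: List[Message] = [
    ("dictionary", 0, ["A", "B", "C"], False),
    ("record", 0, [0, 1, 2, 1]),
    ("dictionary", 0, ["D", "E"], True),
    ("record", 0, [3, 2, 4, 0]),
]
print(decode_stream(example_stream))
```

Under these assumptions the example stream decodes back to the original column, and a record batch that uses key 3 before the delta batch arrives is rejected, which mirrors the ordering requirement stated for the streaming format.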
