Posted to commits@arrow.apache.org by we...@apache.org on 2017/01/23 14:13:45 UTC

arrow git commit: ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases

Repository: arrow
Updated Branches:
  refs/heads/master 282103012 -> 085c8754b


ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases

cc @julienledem @nongli @jacques-n. I am hoping to close the loop on our discussion in https://issues.apache.org/jira/browse/ARROW-81. In my applications, I need the flexibility to transmit:

* Dictionary indices encoded in signed integers smaller than int32. For example, with 10 dictionary values, we may send int8 indices.
* An indicator that the dictionary is ordered.

These features are needed for Python and R support, and in general for statistical computing applications.
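
As a rough illustration of the two requirements above, here is a minimal pure-Python sketch. It is not Arrow code: the helper names `smallest_index_type` and `dictionary_encode`, and the shape of the returned dict, are assumptions made for this example only. It picks the narrowest signed index type that can address every dictionary entry and carries an ordered flag alongside the indices.

```python
# Hypothetical helpers for illustration only; these are not Arrow APIs.

# (type name, largest index the signed type can hold)
SIGNED_INDEX_TYPES = [
    ("int8", 2**7 - 1),
    ("int16", 2**15 - 1),
    ("int32", 2**31 - 1),
]


def smallest_index_type(num_values):
    """Return the narrowest signed type whose range covers 0..num_values-1."""
    for name, max_index in SIGNED_INDEX_TYPES:
        if num_values - 1 <= max_index:
            return name
    raise ValueError("dictionary too large for int32 indices")


def dictionary_encode(values, ordered_categories=None):
    """Encode values as indices into a dictionary.

    If ordered_categories is given, it defines the dictionary and its
    order is treated as meaningful; otherwise the dictionary is built in
    order of first appearance and flagged as unordered.
    """
    if ordered_categories is not None:
        dictionary = list(ordered_categories)
        is_ordered = True
    else:
        dictionary = list(dict.fromkeys(values))
        is_ordered = False
    positions = {value: i for i, value in enumerate(dictionary)}
    return {
        "dictionary": dictionary,
        "indices": [positions[value] for value in values],
        "index_type": smallest_index_type(len(dictionary)),
        "is_ordered": is_ordered,
    }


encoded = dictionary_encode(
    ["low", "high", "low", "medium"],
    ordered_categories=["low", "medium", "high"],
)
print(encoded["index_type"], encoded["indices"], encoded["is_ordered"])
```

With three categories the indices fit in int8, so a receiver that knows the index type and the ordered flag can rebuild an ordered categorical without widening the indices to int32.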

Author: Wes McKinney <we...@twosigma.com>

Closes #297 from wesm/ARROW-81 and squashes the following commits:

c960bac [Wes McKinney] Augment dictionary encoding metadata to accommodate additional use cases


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/085c8754
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/085c8754
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/085c8754

Branch: refs/heads/master
Commit: 085c8754b0ab2da7fcd245fc88bc4de9a6806a4c
Parents: 2821030
Author: Wes McKinney <we...@twosigma.com>
Authored: Mon Jan 23 09:13:39 2017 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Mon Jan 23 09:13:39 2017 -0500

----------------------------------------------------------------------
 format/Message.fbs | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/085c8754/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index b2c6464..028c56a 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -151,6 +151,26 @@ table KeyValue {
 }
 
 /// ----------------------------------------------------------------------
+/// Dictionary encoding metadata
+
+table DictionaryEncoding {
+  /// The known dictionary id in the application where this data is used. In
+  /// the file or streaming formats, the dictionary ids are found in the
+  /// DictionaryBatch messages
+  id: long;
+
+  /// The dictionary indices are constrained to be non-negative integers. If
+  /// this field is null, the indices must be signed int32
+  indexType: Int;
+
+  /// By default, dictionaries are not ordered, or the order does not have
+  /// semantic meaning. In some statistical applications, dictionary-encoding
+  /// is used to represent ordered categorical data, and we provide a way to
+  /// preserve that metadata here
+  isOrdered: bool;
+}
+
+/// ----------------------------------------------------------------------
 /// A field represents a named column in a record / row batch or child of a
 /// nested type.
 ///
@@ -163,9 +183,10 @@ table Field {
   name: string;
   nullable: bool;
   type: Type;
-  // present only if the field is dictionary encoded
-  // will point to a dictionary provided by a DictionaryBatch message
-  dictionary: long;
+
+  // Present only if the field is dictionary encoded
+  dictionary: DictionaryEncoding;
+
   // children apply only to Nested data types like Struct, List and Union
   children: [Field];
   /// layout of buffers produced for this type (as derived from the Type)