Posted to commits@arrow.apache.org by we...@apache.org on 2017/01/23 14:13:45 UTC
arrow git commit: ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases
Repository: arrow
Updated Branches:
refs/heads/master 282103012 -> 085c8754b
ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases
cc @julienledem @nongli @jacques-n. I am hoping to close the loop on our discussion in https://issues.apache.org/jira/browse/ARROW-81. In my applications, I need the flexibility to transmit:
* Dictionary indices encoded as signed integers smaller than int32. For example, with 10 dictionary values, we may send int8 indices.
* An indicator that the dictionary is ordered.
These features are needed for Python and R support, and in general for statistical computing applications.
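As a rough illustration of the two requirements above (not part of this patch), the following Python sketch picks the narrowest signed integer width that can hold every dictionary index and carries an ordered flag alongside the encoded data. The names `smallest_index_bit_width`, `EncodedColumn`, and `dictionary_encode` are hypothetical and exist only for this example; they are not Arrow APIs.

```python
from dataclasses import dataclass

# Largest value representable by each signed integer width.
_SIGNED_MAX = {8: 2**7 - 1, 16: 2**15 - 1, 32: 2**31 - 1, 64: 2**63 - 1}


def smallest_index_bit_width(num_dictionary_values: int) -> int:
    """Return the narrowest signed width that can hold every index."""
    if num_dictionary_values < 0:
        raise ValueError("number of dictionary values must be non-negative")
    # Indices run from 0 to num_dictionary_values - 1.
    largest_index = max(num_dictionary_values - 1, 0)
    for bit_width in (8, 16, 32, 64):
        if largest_index <= _SIGNED_MAX[bit_width]:
            return bit_width
    raise ValueError("too many dictionary values for a 64-bit index")


@dataclass
class EncodedColumn:
    """Hypothetical container for a dictionary-encoded column."""
    dictionary: list
    indices: list
    index_bit_width: int
    is_ordered: bool


def dictionary_encode(values, categories, is_ordered=False):
    """Replace each value with its position in `categories`."""
    position = {category: i for i, category in enumerate(categories)}
    indices = [position[value] for value in values]
    return EncodedColumn(
        dictionary=list(categories),
        indices=indices,
        index_bit_width=smallest_index_bit_width(len(categories)),
        is_ordered=is_ordered,
    )
```

With 10 dictionary values the helper selects 8-bit indices, which matches the int8 case mentioned above. An ordered categorical such as low/medium/high would be encoded with `is_ordered=True` so that a consumer knows the dictionary order is meaningful.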
Author: Wes McKinney <we...@twosigma.com>
Closes #297 from wesm/ARROW-81 and squashes the following commits:
c960bac [Wes McKinney] Augment dictionary encoding metadata to accommodate additional use cases
Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/085c8754
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/085c8754
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/085c8754
Branch: refs/heads/master
Commit: 085c8754b0ab2da7fcd245fc88bc4de9a6806a4c
Parents: 2821030
Author: Wes McKinney <we...@twosigma.com>
Authored: Mon Jan 23 09:13:39 2017 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Mon Jan 23 09:13:39 2017 -0500
----------------------------------------------------------------------
format/Message.fbs | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/arrow/blob/085c8754/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index b2c6464..028c56a 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -151,6 +151,26 @@ table KeyValue {
}
/// ----------------------------------------------------------------------
+/// Dictionary encoding metadata
+
+table DictionaryEncoding {
+ /// The known dictionary id in the application where this data is used. In
+ /// the file or streaming formats, the dictionary ids are found in the
+ /// DictionaryBatch messages
+ id: long;
+
+ /// The dictionary indices are constrained to be non-negative integers. If this
+ /// field is null, the indices must be signed int32
+ indexType: Int;
+
+ /// By default, dictionaries are not ordered, or the order does not have
+ /// semantic meaning. In some statistical applications, dictionary-encoding
+ /// is used to represent ordered categorical data, and we provide a way to
+ /// preserve that metadata here
+ isOrdered: bool;
+}
+
+/// ----------------------------------------------------------------------
/// A field represents a named column in a record / row batch or child of a
/// nested type.
///
@@ -163,9 +183,10 @@ table Field {
name: string;
nullable: bool;
type: Type;
- // present only if the field is dictionary encoded
- // will point to a dictionary provided by a DictionaryBatch message
- dictionary: long;
+
+ // Present only if the field is dictionary encoded
+ dictionary: DictionaryEncoding;
+
// children apply only to Nested data types like Struct, List and Union
children: [Field];
/// layout of buffers produced for this type (as derived from the Type)
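
The defaults described in the new DictionaryEncoding comments could be handled by a reader roughly as follows. This is a minimal Python sketch, not generated FlatBuffers code: the classes mirror the schema by hand, `IntType` is assumed to carry a bit width and a signedness flag, and `resolve_index_type` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntType:
    """Hand-written stand-in for the schema's Int table (assumed fields)."""
    bit_width: int
    is_signed: bool


@dataclass
class DictionaryEncoding:
    """Hand-written mirror of the DictionaryEncoding table."""
    id: int
    index_type: Optional[IntType] = None  # null means signed int32
    is_ordered: bool = False              # unordered unless stated


def resolve_index_type(encoding: DictionaryEncoding) -> IntType:
    """Apply the documented default when indexType is absent."""
    if encoding.index_type is None:
        return IntType(bit_width=32, is_signed=True)
    return encoding.index_type
```

A writer that is content with int32 indices and an unordered dictionary then only needs to set the id, while a writer sending int8 indices sets the index type explicitly.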