Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/14 13:19:46 UTC

[arrow] branch master updated: ARROW-5342: [Format] Formalize "extension types" in Arrow protocol metadata

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 6fb850c  ARROW-5342: [Format] Formalize "extension types" in Arrow protocol metadata
6fb850c is described below

commit 6fb850cf57fd6227573cca6d43a46e1d5d2b0a66
Author: Wes McKinney <we...@apache.org>
AuthorDate: Fri Jun 14 08:19:38 2019 -0500

    ARROW-5342: [Format] Formalize "extension types" in Arrow protocol metadata
    
    This patch proposes a language-independent scheme for annotating built-in Arrow types with a custom type name and serialized representation, per previous discussions on the mailing list.
    
    I am starting a mailing list discussion to hold a vote about this and see if there are other ideas about how to proceed.
    
    Author: Wes McKinney <we...@apache.org>
    
    Closes #4332 from wesm/ARROW-5342 and squashes the following commits:
    
    ff7ca2c37 <Wes McKinney> Fix formatting issue and missing backtick
    4d0317482 <Wes McKinney> Add language to formalize extension type machinery. Change C++ metadata key names to use ARROW: prefix
---
 cpp/src/arrow/extension_type-test.cc   |  4 +-
 cpp/src/arrow/ipc/metadata-internal.cc |  8 ++--
 docs/source/format/Metadata.rst        | 77 ++++++++++++++++++++++++++++------
 3 files changed, 70 insertions(+), 19 deletions(-)

diff --git a/cpp/src/arrow/extension_type-test.cc b/cpp/src/arrow/extension_type-test.cc
index 90f96cd..6b632a9 100644
--- a/cpp/src/arrow/extension_type-test.cc
+++ b/cpp/src/arrow/extension_type-test.cc
@@ -279,8 +279,8 @@ TEST_F(TestExtensionType, UnrecognizedExtension) {
 
   ASSERT_OK(UnregisterExtensionType("uuid"));
   auto ext_metadata =
-      key_value_metadata({{"arrow_extension_name", "uuid"},
-                          {"arrow_extension_data", "uuid-type-unique-code"}});
+      key_value_metadata({{"ARROW:extension:name", "uuid"},
+                          {"ARROW:extension:metadata", "uuid-type-unique-code"}});
   auto ext_field = field("f0", fixed_size_binary(16), true, ext_metadata);
   auto batch_no_ext = RecordBatch::Make(schema({ext_field}), 4, {storage_arr});
 
diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc
index 1d0ac8a..46f3366 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -62,8 +62,8 @@ using Offset = flatbuffers::Offset<void>;
 using FBString = flatbuffers::Offset<flatbuffers::String>;
 using KVVector = flatbuffers::Vector<KeyValueOffset>;
 
-static const char kExtensionTypeKeyName[] = "arrow_extension_name";
-static const char kExtensionDataKeyName[] = "arrow_extension_data";
+static const char kExtensionTypeKeyName[] = "ARROW:extension:name";
+static const char kExtensionMetadataKeyName[] = "ARROW:extension:metadata";
 
 MetadataVersion GetMetadataVersion(flatbuf::MetadataVersion version) {
   switch (version) {
@@ -370,7 +370,7 @@ static Status TypeFromFlatbuffer(const flatbuf::Field* field,
       return Status::OK();
     }
     std::string type_name = field_metadata->value(name_index);
-    int data_index = field_metadata->FindKey(kExtensionDataKeyName);
+    int data_index = field_metadata->FindKey(kExtensionMetadataKeyName);
     std::string type_data = data_index == -1 ? "" : field_metadata->value(data_index);
 
     std::shared_ptr<ExtensionType> type = GetExtensionType(type_name);
@@ -674,7 +674,7 @@ class FieldToFlatbufferVisitor {
   Status Visit(const ExtensionType& type) {
     RETURN_NOT_OK(VisitType(*type.storage_type()));
     extra_type_metadata_[kExtensionTypeKeyName] = type.extension_name();
-    extra_type_metadata_[kExtensionDataKeyName] = type.Serialize();
+    extra_type_metadata_[kExtensionMetadataKeyName] = type.Serialize();
     return Status::OK();
   }
 
diff --git a/docs/source/format/Metadata.rst b/docs/source/format/Metadata.rst
index b6c2a5f..f4be82b 100644
--- a/docs/source/format/Metadata.rst
+++ b/docs/source/format/Metadata.rst
@@ -29,9 +29,6 @@ systems to communicate the
 * "Data headers" indicating the physical locations of memory buffers sufficient
   to reconstruct a Arrow data structures without copying memory.
 
-Canonical implementation
-------------------------
-
 We are using `Flatbuffers`_ for low-overhead reading and writing of the Arrow
 metadata. See ``Message.fbs``.
 
@@ -65,8 +62,8 @@ the columns. The Flatbuffers IDL for a field is: ::
 The ``type`` is the logical type of the field. Nested types, such as List,
 Struct, and Union, have a sequence of child fields.
 
-Record data headers
--------------------
+Record Batch Data Headers
+-------------------------
 
 A record batch is a collection of top-level named, equal length Arrow arrays
 (or vectors). If one of the arrays contains nested data, its child arrays are
@@ -193,12 +190,74 @@ categories:
 Refer to `Schema.fbs`_ for up-to-date descriptions of each built-in
 logical type.
 
+Custom Application Metadata
+---------------------------
+
+We provide a ``custom_metadata`` field at three levels to provide a
+mechanism for developers to pass application-specific metadata in
+Arrow protocol messages. This includes ``Field``, ``Schema``, and
+``Message``.
+
+The colon symbol ``:`` is to be used as a namespace separator. It can
+be used multiple times in a key.
+
+The ``ARROW`` pattern is a reserved namespace for internal Arrow use
+in the ``custom_metadata`` fields. For example,
+``ARROW:extension:name``.
+
+Extension Types
+---------------
+
+User-defined "extension" types can be defined by setting certain
+``KeyValue`` pairs in ``custom_metadata`` in the ``Field`` metadata
+structure. These extension keys are:
+
+* ``'ARROW:extension:name'`` for the string name identifying the
+  custom data type. We recommend that you use a "namespace"-style
+  prefix for extension type names to minimize the possibility of
+  conflicts with multiple Arrow readers and writers in the same
+  application. For example, use ``myorg.name_of_type`` instead of
+  simply ``name_of_type``
+* ``'ARROW:extension:metadata'`` for a serialized representation
+  of the ``ExtensionType`` necessary to reconstruct the custom type
+
+This extension metadata can annotate any of the built-in Arrow logical
+types. The intent is that an implementation that does not support an
+extension type can still handle the underlying data. For example, a
+16-byte UUID value could be embedded in ``FixedSizeBinary(16)``, and
+implementations that do not have this extension type can still work
+with the underlying binary values and pass along the
+``custom_metadata`` in subsequent Arrow protocol messages.
+
+Extension types may or may not use the
+``'ARROW:extension:metadata'`` field. Let's consider some example
+extension types:
+
+* ``uuid`` represented as ``FixedSizeBinary(16)`` with empty metadata
+* ``latitude-longitude`` represented as ``struct<latitude: double,
+  longitude: double>``, and empty metadata
+* ``tensor`` (multidimensional array) stored as ``Binary`` values and
+  having serialized metadata indicating the data type and shape of
+  each value. This could be JSON like ``{'type': 'int8', 'shape': [4,
+  5]}`` for a 4x5 cell tensor.
+* ``trading-time`` represented as ``Timestamp`` with serialized
+  metadata indicating the market trading calendar the data corresponds
+  to
+
 Integration Testing
 -------------------
 
 A JSON representation of the schema is provided for cross-language
 integration testing purposes.
 
+Schema: ::
+
+    {
+      "fields" : [
+        /* Field */
+      ]
+    }
+
 Field: ::
 
     {
@@ -279,13 +338,5 @@ Interval: ::
       "unit" : "YEAR_MONTH|DAY_TIME"
     }
 
-Schema: ::
-
-    {
-      "fields" : [
-        /* Field */
-      ]
-    }
-
 .. _Flatbuffers: http://github.com/google/flatbuffers
 .. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs
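
The lookup that the patched TypeFromFlatbuffer performs on the two renamed keys can be sketched in isolation. This is a minimal, standalone illustration and not the Arrow C++ API: CustomMetadata, ExtensionInfo, FindExtension and IsReservedKey are hypothetical names introduced here, and a std::map stands in for a Field's custom_metadata. Only the two key strings come from the commit. Treating any key that starts with "ARROW:" as reserved is one reading of the namespace rule in the documentation change, not something the patch implements.

```cpp
#include <map>
#include <optional>
#include <string>

// Key names as renamed by this commit.
constexpr char kExtensionNameKey[] = "ARROW:extension:name";
constexpr char kExtensionMetadataKey[] = "ARROW:extension:metadata";

// Hypothetical stand-in for a Field's custom_metadata (not an Arrow class).
using CustomMetadata = std::map<std::string, std::string>;

// Hypothetical holder for the extension annotation found on a field.
struct ExtensionInfo {
  std::string name;      // value stored under ARROW:extension:name
  std::string metadata;  // value under ARROW:extension:metadata, "" if absent
};

// Returns the extension annotation, or std::nullopt when the name key is
// missing, in which case a reader would just use the storage type.
std::optional<ExtensionInfo> FindExtension(const CustomMetadata& md) {
  auto name_it = md.find(kExtensionNameKey);
  if (name_it == md.end()) return std::nullopt;
  // The metadata key is optional; default to an empty string, as the patch does.
  auto data_it = md.find(kExtensionMetadataKey);
  return ExtensionInfo{name_it->second,
                       data_it == md.end() ? std::string() : data_it->second};
}

// Assumption: a key is "reserved" when it begins with the ARROW namespace
// followed by the ':' separator.
bool IsReservedKey(const std::string& key) {
  return key.rfind("ARROW:", 0) == 0;
}
```

With this sketch, a field whose metadata holds only the name key "uuid" yields an ExtensionInfo with empty metadata, while a field with no extension keys yields std::nullopt and would be handled as plain FixedSizeBinary(16) data.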