You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/18 15:48:21 UTC

[GitHub] [arrow] pitrou opened a new pull request #7992: ARROW-9660: [C++] IPC - dictionaries in maps

pitrou opened a new pull request #7992:
URL: https://github.com/apache/arrow/pull/7992


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-681943423


   I've tried the suggested API changes but they would require some annoying C++ wrangling that's basically not worth it. Will merge if CI green.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-680893599


   Rebased, fixed conflict.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-677917549


   Looking.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-681947570


   Thanks. +1 from me


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7992: ARROW-9660: [C++] IPC - dictionaries in maps

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-675559545


   https://issues.apache.org/jira/browse/ARROW-9660


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r477340451



##########
File path: cpp/src/arrow/flight/internal.cc
##########
@@ -465,11 +465,9 @@ Status FromProto(const pb::SchemaResult& pb_result, std::string* result) {
 }
 
 Status SchemaToString(const Schema& schema, std::string* out) {
-  // TODO(wesm): Do we care about better memory efficiency here?
   ipc::DictionaryMemo unused_dict_memo;

Review comment:
       Hmm... since there seemed to be nothing concerning about memory efficiency, I thought this TODO was obsolete. What did you have in mind when writing it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r477340732



##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,96 +21,145 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
+
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}
+
+  FieldPosition child(int index) const { return {this, index}; }
+
+  std::vector<int> path() const {
+    std::vector<int> path(depth_);
+    const FieldPosition* cur = this;
+    for (int i = depth_ - 1; i >= 0; --i) {
+      path[i] = cur->index_;
+      cur = cur->parent_;
+    }
+    return path;
+  }
+
+ protected:
+  FieldPosition(const FieldPosition* parent, int index)
+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}
+
+  const FieldPosition* parent_;
+  int index_;
+  int depth_;
+};
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+/// \brief Map fields in a schema to dictionary ids
+///
+/// The mapping is structural, i.e. the field path (as a vector of indices)
+/// is associated to the dictionary id.

Review comment:
       Will add.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r477341587



##########
File path: cpp/src/arrow/ipc/metadata_internal.h
##########
@@ -48,6 +48,7 @@ namespace flatbuf = org::apache::arrow::flatbuf;
 
 namespace ipc {
 
+class DictionaryFieldMapper;
 class DictionaryMemo;

Review comment:
       `DictionaryMemo` is still passed in some IPC reading APIs, I think. I don't know if those are meant to be public.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7992:
URL: https://github.com/apache/arrow/pull/7992


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7992: ARROW-9660: [C++] IPC - dictionaries in maps

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r472355484



##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,34 +21,75 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}

Review comment:
       @bkietz The idea here is to keep track of the current position in a schema without allocating a `std::vector<int>` for each visited field, only when necessary (e.g. when a dictionary is encountered). What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#issuecomment-680038748


   I'm sorry about the delay on this, I will try to complete this code review today


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r474285549



##########
File path: cpp/src/arrow/flight/internal.cc
##########
@@ -465,11 +465,9 @@ Status FromProto(const pb::SchemaResult& pb_result, std::string* result) {
 }
 
 Status SchemaToString(const Schema& schema, std::string* out) {
-  // TODO(wesm): Do we care about better memory efficiency here?
   ipc::DictionaryMemo unused_dict_memo;

Review comment:
       This can be removed now?

##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,96 +21,145 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
+
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}
+
+  FieldPosition child(int index) const { return {this, index}; }
+
+  std::vector<int> path() const {
+    std::vector<int> path(depth_);
+    const FieldPosition* cur = this;
+    for (int i = depth_ - 1; i >= 0; --i) {
+      path[i] = cur->index_;
+      cur = cur->parent_;
+    }
+    return path;
+  }
+
+ protected:
+  FieldPosition(const FieldPosition* parent, int index)
+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}
+
+  const FieldPosition* parent_;
+  int index_;
+  int depth_;
+};
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+/// \brief Map fields in a schema to dictionary ids
+///
+/// The mapping is structural, i.e. the field path (as a vector of indices)
+/// is associated to the dictionary id.

Review comment:
       It would be helpful to state that a dictionary id can be associated with multiple field paths

##########
File path: cpp/src/arrow/ipc/metadata_internal.h
##########
@@ -48,6 +48,7 @@ namespace flatbuf = org::apache::arrow::flatbuf;
 
 namespace ipc {
 
+class DictionaryFieldMapper;
 class DictionaryMemo;

Review comment:
       Perhaps we should move these both to `ipc::internal`?

##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,96 +21,145 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
+
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}
+
+  FieldPosition child(int index) const { return {this, index}; }
+
+  std::vector<int> path() const {
+    std::vector<int> path(depth_);
+    const FieldPosition* cur = this;
+    for (int i = depth_ - 1; i >= 0; --i) {
+      path[i] = cur->index_;
+      cur = cur->parent_;
+    }
+    return path;
+  }
+
+ protected:
+  FieldPosition(const FieldPosition* parent, int index)
+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}
+
+  const FieldPosition* parent_;
+  int index_;
+  int depth_;
+};
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+/// \brief Map fields in a schema to dictionary ids
+///
+/// The mapping is structural, i.e. the field path (as a vector of indices)
+/// is associated to the dictionary id.
+class ARROW_EXPORT DictionaryFieldMapper {
+ public:
+  DictionaryFieldMapper();
+  explicit DictionaryFieldMapper(const Schema& schema);
+  ~DictionaryFieldMapper();
 
-namespace ipc {
+  Status AddSchemaFields(const Schema& schema);
+  Status AddField(int64_t id, std::vector<int> field_path);
 
-/// \brief Memoization data structure for assigning id numbers to
-/// dictionaries and tracking their current state through possible
-/// deltas in an IPC stream
+  Result<int64_t> GetFieldId(std::vector<int> field_path) const;
+
+  int num_fields() const;
+
+ private:
+  struct Impl;
+  std::unique_ptr<Impl> impl_;
+};
+
+using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;
+
+/// \brief Memoization data structure for reading dictionaries from IPC streams
+///
+/// This structure tracks the following associations:
+/// - field position (structural) -> dictionary id
+/// - dictionary id -> value type
+/// - dictionary id -> dictionary (value) data
+///
+/// Together, they allow resolving dictionary data when reading an IPC stream,
+/// using metadata recorded in the schema message and data recorded in the
+/// dictionary batch messages (see ResolveDictionaries).
+///
+/// This structure isn't useful for writing an IPC stream, where only
+/// DictionaryFieldMapper is necessary.
 class ARROW_EXPORT DictionaryMemo {
  public:
-  using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;
-
   DictionaryMemo();
-  DictionaryMemo(DictionaryMemo&&) = default;
-  DictionaryMemo& operator=(DictionaryMemo&&) = default;
+  ~DictionaryMemo();
+
+  DictionaryFieldMapper& fields();

Review comment:
       Is this necessary? It might be clearer to have `DictionaryFieldMapper* mutable_fields()`

##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,34 +21,75 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}

Review comment:
       Seems reasonable to me

##########
File path: cpp/src/arrow/ipc/dictionary.cc
##########
@@ -30,6 +31,14 @@
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace std {
+template <>
+struct hash<arrow::FieldPath> {
+  size_t operator()(const arrow::FieldPath& path) const { return path.hash(); }
+};

Review comment:
       This might be better placed in a common header, but this is ok for now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7992: ARROW-9660: [C++] Revamp dictionary association in IPC

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7992:
URL: https://github.com/apache/arrow/pull/7992#discussion_r477341153



##########
File path: cpp/src/arrow/ipc/dictionary.h
##########
@@ -21,96 +21,145 @@
 
 #include <cstdint>
 #include <memory>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
-#include "arrow/memory_pool.h"
+#include "arrow/result.h"
 #include "arrow/status.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
+namespace ipc {
+
+class FieldPosition {
+ public:
+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}
+
+  FieldPosition child(int index) const { return {this, index}; }
+
+  std::vector<int> path() const {
+    std::vector<int> path(depth_);
+    const FieldPosition* cur = this;
+    for (int i = depth_ - 1; i >= 0; --i) {
+      path[i] = cur->index_;
+      cur = cur->parent_;
+    }
+    return path;
+  }
+
+ protected:
+  FieldPosition(const FieldPosition* parent, int index)
+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}
+
+  const FieldPosition* parent_;
+  int index_;
+  int depth_;
+};
 
-class Array;
-class DataType;
-class Field;
-class RecordBatch;
+/// \brief Map fields in a schema to dictionary ids
+///
+/// The mapping is structural, i.e. the field path (as a vector of indices)
+/// is associated to the dictionary id.
+class ARROW_EXPORT DictionaryFieldMapper {
+ public:
+  DictionaryFieldMapper();
+  explicit DictionaryFieldMapper(const Schema& schema);
+  ~DictionaryFieldMapper();
 
-namespace ipc {
+  Status AddSchemaFields(const Schema& schema);
+  Status AddField(int64_t id, std::vector<int> field_path);
 
-/// \brief Memoization data structure for assigning id numbers to
-/// dictionaries and tracking their current state through possible
-/// deltas in an IPC stream
+  Result<int64_t> GetFieldId(std::vector<int> field_path) const;
+
+  int num_fields() const;
+
+ private:
+  struct Impl;
+  std::unique_ptr<Impl> impl_;
+};
+
+using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;
+
+/// \brief Memoization data structure for reading dictionaries from IPC streams
+///
+/// This structure tracks the following associations:
+/// - field position (structural) -> dictionary id
+/// - dictionary id -> value type
+/// - dictionary id -> dictionary (value) data
+///
+/// Together, they allow resolving dictionary data when reading an IPC stream,
+/// using metadata recorded in the schema message and data recorded in the
+/// dictionary batch messages (see ResolveDictionaries).
+///
+/// This structure isn't useful for writing an IPC stream, where only
+/// DictionaryFieldMapper is necessary.
 class ARROW_EXPORT DictionaryMemo {
  public:
-  using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;
-
   DictionaryMemo();
-  DictionaryMemo(DictionaryMemo&&) = default;
-  DictionaryMemo& operator=(DictionaryMemo&&) = default;
+  ~DictionaryMemo();
+
+  DictionaryFieldMapper& fields();

Review comment:
       Well, either this, or we duplicate the `DictionaryFieldMapper` APIs here. I have no particular preference.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org