You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/19 14:07:16 UTC

[GitHub] [arrow] jvanstraten commented on a diff in pull request #12852: ARROW-15583: [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors

jvanstraten commented on code in PR #12852:
URL: https://github.com/apache/arrow/pull/12852#discussion_r853049018


##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -158,14 +178,14 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
   /// An extension set should instead be created using
   /// arrow::engine::GetExtensionSetFromPlan
   static Result<ExtensionSet> Make(
-      std::vector<util::string_view> uris, std::vector<Id> type_ids,
+      std::unordered_map<uint32_t, util::string_view> uris, std::vector<Id> type_ids,
       std::vector<bool> type_is_variation, std::vector<Id> function_ids,
       ExtensionIdRegistry* = default_extension_id_registry());
 
   // index in these vectors == value of _anchor/_reference fields
   /// TODO(ARROW-15583) this assumes that _anchor/_references won't be huge, which is not
   /// guaranteed. Could it be?

Review Comment:
   This comment is no longer applicable.



##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -226,14 +246,10 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
  private:
   ExtensionIdRegistry* registry_;
   /// The subset of extension registry URIs referenced by this extension set
-  std::vector<util::string_view> uris_;
+  std::unordered_map<uint32_t, util::string_view> uris_;
   std::vector<TypeRecord> types_;
-
   std::vector<FunctionRecord> functions_;

Review Comment:
   These should also be converted to `unordered_map`s; in fact, they are more important to convert, as there will be lots more functions and types than URIs in a typical plan.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }

Review Comment:
   When the code is reimplemented as it should be, this function should become dead code (and be removed), as it is what's causing the potentially large allocations.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key, value>* map) {
+  DCHECK_LE(i, 1 << 20);
+  if (i >= map->size()) {
+    map->reserve(i + 1);
+  }

Review Comment:
   This is no longer applicable when a map is used. The `reserve()` call is actually highly counterproductive, since a map with the single key 100000 will reserve room for 100001 key-value pairs (this is, in fact, the core problem that needs solving).
   
   Once these lines are removed, I'd argue that the whole function reduces to nothing and should just be inlined.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key, value>* map) {
+  DCHECK_LE(i, 1 << 20);
+  if (i >= map->size()) {
+    map->reserve(i + 1);
+  }
+  (*map)[i] = static_cast<value>(element);
+}
+
 }  // namespace
 
 Result<ExtensionSet> GetExtensionSetFromPlan(const substrait::Plan& plan,
                                              ExtensionIdRegistry* registry) {
-  std::vector<util::string_view> uris;
+  std::unordered_map<uint32_t, util::string_view> uris;
   for (const auto& uri : plan.extension_uris()) {
-    SetElement(uri.extension_uri_anchor(), uri.uri(), &uris);
+    SetMapElement(uri.extension_uri_anchor(), uri.uri(), &uris);
   }
 
   // NOTE: it's acceptable to use views to memory owned by plan; ExtensionSet::Make

Review Comment:
   GitHub won't let me add comments to code this far away from a change, but these vectors a few lines further down (as well as the equivalent fields of `ExtensionSet`) should also be replaced with maps:
   
   ```c++
     std::vector<Id> type_ids, function_ids;
     std::vector<bool> type_is_variation;
   ```
   
   When you change this and everything affected by it, you'll find that `ExtensionSet::EncodeType` and `ExtensionSet::EncodeFunction` won't work anymore as written. Functionally, they include code to generate a key that wasn't used before, which is easy with a vector, but for a map would require searching for an unused key by iterating over all possible keys until you find one. It might therefore be better to use an ordered `map`, in which case you can use something like `map::rbegin()->first + 1` for the typical case (though you'd still need to implement the empty map and overflow edge cases; in the former case you should return 1, and in the latter case you'd need to exhaustively search for a free key anyway). Don't use anchor 0 (the current implementation probably does); that anchor value is reserved for undefined references in some contexts. If you really want to use an `unordered_map` instead of a `map`, you can also keep track of the latest key that was added to the map and use
  it as a start position for looking for free anchors; that'd still be pretty efficient in the typical case, but requires a bit more bookkeeping.
   
   Furthermore, as I pointed out in my comment to the JIRA, using a single anchor map for both types and type variations, and then using `type_is_variation` to distinguish between types and type variation, was a misinterpretation of the Substrait specification as far as I'm concerned: it's perfectly valid (or at least, unambiguous) for a plan to define a type with anchor 1 as well as a completely unrelated type variation with anchor 1. To be clear, a correct, complete implementation should look something like this:
   
   ```c++
     std::[unordered_]map<uint32_t, Id> type_ids, type_variation_ids, function_ids;
   ```
   
   That's a lot more involved though, as the assumption that a user-defined type or type variation is uniquely identified by a single number is pretty deeply ingrained in the round-tripping code. However, since type variations are not (correctly) implemented anyway, it's probably better for now to just remove the broken and incomplete code related to type variations for now, so it'd just become
   
   ```c++
     std::[unordered_]map<uint32_t, Id> type_ids, function_ids;
   ```
   
   On the consumer end, a nonzero type variation reference [already yields an error](https://github.com/sanjibansg/arrow/blob/b6b445c16f7829a26b91ab447e36d8cddbc32ce3/cpp/src/arrow/engine/substrait/type_internal.cc#L43) (this is why anchor 0 should not be used, by the way). The type variation anchor definitions in the `kExtensionTypeVariation` case in the loop below could just be silently ignored (which is what I would personally do), but it'd be more in line with the rest of the implementation to emit an error when encountering those too, as the implementation aims to error out on anything that wouldn't round-trip perfectly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org