You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/20 06:25:17 UTC

[GitHub] [arrow] ljishen opened a new pull request, #14681: ARROW-18367: Enable using InMemoryDataset to create substrait plans

ljishen opened a new pull request, #14681:
URL: https://github.com/apache/arrow/pull/14681

   - changes also include fixes from running IWYU over the directory cpp/src/arrow/engine/substrait


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1339896379

   I'm not caught up on the Substrait work. I believe Weston has just been busy lately and should be able to catch up. @westonpace, let me know if you want me to step in here instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1329991273

   > I have a few requests. This seems like a fairly clever way to produce named table relations.
   > 
   > The Substrait serializer works on declarations. There is no need for a declaration to have a node factory (or even have a corresponding exec node). Would it be simpler to do something like:
   > 
   > ```
   > /// Declare a named table, there is currently no way to consume this node
   > class NamedTableOptions {
   >   public:
   >     std::vector<std::string> names;
   >     std::shared_ptr<Schema> table_schema;
   > };
   > 
   > ...
   > 
   > Declaration named_tbl{"named_table", NamedTableOptions{{"table", "1"}, schema}};
   > 
   > ...
   > 
   > Result<compute::ExecNode*> MakeNamedTableNode(compute::ExecPlan* plan,
   >                                                std::vector<compute::ExecNode*> inputs,
   >                                                const compute::ExecNodeOptions& options) {
   >   return Status::Invalid("The named table node is for serialization purposes only and can never be converted into an exec plan or executed");
   > }
   > ...
   > 
   > DCHECK_OK(registry->AddFactory("named_table", MakeNamedTableNode));
   > ```
   
   Yes, I believe this is a better way to solve this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1339942554

   > I'm not caught up on the Substrait work. I believe Weston has just been busy lately and should be able to catch up. @westonpace, let me know if you want me to step in here instead.
   
   Thanks for your response.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #14681: ARROW-18367: [C++] Enable using InMemoryDataset to create substrait plans

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #14681:
URL: https://github.com/apache/arrow/pull/14681#discussion_r1028307635


##########
cpp/src/arrow/engine/substrait/expression_internal.cc:
##########
@@ -19,16 +19,44 @@
 
 #include "arrow/engine/substrait/expression_internal.h"
 
+#include <stdint.h>
+#include <algorithm>
+#include <array>
+#include <cstring>
+#include <functional>
 #include <memory>
+#include <optional>
+#include <string>
+#include <string_view>
+#include <type_traits>
 #include <utility>
-
-#include "arrow/builder.h"

Review Comment:
   What tool are you using to generate these includes?  I'm not sure about exploding `arrow/builder.h` but overall it seems like you caught quite a few IWYU violations which is handy but this is also introducing a lot of forward declarations which are already in `type_fwd.h` files.



##########
cpp/src/arrow/engine/substrait/expression_internal.cc:
##########
@@ -19,16 +19,44 @@
 
 #include "arrow/engine/substrait/expression_internal.h"
 
+#include <stdint.h>

Review Comment:
   ```suggestion
   #include <cstdint>
   ```



##########
cpp/src/arrow/engine/substrait/relation_internal.cc:
##########
@@ -689,7 +695,68 @@ Result<std::unique_ptr<substrait::ReadRel>> ScanRelationConverter(
     read_rel_lfs->mutable_items()->AddAllocated(read_rel_lfs_ffs.release());
   }
   read_rel->set_allocated_local_files(read_rel_lfs.release());
-  return std::move(read_rel);
+
+  return Status::OK();
+}
+
+Status SetReadRelationNamedTable(const dataset::InMemoryDataset* dataset,
+                                 std::unique_ptr<substrait::ReadRel>& read_rel) {
+  const auto& metadata = dataset->schema()->metadata();
+  if (metadata == nullptr || metadata->size() == 0) {
+    return Status::Invalid("Table name not found in dataset schema metadata");
+  }
+
+  ARROW_ASSIGN_OR_RAISE(auto table_name, metadata->Get(Table::kTableNameKey));
+  if (table_name.empty()) {
+    return Status::Invalid("Table name cannot be empty");
+  }
+
+  auto read_rel_tn = std::make_unique<substrait::ReadRel::NamedTable>();
+
+  const auto names = SplitString(table_name, '.');
+  for (auto& name : names) {
+    read_rel_tn->add_names(name.data(), name.size());
+  }
+  read_rel->set_allocated_named_table(read_rel_tn.release());
+
+  return Status::OK();
+}
+
+Result<std::unique_ptr<substrait::ReadRel>> ScanRelationConverter(
+    const std::shared_ptr<Schema>& schema, const compute::Declaration& declaration,
+    ExtensionSet* ext_set, const ConversionOptions& conversion_options) {
+  auto read_rel = std::make_unique<substrait::ReadRel>();
+  const auto& scan_node_options =
+      checked_cast<const dataset::ScanNodeOptions&>(*declaration.options);
+  auto* dataset = scan_node_options.dataset.get();
+
+  // set schema
+  ARROW_ASSIGN_OR_RAISE(auto named_struct,
+                        ToProto(*dataset->schema(), ext_set, conversion_options));
+  read_rel->set_allocated_base_schema(named_struct.release());
+
+  auto* file_system_dataset = dynamic_cast<dataset::FileSystemDataset*>(dataset);
+  if (file_system_dataset != nullptr) {
+    ARROW_RETURN_NOT_OK(SetReadRelationLocalFiles(file_system_dataset, read_rel));
+    return std::move(read_rel);
+  }
+
+  auto* in_memory_dataset = dynamic_cast<dataset::InMemoryDataset*>(dataset);
+  if (in_memory_dataset != nullptr) {
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder, in_memory_dataset->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
+    ARROW_ASSIGN_OR_RAISE(auto num_rows, scanner->CountRows());

Review Comment:
   We could maybe just add a `NumRows` method to `InMemoryDataset` to avoid having to create and run a scan which might save a bit of complexity.



##########
cpp/src/arrow/table.h:
##########
@@ -41,6 +41,8 @@ class MemoryPool;
 /// \brief Logical table as sequence of chunked arrays
 class ARROW_EXPORT Table {
  public:
+  static constexpr char const kTableNameKey[] = "tableName";
+

Review Comment:
   Given that this key is specific to Substrait use I'm not sure this is the right place for this.  In fact the "table" here is not an Arrow table but more of a database table (which is more similar to an arrow dataset).  Can you move this constant into the substrait package?



##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -2221,6 +2252,67 @@ TEST(Substrait, BasicPlanRoundTrippingEndToEnd) {
   EXPECT_TRUE(expected_table->Equals(*rnd_trp_table));
 }
 
+TEST(Substrait, FilterNamedTable) {
+  compute::ExecContext exec_context;
+  arrow::dataset::internal::Initialize();
+
+  const std::vector<std::string> table_name{"table", "1"};
+  const std::unordered_map<std::string, std::string> metadata = {
+      {Table::kTableNameKey, arrow::internal::JoinStrings(table_name, ".")}};
+  const auto dummy_schema =
+      schema({field("A", int32()), field("B", int32()), field("C", int32())},
+             key_value_metadata(metadata));
+  auto dataset =
+      std::make_shared<dataset::InMemoryDataset>(dummy_schema, RecordBatchVector{});
+
+  auto scan_options = std::make_shared<dataset::ScanOptions>();
+  scan_options->projection = compute::project({}, {});
+  auto filter = compute::equal(compute::field_ref("A"), compute::field_ref("B"));
+
+  auto declarations = compute::Declaration::Sequence(
+      {compute::Declaration(
+           {"scan", dataset::ScanNodeOptions{dataset, scan_options}, "s"}),
+       compute::Declaration({"filter", compute::FilterNodeOptions{filter}, "f"})});
+
+  std::shared_ptr<ExtensionIdRegistry> sp_ext_id_reg = MakeExtensionIdRegistry();
+  ExtensionIdRegistry* ext_id_reg = sp_ext_id_reg.get();
+  ExtensionSet ext_set(ext_id_reg);

Review Comment:
   You probably don't need to create an `ExtensionIdRegistry` (it will just use the default if you do not supply one).



##########
cpp/src/arrow/engine/substrait/relation_internal.cc:
##########
@@ -689,7 +695,68 @@ Result<std::unique_ptr<substrait::ReadRel>> ScanRelationConverter(
     read_rel_lfs->mutable_items()->AddAllocated(read_rel_lfs_ffs.release());
   }
   read_rel->set_allocated_local_files(read_rel_lfs.release());
-  return std::move(read_rel);
+
+  return Status::OK();
+}
+
+Status SetReadRelationNamedTable(const dataset::InMemoryDataset* dataset,

Review Comment:
   ```suggestion
   Status SetReadRelationNamedTable(const dataset::InMemoryDataset& dataset,
   ```
   The style guide prefers const reference over const pointer:
   
   > Non-optional input parameters should usually be values or const references



##########
cpp/src/arrow/engine/substrait/type_internal.cc:
##########
@@ -471,8 +482,14 @@ void ToProtoGetDepthFirstNames(const FieldVector& fields,
 Result<std::unique_ptr<::substrait::NamedStruct>> ToProto(
     const Schema& schema, ExtensionSet* ext_set,
     const ConversionOptions& conversion_options) {
-  if (schema.metadata()) {
-    return Status::Invalid("::substrait::NamedStruct does not support schema metadata");
+  const auto& metadata = schema.metadata();
+  if (metadata) {
+    std::unordered_map<std::string, std::string> metadata_map{};
+    metadata->ToUnorderedMap(&metadata_map);
+    metadata_map.erase(Table::kTableNameKey);
+    if (!metadata_map.empty()) {
+      return Status::Invalid("::substrait::NamedStruct does not support schema metadata");
+    }

Review Comment:
   Maybe only check this if the conversion strictness is `EXACT_ROUNDTRIP`?  Otherwise it seems ok to drop/ignore the metadata.



##########
cpp/src/arrow/engine/substrait/serde.h:
##########
@@ -25,19 +25,34 @@
 #include <string_view>
 #include <vector>
 
-#include "arrow/compute/type_fwd.h"
-#include "arrow/dataset/type_fwd.h"
 #include "arrow/engine/substrait/options.h"
-#include "arrow/engine/substrait/type_fwd.h"
 #include "arrow/engine/substrait/visibility.h"
 #include "arrow/result.h"
 #include "arrow/status.h"
-#include "arrow/type_fwd.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
+
+class Buffer;
+class DataType;
+class Schema;
+
+namespace compute {
+class ExecPlan;
+class Expression;
+class SinkNodeConsumer;
+struct Declaration;
+}  // namespace compute
+
+namespace dataset {
+class WriteNodeOptions;
+}
+
 namespace engine {
 
+class ExtensionIdRegistry;
+class ExtensionSet;

Review Comment:
   I'm not sure this is better than using the `type_fwd.h` files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable using InMemoryDataset to create substrait plans

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1324090847

   @westonpace Thanks for the great suggestions! I'll fix them and update the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
westonpace merged PR #14681:
URL: https://github.com/apache/arrow/pull/14681


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1343454183

   @westonpace thanks for the fix!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on a diff in pull request #14681: ARROW-18367: [C++] Enable using InMemoryDataset to create substrait plans

Posted by GitBox <gi...@apache.org>.
ljishen commented on code in PR #14681:
URL: https://github.com/apache/arrow/pull/14681#discussion_r1030009256


##########
cpp/src/arrow/engine/substrait/expression_internal.cc:
##########
@@ -19,16 +19,44 @@
 
 #include "arrow/engine/substrait/expression_internal.h"
 
+#include <stdint.h>
+#include <algorithm>
+#include <array>
+#include <cstring>
+#include <functional>
 #include <memory>
+#include <optional>
+#include <string>
+#include <string_view>
+#include <type_traits>
 #include <utility>
-
-#include "arrow/builder.h"

Review Comment:
   I used the `build-support/iwyu/iwyu.sh` to generate these includes. I've updated the code to use `type_fwd.h` files as much as possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14681: ARROW-18367: Enable using InMemoryDataset to create substrait plans

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1321050807

   https://issues.apache.org/jira/browse/ARROW-18367


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14681: ARROW-18367: Enable using InMemoryDataset to create substrait plans

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1321050813

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1339754698

   @lidavidm Could you help to take a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ljishen commented on pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
ljishen commented on PR #14681:
URL: https://github.com/apache/arrow/pull/14681#issuecomment-1341842681

   @westonpace No problem at all! Thanks so much for your review. I've updated the code with minor changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #14681: ARROW-18367: [C++] Enable the creation of named table relations

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #14681:
URL: https://github.com/apache/arrow/pull/14681#discussion_r1041921905


##########
cpp/src/arrow/engine/substrait/relation_internal.cc:
##########
@@ -638,6 +659,10 @@ Result<std::shared_ptr<Schema>> ExtractSchemaToBind(const compute::Declaration&
   } else if (declr.factory_name == "filter") {
     auto input_declr = std::get<compute::Declaration>(declr.inputs[0]);
     ARROW_ASSIGN_OR_RAISE(bind_schema, ExtractSchemaToBind(input_declr));
+  } else if (declr.factory_name == "named_table") {
+    const auto& opts =
+        checked_cast<const compute::NamedTableNodeOptions&>(*(declr.options));

Review Comment:
   ```suggestion
           checked_cast<const compute::NamedTableNodeOptions&>(*declr.options);
   ```



##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -2260,6 +2293,57 @@ TEST(SubstraitRoundTrip, BasicPlanEndToEnd) {
   EXPECT_TRUE(expected_table->Equals(*rnd_trp_table));
 }
 
+TEST(SubstraitRoundTrip, FilterNamedTable) {
+  compute::ExecContext exec_context;
+  arrow::dataset::internal::Initialize();
+
+  const std::vector<std::string> table_names{"table", "1"};
+  const auto dummy_schema =
+      schema({field("A", int32()), field("B", int32()), field("C", int32())});
+  auto filter = compute::equal(compute::field_ref("A"), compute::field_ref("B"));
+
+  auto declarations = compute::Declaration::Sequence(
+      {compute::Declaration({"named_table",
+                             compute::NamedTableNodeOptions{table_names, dummy_schema},
+                             "n"}),
+       compute::Declaration({"filter", compute::FilterNodeOptions{filter}, "f"})});
+
+  ExtensionSet ext_set{};
+  ASSERT_OK_AND_ASSIGN(auto serialized_plan, SerializePlan(declarations, &ext_set));
+
+  // creating a dummy dataset using a dummy table
+  auto input_table = TableFromJSON(dummy_schema, {R"([
+      [1, 1, 10],
+      [3, 5, 20],
+      [4, 1, 30],
+      [2, 1, 40],
+      [5, 5, 50],
+      [2, 2, 60]
+  ])"});
+
+  NamedTableProvider table_provider =
+      [&input_table, &table_names](
+          const std::vector<std::string>& names) -> Result<compute::Declaration> {
+    if (table_names != names) {
+      return Status::Invalid("Table name mismatch");
+    }
+    std::shared_ptr<compute::ExecNodeOptions> options =
+        std::make_shared<compute::TableSourceNodeOptions>(input_table);
+    return compute::Declaration("table_source", {}, options, "mock_source");

Review Comment:
   ```suggestion
       return compute::Declaration("table_source", {}, std::move(options), "mock_source");
   ```



##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -85,6 +85,19 @@ class ARROW_EXPORT TableSourceNodeOptions : public ExecNodeOptions {
   int64_t max_batch_size;
 };
 
+/// \brief Define a lazy resolved Arrow table.
+///
+/// The table uniquely identified by the names can typically be resolved at the time when
+/// the plan is to be consumed.

Review Comment:
   ```suggestion
   /// the plan is to be consumed.
   ///
   /// This node is for serialization purposes only and can never be executed
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org