You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/05 08:41:05 UTC

[GitHub] [arrow] rtpsw opened a new pull request, #13069: ARROW-15901: [C++] Support custom output field names in Substrait

rtpsw opened a new pull request, #13069:
URL: https://github.com/apache/arrow/pull/13069

   Replaces https://github.com/apache/arrow/pull/12601


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13069: ARROW-15901: [C++] Support custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13069:
URL: https://github.com/apache/arrow/pull/13069#discussion_r866900870


##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -232,10 +232,13 @@ class ARROW_EXPORT SinkNodeConsumer {
 /// \brief Add a sink node which consumes data within the exec plan run
 class ARROW_EXPORT ConsumingSinkNodeOptions : public ExecNodeOptions {
  public:
-  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer)
-      : consumer(std::move(consumer)) {}
+  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer,
+                                    std::vector<std::string> names = {})
+      : consumer(std::move(consumer)), names(std::move(names)) {}
 
   std::shared_ptr<SinkNodeConsumer> consumer;
+  /// \brief Names to rename the sink's schema fields to
+  std::vector<std::string> names;

Review Comment:
   Nit: May be distinguish the member variables with an underscore?
   
   ```c++
   std::shared_ptr<SinkNodeConsumer> consumer_;
   std::vector<std::string> names_;
   ```
   
   And change the rest accordingly. Looking into `ConsumingSinkNode`, that is the style adopted. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13069: ARROW-15901: [C++] Support custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1118309361

   https://issues.apache.org/jira/browse/ARROW-15901


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13069: ARROW-15901: [C++] Support custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13069:
URL: https://github.com/apache/arrow/pull/13069#discussion_r867032904


##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -232,10 +232,13 @@ class ARROW_EXPORT SinkNodeConsumer {
 /// \brief Add a sink node which consumes data within the exec plan run
 class ARROW_EXPORT ConsumingSinkNodeOptions : public ExecNodeOptions {
  public:
-  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer)
-      : consumer(std::move(consumer)) {}
+  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer,
+                                    std::vector<std::string> names = {})
+      : consumer(std::move(consumer)), names(std::move(names)) {}
 
   std::shared_ptr<SinkNodeConsumer> consumer;
+  /// \brief Names to rename the sink's schema fields to
+  std::vector<std::string> names;

Review Comment:
   In this case I think the naming scheme is correct.  There are two categories of objects we deal with, passive data objects and more complex stateful objects.  The google style guide likes to use structs for the former and classes for the latter and so it describes this as [structs vs. classes](https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes).
   
   In Arrow we are pretty inconsistent about applying `struct` and `class`.  Many of our "options" objects are `class` when they should be `struct` for example.  However, we are pretty consistent about the naming.  For a passive data object all properties should be public and [should not have trailing underscores](https://google.github.io/styleguide/cppguide.html#Variable_Names).
   
   Options objects should always be passive data objects.  Things that travel back and forth across library boundaries should generally be either passive data objects or pure virtual interfaces when possible.  So I do not think a change is needed here.  This options object should be a passive data object and it is following the correct rules for a passive data object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13069:
URL: https://github.com/apache/arrow/pull/13069#discussion_r868409125


##########
cpp/src/arrow/compute/exec/sink_node.cc:
##########
@@ -263,18 +263,21 @@ class SinkNode : public ExecNode {
 class ConsumingSinkNode : public ExecNode, public BackpressureControl {
  public:
   ConsumingSinkNode(ExecPlan* plan, std::vector<ExecNode*> inputs,
-                    std::shared_ptr<SinkNodeConsumer> consumer)
+                    std::shared_ptr<SinkNodeConsumer> consumer,
+                    std::vector<std::string> names)
       : ExecNode(plan, std::move(inputs), {"to_consume"}, {},
                  /*num_outputs=*/0),
-        consumer_(std::move(consumer)) {}
+        consumer_(std::move(consumer)),
+        names_(names) {}

Review Comment:
   ```suggestion
           names_(std::move(names)) {}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13069: ARROW-15901: [C++] Support custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1118598784

   @westonpace, I don't know why the errors in Travis CI occurred. Any idea how to resolve them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1147373495

   @westonpace, this has been pending for a while. How to proceed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13069: ARROW-15901: [C++] Support custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13069:
URL: https://github.com/apache/arrow/pull/13069#discussion_r867035845


##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -232,10 +232,13 @@ class ARROW_EXPORT SinkNodeConsumer {
 /// \brief Add a sink node which consumes data within the exec plan run
 class ARROW_EXPORT ConsumingSinkNodeOptions : public ExecNodeOptions {
  public:
-  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer)
-      : consumer(std::move(consumer)) {}
+  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer,
+                                    std::vector<std::string> names = {})
+      : consumer(std::move(consumer)), names(std::move(names)) {}
 
   std::shared_ptr<SinkNodeConsumer> consumer;
+  /// \brief Names to rename the sink's schema fields to

Review Comment:
   ```suggestion
     /// \brief Names to rename the sink's schema fields to
     ///
     /// If specified then names must be provided for all fields
   ```



##########
cpp/src/arrow/engine/substrait/serde.cc:
##########
@@ -67,15 +67,19 @@ Result<std::vector<compute::Declaration>> DeserializePlan(
 
   std::vector<compute::Declaration> sink_decls;
   for (const substrait::PlanRel& plan_rel : plan.relations()) {
+    ARROW_ASSIGN_OR_RAISE(
+        auto decl,
+        FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(),
+                  ext_set));
+    std::vector<std::string> names;
     if (plan_rel.has_root()) {
-      return Status::NotImplemented("substrait::PlanRel with custom output field names");
+      names.assign(plan_rel.root().names().begin(), plan_rel.root().names().end());
     }
-    ARROW_ASSIGN_OR_RAISE(auto decl, FromProto(plan_rel.rel(), ext_set));
 
     // pipe each relation into a consuming_sink node
     auto sink_decl = compute::Declaration::Sequence({
         std::move(decl),
-        {"consuming_sink", compute::ConsumingSinkNodeOptions{consumer_factory()}},
+        {"consuming_sink", compute::ConsumingSinkNodeOptions{consumer_factory(), names}},

Review Comment:
   ```suggestion
           {"consuming_sink", compute::ConsumingSinkNodeOptions{consumer_factory(), std::move(names)}},
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1120192073

   > I have a few minor comments but let's also add a test case for this.
   
   Added.
   
   > This also isn't yet a full implementation of the Substrait spec as it doesn't handle nested fields. For example, if there are three columns and the first column is a struct column with two fields then the names vector should have 5 items in it. However, if we want to postpone naming of nested fields to a future PR I think that is fine, just make sure to add a JIRA for it.
   
   I opted to postpone, and updated the title and the [ARROW-15901](https://issues.apache.org/jira/browse/ARROW-15901) accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1147611884

   I think this is ready to go.  If @vibhatha has any more thoughts he can of course still mention them and we can address in a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1120192452

   @westonpace, the AppVeyor failure seems unrelated to the change here. How to proceed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
westonpace merged PR #13069:
URL: https://github.com/apache/arrow/pull/13069


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13069:
URL: https://github.com/apache/arrow/pull/13069#discussion_r867223442


##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -232,10 +232,13 @@ class ARROW_EXPORT SinkNodeConsumer {
 /// \brief Add a sink node which consumes data within the exec plan run
 class ARROW_EXPORT ConsumingSinkNodeOptions : public ExecNodeOptions {
  public:
-  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer)
-      : consumer(std::move(consumer)) {}
+  explicit ConsumingSinkNodeOptions(std::shared_ptr<SinkNodeConsumer> consumer,
+                                    std::vector<std::string> names = {})
+      : consumer(std::move(consumer)), names(std::move(names)) {}
 
   std::shared_ptr<SinkNodeConsumer> consumer;
+  /// \brief Names to rename the sink's schema fields to
+  std::vector<std::string> names;

Review Comment:
   Ah I see your point. Thank you for the detailed explanation. 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13069: ARROW-15901: [C++] Support flat custom output field names in Substrait

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13069:
URL: https://github.com/apache/arrow/pull/13069#issuecomment-1143395787

   @vibhatha, is this good to go? I believe [this test](https://github.com/apache/arrow/pull/13069/files#diff-b4e59a9cedddaaa33c03f220c03ac812f9ebdfb06692c9f91c5f8f5988363b57) covers your request.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org