You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/01 06:30:16 UTC

[GitHub] [arrow] rtpsw opened a new pull request, #13040: ARROW-16426: [C++] Add TeeNode to execution engine

rtpsw opened a new pull request, #13040:
URL: https://github.com/apache/arrow/pull/13040

   The existing write node is a consuming one while the proposed tee node is a pass-through one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1118662972

   @westonpace, the `R / AMD64 Windows C++ RTools 40 ucrt64` build error seems unrelated to the change. How to proceed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
rtpsw commented on code in PR #13040:
URL: https://github.com/apache/arrow/pull/13040#discussion_r866003157


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -413,9 +428,102 @@ Result<compute::ExecNode*> MakeWriteNode(compute::ExecPlan* plan,
   return node;
 }
 
+namespace {
+
+class TeeNode : public compute::MapNode {

Review Comment:
   From the [`tee` utility](https://www.man7.org/linux/man-pages/man1/tee.1.html), which probably refers to a T-junction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
rtpsw commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1115959047

   @westonpace, the build failures shows a dependency problem in some of the platforms. I'm not sure how to resolve this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1114147349

   https://issues.apache.org/jira/browse/ARROW-16426


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1124209281

   Benchmark runs are scheduled for baseline = d00caa947ea53774e0e6745c857bf622457081ad and contender = 7bfc7320285174beca11256e6258c99a208ce4b7. 7bfc7320285174beca11256e6258c99a208ce4b7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8f7a8b9f73604f1299c30c4d0277570f...fb1aaf665a194dc78672050cf2fc355c/)
   [Finished :arrow_down:0.08% :arrow_up:0.04%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1f2c3b60eb534f179c7a2d1a74924c26...8e82c4a7a588402796ec389cacca564a/)
   [Finished :arrow_down:0.71% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/32e4011505c344b79a3fabcbe26c3491...8e73cae0c5c4444e9cb4a5338e73ce38/)
   [Finished :arrow_down:0.0% :arrow_up:0.08%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f879f93915af46c6b3f42bf4ab1fe520...59cf9dd2fde74fc894bfe47e71b93ff1/)
   Buildkite builds:
   [Finished] [`7bfc7320` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/727)
   [Finished] [`7bfc7320` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/724)
   [Finished] [`7bfc7320` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/714)
   [Finished] [`7bfc7320` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/729)
   [Finished] [`d00caa94` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/726)
   [Finished] [`d00caa94` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/723)
   [Finished] [`d00caa94` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/713)
   [Finished] [`d00caa94` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/728)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13040:
URL: https://github.com/apache/arrow/pull/13040#discussion_r865534886


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -413,9 +428,108 @@ Result<compute::ExecNode*> MakeWriteNode(compute::ExecPlan* plan,
   return node;
 }
 
+namespace {
+
+class TeeNode : public compute::MapNode, public compute::BackpressureControl {

Review Comment:
   You don't need to extend `compute::BackpressureControl` here (and you don't need `backpressure_control_`.  That's only needed for the `ConsumingSink` to give the `SinkNodeConsumer` something they can hold onto to pause/resume.
   
   Since you're calling pause/resume from the node itself you can just use `this` for your calls to `Pause`/`Resume`



##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -413,9 +428,108 @@ Result<compute::ExecNode*> MakeWriteNode(compute::ExecPlan* plan,
   return node;
 }
 
+namespace {
+
+class TeeNode : public compute::MapNode, public compute::BackpressureControl {
+ public:
+  TeeNode(compute::ExecPlan* plan, std::vector<compute::ExecNode*> inputs,
+          std::shared_ptr<Schema> output_schema,
+          std::unique_ptr<internal::DatasetWriter> dataset_writer,
+          FileSystemDatasetWriteOptions write_options, bool async_mode)
+      : MapNode(plan, std::move(inputs), std::move(output_schema), async_mode),
+        dataset_writer_(std::move(dataset_writer)),
+        write_options_(std::move(write_options)),
+        backpressure_control_(this) {}
+
+  static Result<compute::ExecNode*> Make(compute::ExecPlan* plan,
+                                         std::vector<compute::ExecNode*> inputs,
+                                         const compute::ExecNodeOptions& options) {
+    RETURN_NOT_OK(ValidateExecNodeInputs(plan, inputs, 1, "TeeNode"));
+
+    const WriteNodeOptions write_node_options =
+        checked_cast<const WriteNodeOptions&>(options);
+    const FileSystemDatasetWriteOptions& write_options = write_node_options.write_options;
+    const std::shared_ptr<Schema> schema = inputs[0]->output_schema();
+
+    ARROW_ASSIGN_OR_RAISE(auto dataset_writer,
+                          internal::DatasetWriter::Make(write_options));
+
+    return plan->EmplaceNode<TeeNode>(plan, std::move(inputs), std::move(schema),
+                                      std::move(dataset_writer), std::move(write_options),
+                                      /*async_mode=*/true);
+  }
+
+  const char* kind_name() const override { return "TeeNode"; }
+
+  Result<compute::ExecBatch> DoTee(const compute::ExecBatch& batch) {
+    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> record_batch,
+                          batch.ToRecordBatch(output_schema()));
+    ARROW_RETURN_NOT_OK(WriteNextBatch(std::move(record_batch), batch.guarantee));
+    return batch;
+  }
+
+  Status WriteNextBatch(std::shared_ptr<RecordBatch> batch,
+                        compute::Expression guarantee) {
+    return WriteBatch(
+        batch, guarantee, write_options_,
+        [this](std::shared_ptr<RecordBatch> next_batch,
+               const Partitioning::PartitionPathFormat& destination) {
+          return task_group_.AddTask([this, next_batch, destination] {
+            util::tracing::Span span;
+            START_COMPUTE_SPAN(span, "Tee",
+                               {{"tee.base_dir", ToStringExtra()},
+                                {"tee.length", next_batch->num_rows()}});

Review Comment:
   This span isn't measuring anything terribly useful that I can tell.  `WriteRecordBatch` returns immediately after it queues the batch into the appropriate file queue so this span won't capture the actual time it took to write the batch.  Also, the timing here will be included as part of `InputReceived` anyways.  The `base_dir` might be slightly useful as an event to the current parent span but I'm not sure that is what `ToStringExtra()` is going to return.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace closed pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine
URL: https://github.com/apache/arrow/pull/13040


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1121780108

   I agree it is unrelated.  In the future, if there are build failures and they seem unrelated feel free to just mention something like "build failures seem unrelated" and (assuming its ready) re-request review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
rtpsw commented on code in PR #13040:
URL: https://github.com/apache/arrow/pull/13040#discussion_r866003157


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -413,9 +428,102 @@ Result<compute::ExecNode*> MakeWriteNode(compute::ExecPlan* plan,
   return node;
 }
 
+namespace {
+
+class TeeNode : public compute::MapNode {

Review Comment:
   From the [`tee` text utility](https://www.man7.org/linux/man-pages/man1/tee.1.html), which probably refers to a T-junction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1114147354

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13040:
URL: https://github.com/apache/arrow/pull/13040#issuecomment-1116417482

   Ah, it seems `MapNode` had not previously ever been used outside of the compute module.  `file_base.cc` is in the datasets module.  You will need to add the appropriate exports to `MapNode`.  This is an unfortunate process that seems to be unique to Windows.  This should be as simple as...
   
   ```
   class ARROW_EXPORT MapNode : public ExecNode {
   ...
   ```
   
   Although exposing things tends to have somewhat infectious consequences.  For example, I'm pretty sure you will then also need to expose `AtomicCounter` since it is referenced by `MapNode`.  However, that should be about it and it makes sense to expose both of those things as they will be generally useful to anyone building `ExecNode` implementations outside of Arrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #13040: ARROW-16426: [C++] Add TeeNode to execution engine

Posted by GitBox <gi...@apache.org>.
icexelloss commented on code in PR #13040:
URL: https://github.com/apache/arrow/pull/13040#discussion_r865968367


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -413,9 +428,102 @@ Result<compute::ExecNode*> MakeWriteNode(compute::ExecPlan* plan,
   return node;
 }
 
+namespace {
+
+class TeeNode : public compute::MapNode {

Review Comment:
   Where does the name "Tee" come from? Just curious



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org