Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/30 01:19:29 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12689: ARROW-15515: [C++] Update ExecPlan example code and documentation with new options

westonpace commented on a change in pull request #12689:
URL: https://github.com/apache/arrow/pull/12689#discussion_r838028830



##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -647,6 +649,25 @@ SelectK example:
 
 .. _stream_execution_scan_docs:
 
+``table_sink``
+----------------
+
+.. _stream_execution_table_sink_docs:
+
+Considering the variety of sink nodes provided in the streaming execution engine, the ``table_sink`` node 
+provides the ability to take the output as a table. It is much easier to use 
+:class:`arrow::compute::TableSinkNodeOptions`.

Review comment:
       ```suggestion
   The ``table_sink`` node collects the output into an in-memory table.
   This is much simpler to use than the other sink nodes provided by the streaming execution engine,
   but it only makes sense when the output fits comfortably in memory. The node is created using :class:`arrow::compute::TableSinkNodeOptions`.
   ```

##########
File path: cpp/examples/arrow/execution_plan_documentation_examples.cc
##########
@@ -851,6 +851,50 @@ arrow::Status SourceUnionSinkExample(cp::ExecContext& exec_context) {
 
 // (Doc section: Union Example)
 
+// (Doc section: Table Sink Example)
+
+/// \brief An example showing a table sink node
+/// \param exec_context The execution context to run the plan in
+///
+/// TableSink Example
+/// This example shows how a table_sink can be used
+/// in an execution plan. This includes a source node
+/// receiving data as batches and the table sink node
+/// which emits the output as a table.
+arrow::Status TableSinkExample(cp::ExecContext& exec_context) {
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<cp::ExecPlan> plan,
+                        cp::ExecPlan::Make(&exec_context));
+
+  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());
+
+  auto source_node_options = cp::SourceNodeOptions{basic_data.schema, basic_data.gen()};
+
+  ARROW_ASSIGN_OR_RAISE(cp::ExecNode * source,
+                        cp::MakeExecNode("source", plan.get(), {}, source_node_options));
+
+  std::shared_ptr<arrow::Table> output_table;
+  auto table_sink_options = cp::TableSinkNodeOptions{&output_table, basic_data.schema};
+
+  ARROW_RETURN_NOT_OK(
+      cp::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
+  // validate the ExecPlan
+  ARROW_RETURN_NOT_OK(plan->Validate());
+  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
+  // start the ExecPlan
+  ARROW_RETURN_NOT_OK(plan->StartProducing());
+
+  auto finish = source->finished();
+
+  RETURN_NOT_OK(finish.status());
+
+  std::cout << "Results : " << output_table->ToString() << std::endl;
+
+  // plan mark finished
+  auto future = plan->finished();
+  return future.status();

Review comment:
       ```suggestion
     // Wait for the plan to finish
     auto finished = plan->finished();
     RETURN_NOT_OK(finished.status());
   
     std::cout << "Results : " << output_table->ToString() << std::endl;
     return Status::OK();
   ```
   I don't think it's a good idea to call `source->finished()`, because waiting on individual nodes is not something we want to encourage. Users should probably only ever wait on the plan.
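
To illustrate the point about waiting on the plan rather than on a single node, here is a minimal, stdlib-only sketch. It is a toy model and not Arrow code: `ToyPlan`, `ToyResult`, `RunToyPlan` and every member name are hypothetical, and a single worker thread stands in for both the source and the sink. It only shows the ordering problem: the source can report that it is finished while the sink has not yet assembled its output.

```cpp
#include <atomic>
#include <future>
#include <memory>
#include <numeric>
#include <thread>
#include <vector>

// Toy model only: none of these names are Arrow APIs.
class ToyPlan {
 public:
  void Start() {
    // Take every future before the worker starts so that no promise is
    // touched by two threads at the same time.
    source_future_ = source_finished_.get_future();
    plan_future_ = plan_finished_.get_future();
    worker_ = std::thread([this, gate = release_sink_.get_future()] {
      const std::vector<int> batches = {1, 2, 3};
      source_finished_.set_value();  // the "source" has emitted its last batch
      gate.wait();                   // the "sink" is still working downstream
      output_ = std::make_shared<std::vector<int>>(batches);
      sink_done_.store(true);
      plan_finished_.set_value();    // only now is the output safe to read
    });
  }

  void WaitForSource() { source_future_.wait(); }
  void WaitForPlan() { plan_future_.wait(); }
  void ReleaseSink() { release_sink_.set_value(); }
  bool SinkDone() const { return sink_done_.load(); }
  std::shared_ptr<std::vector<int>> output() const { return output_; }

  ~ToyPlan() {
    if (worker_.joinable()) worker_.join();
  }

 private:
  std::promise<void> source_finished_, release_sink_, plan_finished_;
  std::future<void> source_future_, plan_future_;
  std::atomic<bool> sink_done_{false};
  std::shared_ptr<std::vector<int>> output_;
  std::thread worker_;
};

struct ToyResult {
  bool sink_done_when_source_finished;
  int sum_after_plan_finished;
};

ToyResult RunToyPlan() {
  ToyPlan plan;
  plan.Start();

  // Waiting on the source alone says nothing about the sink.
  plan.WaitForSource();
  const bool early = plan.SinkDone();

  // Let the sink finish, then wait on the whole plan before reading output.
  plan.ReleaseSink();
  plan.WaitForPlan();
  const auto out = plan.output();
  return {early, std::accumulate(out->begin(), out->end(), 0)};
}
```

In this sketch `RunToyPlan()` returns `false` for `sink_done_when_source_finished` and `6` for `sum_after_plan_finished`. The gate makes the ordering deterministic here; in a real plan the same gap exists but depends on scheduling, which is why reading the output after waiting only on the source would be unreliable.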

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -647,6 +649,25 @@ SelectK example:
 
 .. _stream_execution_scan_docs:
 
+``table_sink``
+----------------
+
+.. _stream_execution_table_sink_docs:
+
+Considering the variety of sink nodes provided in the streaming execution engine, the ``table_sink`` node 
+provides the ability to take the output as a table. It is much easier to use 
+:class:`arrow::compute::TableSinkNodeOptions`.
+The output data can be obtained as a ``std::shared_ptr<arrow::Table>`` along with the output ``schema``. 

Review comment:
       Technically, I think it was an input schema (and it's going away now). A `Table` already has a schema associated with it, so there is no need to specify the output schema separately.



