Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 23:00:46 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

westonpace commented on a change in pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#discussion_r819129303



##########
File path: cpp/examples/arrow/execution_plan_documentation_examples.cc
##########
@@ -353,6 +363,38 @@ arrow::Status SourceSinkExample(cp::ExecContext& exec_context) {
 }
 // (Doc section: Source Example)
 
+// (Doc section: Table Source Example)
+/**
+ * \brief
+ * TableSource-Sink Example
+ * This example shows how a tabl_source and sink can be used
+ * in an execution plan. This includes table source node
+ * receiving data as a table and the sink node emits
+ * the data as an output represented in a table.

Review comment:
       ```suggestion
    * This example shows how a table_source and sink can be used
    * in an execution plan. This includes a table source node
    * receiving data from a table and the sink node emits
    * the data to a generator which we collect into a table.
   ```

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -396,6 +397,31 @@ Example of using ``source`` (usage of sink is explained in detail in :ref:`sink<
   :linenos:
   :lineno-match:
 
+``table_source``
+----------------
+
+.. _stream_execution_table_source_docs:
+
+In the previous example, :ref:`source node <stream_execution_source_docs>`, a source node
+was used to input the data and the node expected data to be fed using a generator. But in 
+developing application much easier and to enable performance optimization, the 
+:class:`arrow::compute::TableSourceNodeOptions` can be used. Here the input data can be 
+passed as a ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``. The objetive
+of the ``max_batch_size`` is to enable the user to tune the performance given the nature of 
+vivid workloads. Another important fact is that the streaming execution engine will use
+the ``max_batch_size`` as execution batch size. And it is important to note that the output
+batches from an execution plan doesn't get merged to form larger batches when an inserted 
+table has smaller batch size. 

Review comment:
       ```suggestion
   was used to input the data.  But when developing an application, if the data is already in memory
   as a table, it is much easier, and more performant, to use :class:`arrow::compute::TableSourceNodeOptions`.
   Here the input data can be passed as a ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``. The ``max_batch_size`` is to break up large record batches so that they can be processed in parallel.  It is important to note that the table batches will not get merged to form larger batches when the source table has a smaller batch size.
   ```
   
   At the moment we use the source node's batch size as the execution batch size.  Recently, however, there have been talks about maybe using a smaller batch size for execution.  So we would slice a large chunk off of the source, create a thread task based on the large chunk, and then iterate through that large chunk in smaller chunks.  So for now, let's just avoid talking about the execution batch size until that is more finalized.
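
For readers following this thread, the pattern being documented looks roughly like the sketch below. This is not the PR's actual example code; it is a hedged illustration assuming the Arrow C++ compute API of this era (``cp::ExecPlan::Make``, the ``"table_source"`` and ``"sink"`` factory names, ``cp::TableSourceNodeOptions``, ``cp::MakeGeneratorReader``). The input ``table`` is assumed to be built elsewhere, and exact header paths and types may differ between Arrow releases.

```cpp
// Sketch only: assumes Apache Arrow C++ (circa 7.x) is installed and linked.
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/exec_plan.h>
#include <arrow/compute/exec/options.h>
#include <arrow/util/async_generator.h>

namespace cp = arrow::compute;

// Feed an in-memory table into an execution plan via a table_source node,
// then collect the output through a sink generator back into a table.
arrow::Status TableSourceSinkSketch(std::shared_ptr<arrow::Table> table) {
  cp::ExecContext exec_context;
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<cp::ExecPlan> plan,
                        cp::ExecPlan::Make(&exec_context));

  // max_batch_size breaks up large record batches so they can be processed
  // in parallel; smaller source batches are NOT merged into larger ones.
  int64_t max_batch_size = 2;
  cp::TableSourceNodeOptions table_source_options{table, max_batch_size};
  ARROW_ASSIGN_OR_RAISE(
      cp::ExecNode * source,
      cp::MakeExecNode("table_source", plan.get(), {}, table_source_options));

  arrow::AsyncGenerator<arrow::util::optional<cp::ExecBatch>> sink_gen;
  ARROW_RETURN_NOT_OK(
      cp::MakeExecNode("sink", plan.get(), {source},
                       cp::SinkNodeOptions{&sink_gen}));

  // Translate the sink generator into a RecordBatchReader and drain it.
  std::shared_ptr<arrow::RecordBatchReader> reader = cp::MakeGeneratorReader(
      table->schema(), std::move(sink_gen), exec_context.memory_pool());

  ARROW_RETURN_NOT_OK(plan->StartProducing());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> out,
                        arrow::Table::FromRecordBatchReader(reader.get()));
  plan->StopProducing();
  return arrow::Status::OK();
}
```

Note how, per the review above, the sink node emits data to a generator which is then collected into a table; the wording suggested for the doc comment matches this flow.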




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org