Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 12:27:10 UTC

[GitHub] [arrow] vibhatha opened a new pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

vibhatha opened a new pull request #12555:
URL: https://github.com/apache/arrow/pull/12555


   This PR includes the following modifications:
   
   - Adding an example of using `TableSourceNodeOptions` for `table_source`
   - Updating the documentation on the streaming execution engine
   - Minor refactoring to rename `batch_size` to `max_batch_size`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1072918070


   Benchmark runs are scheduled for baseline = ae93d12d40d41d26abb8034cbdfb11d233d29402 and contender = 425efaa52744839cd98c2310e703058046171823. 425efaa52744839cd98c2310e703058046171823 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/619bbdce0209412e882d030bab8f2a22...dbb7b1d9886b4822b97bcba5501c88f8/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b9c6d4e726d04447b70b418f4cc3d86a...113350901b17459eab07a62f55e4d39b/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9deb95f35b0e476ea0d4ad4e046233d0...3b3571997ed94c2eb2e2b61e5f2ef6f2/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/908163fdb0ed45d38b6f885f83d18990...369ae94952df4d17977120cd38a47104/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] westonpace commented on a change in pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#discussion_r819129303



##########
File path: cpp/examples/arrow/execution_plan_documentation_examples.cc
##########
@@ -353,6 +363,38 @@ arrow::Status SourceSinkExample(cp::ExecContext& exec_context) {
 }
 // (Doc section: Source Example)
 
+// (Doc section: Table Source Example)
+/**
+ * \brief
+ * TableSource-Sink Example
+ * This example shows how a tabl_source and sink can be used
+ * in an execution plan. This includes table source node
+ * receiving data as a table and the sink node emits
+ * the data as an output represented in a table.

Review comment:
       ```suggestion
    * This example shows how a table_source and sink can be used
    * in an execution plan. This includes a table source node
    * receiving data from a table and the sink node emits
    * the data to a generator which we collect into a table.
   ```

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -396,6 +397,31 @@ Example of using ``source`` (usage of sink is explained in detail in :ref:`sink<
   :linenos:
   :lineno-match:
 
+``table_source``
+----------------
+
+.. _stream_execution_table_source_docs:
+
+In the previous example, :ref:`source node <stream_execution_source_docs>`, a source node
+was used to input the data and the node expected data to be fed using a generator. But in 
+developing application much easier and to enable performance optimization, the 
+:class:`arrow::compute::TableSourceNodeOptions` can be used. Here the input data can be 
+passed as a ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``. The objetive
+of the ``max_batch_size`` is to enable the user to tune the performance given the nature of 
+vivid workloads. Another important fact is that the streaming execution engine will use
+the ``max_batch_size`` as execution batch size. And it is important to note that the output
+batches from an execution plan doesn't get merged to form larger batches when an inserted 
+table has smaller batch size. 

Review comment:
       ```suggestion
   was used to input the data. But when developing an application, if the data is already in memory
   as a table, it is much easier and more performant to use :class:`arrow::compute::TableSourceNodeOptions`.
   Here the input data can be passed as a ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``. The ``max_batch_size`` is used to break up large record batches so that they can be processed in parallel. It is important to note that the table batches will not get merged to form larger batches when the source table has a smaller batch size.
   ```
   
   At the moment we use the source node's batch size as the execution batch size. Recently, however, there has been discussion about using a smaller batch size for execution: we would slice a large chunk off of the source, create a thread task based on that large chunk, and then iterate through it in smaller chunks. So for now, let's avoid talking about the execution batch size until that is more finalized.
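
The batching behaviour described in the suggestion above (chunks larger than `max_batch_size` are split, smaller chunks are not merged) can be sketched without Arrow. This is only a minimal model, assuming the table has a single chunk layout; `SliceChunks` is a hypothetical helper and not an Arrow API, and the real node's slicing may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arrow: given the row counts of a table's
// chunks, return the row counts of the batches a source would emit if it
// slices within each chunk and caps every batch at max_batch_size rows.
std::vector<int64_t> SliceChunks(const std::vector<int64_t>& chunk_rows,
                                 int64_t max_batch_size) {
  assert(max_batch_size > 0);
  std::vector<int64_t> batch_rows;
  for (int64_t remaining : chunk_rows) {
    // A chunk larger than the cap is split into several batches.
    while (remaining > 0) {
      const int64_t take = std::min(remaining, max_batch_size);
      batch_rows.push_back(take);
      remaining -= take;
    }
    // A chunk smaller than the cap is emitted as-is: a batch never spans
    // two chunks, so small batches are not merged into larger ones.
  }
  return batch_rows;
}
```

Under this model, chunks of 10 and 3 rows with a cap of 4 would yield batches of 4, 4, 2 and 3 rows; the trailing 2-row batch is not merged with the following chunk.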







[GitHub] [arrow] ursabot edited a comment on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1072918070


   Benchmark runs are scheduled for baseline = ae93d12d40d41d26abb8034cbdfb11d233d29402 and contender = 425efaa52744839cd98c2310e703058046171823. 425efaa52744839cd98c2310e703058046171823 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/619bbdce0209412e882d030bab8f2a22...dbb7b1d9886b4822b97bcba5501c88f8/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b9c6d4e726d04447b70b418f4cc3d86a...113350901b17459eab07a62f55e4d39b/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9deb95f35b0e476ea0d4ad4e046233d0...3b3571997ed94c2eb2e2b61e5f2ef6f2/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/908163fdb0ed45d38b6f885f83d18990...369ae94952df4d17977120cd38a47104/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] vibhatha commented on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1058676630


   Thank you for the review. I will address these issues.





[GitHub] [arrow] github-actions[bot] commented on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1057993027


   https://issues.apache.org/jira/browse/ARROW-15820





[GitHub] [arrow] westonpace closed pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #12555:
URL: https://github.com/apache/arrow/pull/12555


   





[GitHub] [arrow] ursabot edited a comment on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1072918070


   Benchmark runs are scheduled for baseline = ae93d12d40d41d26abb8034cbdfb11d233d29402 and contender = 425efaa52744839cd98c2310e703058046171823. 425efaa52744839cd98c2310e703058046171823 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/619bbdce0209412e882d030bab8f2a22...dbb7b1d9886b4822b97bcba5501c88f8/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b9c6d4e726d04447b70b418f4cc3d86a...113350901b17459eab07a62f55e4d39b/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9deb95f35b0e476ea0d4ad4e046233d0...3b3571997ed94c2eb2e2b61e5f2ef6f2/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/908163fdb0ed45d38b6f885f83d18990...369ae94952df4d17977120cd38a47104/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot commented on pull request #12555: ARROW-15820: [C++][Doc] Add table_source to streaming_execution.rst & clarify parameter name

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#issuecomment-1072918070


   Benchmark runs are scheduled for baseline = ae93d12d40d41d26abb8034cbdfb11d233d29402 and contender = 425efaa52744839cd98c2310e703058046171823. 425efaa52744839cd98c2310e703058046171823 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/619bbdce0209412e882d030bab8f2a22...dbb7b1d9886b4822b97bcba5501c88f8/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b9c6d4e726d04447b70b418f4cc3d86a...113350901b17459eab07a62f55e4d39b/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9deb95f35b0e476ea0d4ad4e046233d0...3b3571997ed94c2eb2e2b61e5f2ef6f2/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/908163fdb0ed45d38b6f885f83d18990...369ae94952df4d17977120cd38a47104/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   

