You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/25 16:06:58 UTC

[GitHub] [arrow-datafusion] andygrove opened a new pull request, #2791: Add config option for coalesce_batches physical optimization rule

andygrove opened a new pull request, #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes https://github.com/apache/arrow-datafusion/issues/2790
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Give users control over optimization rules that are applied.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   - Add new config options
   - Disable `CoalesceBatchesExec` by default since it can add overhead
   - Unrelated change to update the config docs generator to always sort by config key
   - Update the config docs
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   No
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2791: Add config option for coalesce_batches physical optimization rule

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791#discussion_r906699198


##########
datafusion/core/src/config.rs:
##########
@@ -124,7 +147,11 @@ impl BuiltInConfigs {
         let configs = Self::new();
         let mut docs = "| key | type | default | description |\n".to_string();
         docs += "|-----|------|---------|-------------|\n";
-        for config in configs.config_definitions {
+        for config in configs
+            .config_definitions
+            .iter()
+            .sorted_by_key(|c| c.key.clone())

Review Comment:
   Make the output deterministic



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2791: Add config option for coalesce_batches physical optimization rule, make optional

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791#discussion_r908739989


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1247,16 +1250,26 @@ impl SessionState {
         rules.push(Arc::new(LimitPushDown::new()));
         rules.push(Arc::new(SingleDistinctToGroupBy::new()));
 
+        let mut physical_optimizers: Vec<Arc<dyn PhysicalOptimizerRule + Sync + Send>> = vec![
+            Arc::new(AggregateStatistics::new()),
+            Arc::new(HashBuildProbeOrder::new()),
+        ];
+        if config.config_options.get_bool(OPT_COALESCE_BATCHES) {

Review Comment:
   I have added tests for disabling the optimizer rule and also for setting a custom batch size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2791: Add config option for coalesce_batches physical optimization rule

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791#issuecomment-1166333659

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2791](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (73066af) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/7c60412b7beb599b8f6b9ed49528a450aa48e56b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7c60412) will **decrease** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2791      +/-   ##
   ==========================================
   - Coverage   85.14%   85.14%   -0.01%     
   ==========================================
     Files         273      273              
     Lines       48248    48259      +11     
   ==========================================
   + Hits        41079    41088       +9     
   - Misses       7169     7171       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/core/src/config.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2791/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9jb25maWcucnM=) | `90.76% <100.00%> (+0.44%)` | :arrow_up: |
   | [datafusion/core/src/execution/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2791/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `78.72% <100.00%> (+0.18%)` | :arrow_up: |
   | [...on/core/src/physical\_optimizer/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2791/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9vcHRpbWl6ZXIvY29hbGVzY2VfYmF0Y2hlcy5ycw==) | `100.00% <100.00%> (ø)` | |
   | [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2791/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `73.51% <0.00%> (-0.40%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [7c60412...73066af](https://codecov.io/gh/apache/arrow-datafusion/pull/2791?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2791: Add config option for coalesce_batches physical optimization rule, make optional

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791#discussion_r908103294


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1247,16 +1250,26 @@ impl SessionState {
         rules.push(Arc::new(LimitPushDown::new()));
         rules.push(Arc::new(SingleDistinctToGroupBy::new()));
 
+        let mut physical_optimizers: Vec<Arc<dyn PhysicalOptimizerRule + Sync + Send>> = vec![
+            Arc::new(AggregateStatistics::new()),
+            Arc::new(HashBuildProbeOrder::new()),
+        ];
+        if config.config_options.get_bool(OPT_COALESCE_BATCHES) {

Review Comment:
   Sounds good. I will get to this in the next day or two.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2791: Add config option for coalesce_batches physical optimization rule, make optional

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791#discussion_r906813144


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1247,16 +1250,26 @@ impl SessionState {
         rules.push(Arc::new(LimitPushDown::new()));
         rules.push(Arc::new(SingleDistinctToGroupBy::new()));
 
+        let mut physical_optimizers: Vec<Arc<dyn PhysicalOptimizerRule + Sync + Send>> = vec![
+            Arc::new(AggregateStatistics::new()),
+            Arc::new(HashBuildProbeOrder::new()),
+        ];
+        if config.config_options.get_bool(OPT_COALESCE_BATCHES) {

Review Comment:
   I wonder if it would be helpful to add a test to ensure the option to disable coalsce'ing batches doesn't get broken in the future



##########
datafusion/core/src/config.rs:
##########
@@ -27,6 +28,13 @@ pub const OPT_FILTER_NULL_JOIN_KEYS: &str = "datafusion.optimizer.filter_null_jo
 /// Configuration option "datafusion.execution.batch_size"
 pub const OPT_BATCH_SIZE: &str = "datafusion.execution.batch_size";
 
+/// Configuration option "datafusion.execution.coalesce_batches"

Review Comment:
   I really like how this new option framework is coming together ❤️ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan merged pull request #2791: Add config option for coalesce_batches physical optimization rule, make optional

Posted by GitBox <gi...@apache.org>.
Dandandan merged PR #2791:
URL: https://github.com/apache/arrow-datafusion/pull/2791


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org