Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/07 13:26:48 UTC

[GitHub] [spark] wangyum opened a new pull request, #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

wangyum opened a new pull request, #38534:
URL: https://github.com/apache/spark/pull/38534

   ### What changes were proposed in this pull request?
   
   This PR makes `HashAggregateExec` adaptively skip partial aggregation, and so avoid spilling, when partial aggregation does not reduce the number of output rows enough to be worthwhile.
   
   Setting `spark.sql.aggregate.adaptivePartialAggregationThreshold` to 0 disables this feature.
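   
   To illustrate the idea, here is a toy sketch in plain Scala. It is not the code in this PR: the names `partialSum`, `finalSum`, `PartialResult`, and `minReductionRatio` are hypothetical, and the decision rule (check once after `threshold` input rows and compare the number of distinct groups to a ratio) is an assumption made for the example.

   ```scala
   import scala.collection.mutable

   object AdaptivePartialAggSketch {
     // Rows emitted by the partial phase, plus whether aggregation was abandoned.
     final case class PartialResult(rows: Vector[(String, Long)], skipped: Boolean)

     // Hypothetical partial SUM. Aggregates the first `threshold` rows, then checks
     // how much the row count was reduced. If the reduction is below
     // `minReductionRatio` (an assumed knob), it flushes the buffer and passes the
     // remaining rows through unaggregated. A threshold of 0 disables the check.
     def partialSum(
         input: Iterator[(String, Long)],
         threshold: Int,
         minReductionRatio: Double = 0.5): PartialResult = {
       val buffer = mutable.LinkedHashMap.empty[String, Long]
       val out = Vector.newBuilder[(String, Long)]
       var seen = 0L
       var skipped = false
       input.foreach { case (key, value) =>
         if (skipped) {
           // Pass-through mode: emit the row as its own partial result.
           out += ((key, value))
         } else {
           buffer.update(key, buffer.getOrElse(key, 0L) + value)
           seen += 1
           // Too many distinct groups means too little reduction: stop aggregating.
           if (threshold > 0 && seen == threshold &&
               buffer.size > seen * (1.0 - minReductionRatio)) {
             out ++= buffer
             buffer.clear()
             skipped = true
           }
         }
       }
       out ++= buffer // emit whatever is still buffered
       PartialResult(out.result(), skipped)
     }

     // The final phase merges partial rows by key, so the result is the same
     // whether or not the partial phase was skipped.
     def finalSum(rows: Seq[(String, Long)]): Map[String, Long] =
       rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
   }
   ```

   Because the final aggregation still merges rows by key, skipping the partial phase in this sketch changes only how many rows reach the final phase, not the result.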
   
   ### Why are the changes needed?
   
   This improves the performance of the partial aggregation phase. It also lets us implement the following two features after this PR:
   1. SPARK-36245: Partial deduplicate the right side of left semi/anti join
   2. SPARK-38506: Push partial aggregation through join
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests and a TPC-H 5T benchmark.
   
   SQL | Before this PR (seconds) | After this PR (seconds)
   -- | -- | --
   q15 | 66 | 51
   q17 | 86 | 80
   q18 | 129 | 122
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] shenjiayu17 commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on PR #38534:
URL: https://github.com/apache/spark/pull/38534#issuecomment-1343908559

   Hi @wangyum. I'm very interested in this optimization of partial aggregation. But why does it need these restrictions on the child nodes? Do they affect functionality or performance?
   ```
   private[sql] lazy val isAdaptivePartialAggregationEnabled = {
     requiredChildDistributionExpressions.isEmpty && groupingAttributes.nonEmpty &&
       conf.adaptivePartialAggregationThreshold > 0 &&
       conf.adaptivePartialAggregationThreshold < (1 << conf.fastHashAggregateRowMaxCapacityBit) && {
         child
           .collectUntil(p => p.isInstanceOf[WholeStageCodegenExec] ||
             !p.isInstanceOf[CodegenSupport] ||
             p.isInstanceOf[LeafExecNode]).forall {
             case _: ProjectExec | _: FilterExec | _: ColumnarToRowExec => true
             case _: SerializeFromObjectExec => true
             case _: InputAdapter => true
             // HashAggregateExec, ExpandExec, SortMergeJoinExec ...
             case _ => false
           }
       }
   }
   ```


[GitHub] [spark] github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive
URL: https://github.com/apache/spark/pull/38534


[GitHub] [spark] github-actions[bot] commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38534:
URL: https://github.com/apache/spark/pull/38534#issuecomment-1477119915

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


[GitHub] [spark] wangyum commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #38534:
URL: https://github.com/apache/spark/pull/38534#issuecomment-1306369308

   cc @cloud-fan 


[GitHub] [spark] xy2953396112 commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

Posted by "xy2953396112 (via GitHub)" <gi...@apache.org>.
xy2953396112 commented on PR #38534:
URL: https://github.com/apache/spark/pull/38534#issuecomment-1657066744

   @wangyum Is there any progress on this PR?

