Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/10 08:51:19 UTC

[GitHub] [spark] wangyum opened a new pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

wangyum opened a new pull request #29709:
URL: https://github.com/apache/spark/pull/29709


   ### What changes were proposed in this pull request?
   
   This PR adds support for a dynamic partition pruning hint.
   
   
   ### Why are the changes needed?
   
   For some joins, one side is very small but cannot be broadcast, for example:
   - a left join where the left side is very small
   - a right join where the right side is very small
   
   Some real cases from our cluster:
   - Case 1:
   ![image](https://user-images.githubusercontent.com/5399861/92703898-24b0fc00-f385-11ea-8ed8-be4f9e43460a.png)
   - Case 2:
   ![image](https://user-images.githubusercontent.com/5399861/92704038-3e524380-f385-11ea-9c3a-faba0815cb59.png)
   
   
   In these cases, `PlanDynamicPruningFilters` may not insert a dynamic partition pruning filter. We can add a hint that lets users force the insertion of dynamic partition pruning for a specific side.
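
To illustrate why the small side cannot be broadcast in the joins listed above: a broadcast hash join can only build on a side whose unmatched rows do not have to be preserved. The following is a simplified Python model of that rule as I understand it, not Spark's code; the function names and the join-type strings are made up for illustration.

```python
# Simplified model (not Spark's implementation): which side of a join may be
# used as the broadcast (build) side, depending on the join type.
BUILD_LEFT_OK = {"inner", "cross", "right_outer"}
BUILD_RIGHT_OK = {"inner", "cross", "left_outer", "left_semi", "left_anti"}


def broadcastable_sides(join_type: str) -> set:
    """Return the sides ("left", "right") that may be broadcast for join_type."""
    sides = set()
    if join_type in BUILD_LEFT_OK:
        sides.add("left")
    if join_type in BUILD_RIGHT_OK:
        sides.add("right")
    return sides


def small_side_can_broadcast(join_type: str, small_side: str) -> bool:
    """True if the small side is allowed to be the broadcast side."""
    return small_side in broadcastable_sides(join_type)
```

Under this model, `small_side_can_broadcast("left_outer", "left")` is false, which matches the first case above: the small left side cannot be broadcast, so a pruning rule that depends on the broadcast would not apply to it.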
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690160506


   Merged build finished. Test FAILed.


[GitHub] [spark] wangyum commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690328789


   > Can't we just do this automatically by applying the "would this be small
   > enough to broadcast" criterion instead of looking at the "is this actually
   > selected for broadcast"?
   
   First of all, this is a `SortMergeJoin`. To use DPP here, you need to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`, but disabling it may affect other joins.
   
   Second, the statistics of the plan are usually inaccurate, so we cannot determine during planning whether a side is suitable for broadcasting.
   
   So I think adding a hint is the most appropriate approach.
   


[GitHub] [spark] github-actions[bot] commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-748702663


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


[GitHub] [spark] SparkQA commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690094656


   **[Test build #128503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128503/testReport)** for PR 29709 at commit [`9844c8c`](https://github.com/apache/spark/commit/9844c8cec9bc690036c4d9606d3d23ff5ae52893).


[GitHub] [spark] bart-samwel commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
bart-samwel commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-691034060


   On Thu, Sep 10, 2020 at 4:31 PM Yuming Wang <no...@github.com>
   wrote:
   
   > Can't we just do this automatically by applying the "would this be small
   > enough to broadcast" criterion instead of looking at the "is this actually
   > selected for broadcast"?
   >
   > First of all, this is a SortMergeJoin. To use DPP, you need to disable
   > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly. But
   > disabling spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly
   > may affect other Join.
   >
   That just means that
   spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly needs a
   second setting, say
   spark.sql.optimizer.dynamicPartitionPruning.enableForBroadcastSizedJoinInputsOnly,
   which is slightly more lenient because it also applies to inputs that are
   broadcast-sized but that don't actually get broadcast because of <reasons>.
   
   
   > Second, the statistics of the plan are usually inaccurate, which makes us
   > unable to determine whether it is suitable for broadcasting in the plan
   > phase.
   >
   But we use statistics for that *now* already -- inaccurate as they are. And
   your argument is that this is for things that would broadcast otherwise
   except for the join type, which implies that the statistics would have
   worked here?
   
   
   > So, I think it’s most appropriate to add hint.
   
   
   -- 
   Bart Samwel
   bart.samwel@databricks.com
   


[GitHub] [spark] SparkQA commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690159908


   **[Test build #128503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128503/testReport)** for PR 29709 at commit [`9844c8c`](https://github.com/apache/spark/commit/9844c8cec9bc690036c4d9606d3d23ff5ae52893).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690160512


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128503/
   Test FAILed.


[GitHub] [spark] github-actions[bot] closed pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29709:
URL: https://github.com/apache/spark/pull/29709


   


[GitHub] [spark] SparkQA removed a comment on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690094656


   **[Test build #128503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128503/testReport)** for PR 29709 at commit [`9844c8c`](https://github.com/apache/spark/commit/9844c8cec9bc690036c4d9606d3d23ff5ae52893).


[GitHub] [spark] bart-samwel commented on pull request #29709: [WIP][SPARK-32842] Support dynamic partition pruning hint

Posted by GitBox <gi...@apache.org>.
bart-samwel commented on pull request #29709:
URL: https://github.com/apache/spark/pull/29709#issuecomment-690101496


   Can't we just do this automatically by applying the "would this be small
   enough to broadcast" criterion instead of looking at the "is this actually
   selected for broadcast"?
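
A rough sketch of the two criteria being contrasted here, written as a plain Python model and not taken from Spark's code. The `JoinSide` class and both function names are hypothetical; the only Spark fact assumed is that `spark.sql.autoBroadcastJoinThreshold` defaults to 10 MB.

```python
from dataclasses import dataclass

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass
class JoinSide:
    """Hypothetical summary of one join input, for illustration only."""
    estimated_size_bytes: int      # size estimate taken from plan statistics
    selected_for_broadcast: bool   # whether the planner actually broadcasts it


def prune_if_selected(side: JoinSide) -> bool:
    """Stricter criterion: prune only when the side is really broadcast."""
    return side.selected_for_broadcast


def prune_if_broadcast_sized(side: JoinSide,
                             threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> bool:
    """More lenient criterion: prune when the side is small enough to
    broadcast, whether or not it ends up being broadcast."""
    return 0 <= side.estimated_size_bytes <= threshold


# A 2 MB left side of a left join: small, but not selected for broadcast.
small_left = JoinSide(estimated_size_bytes=2 * 1024 * 1024,
                      selected_for_broadcast=False)
```

For `small_left`, `prune_if_selected` returns false while `prune_if_broadcast_sized` returns true, which is the gap the question is pointing at. Both criteria still depend on the size estimate being reasonably accurate.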
   

