Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/29 05:38:36 UTC

[GitHub] [iceberg] pan3793 opened a new pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

pan3793 opened a new pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203


   Leverage SPARK-32056 to coalesce partitions for repartition by expressions when AQE is enabled in Spark 3.1
   
   Because `RequiresDistributionAndOrdering` was introduced in SPARK-33779, which is part of Spark 3.2, https://github.com/apache/iceberg/pull/2512#discussion_r658040759 is not suitable for Spark 3.1.
   
   This PR picks up the idea from https://github.com/apache/iceberg/pull/2512#discussion_r657095040.
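   
   As a rough illustration of the Spark behavior this leverages (a hedged sketch using only the public DataFrame API; the session setup and column names below are assumptions, not code from this PR): with AQE enabled, a repartition by expressions that does not fix a partition count can be coalesced at runtime, while an explicit count always yields exactly that many partitions.
   
       import org.apache.spark.sql.SparkSession
       import org.apache.spark.sql.functions.col
   
       // Hedged sketch, not from the PR: what SPARK-32056 enables on Spark 3.1.
       val spark = SparkSession.builder()
         .master("local[*]")
         .appName("spark-32056-sketch")
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate()
   
       val df = spark.range(0, 1000).withColumn("bucket", col("id") % 8)
   
       // No explicit count: on Spark 3.1 with AQE enabled, the shuffle behind this
       // repartition may be coalesced to fewer partitions based on actual data size.
       val coalescible = df.repartition(col("bucket"))
   
       // Explicit count: the shuffle keeps exactly 200 partitions regardless of data size.
       val fixed = df.repartition(200, col("bucket"))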


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pan3793 commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r720627165



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None
+    }
+  }
+
   def createRepartitionByExpression(
       partitionExpressions: Seq[Expression],
       child: LogicalPlan,
-      numPartitions: Int): RepartitionByExpression = {
+      numPartitions: Option[Int]): RepartitionByExpression = {
     if (Spark3VersionUtil.isSpark30) {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
+      numPartitions match {
+        case Some(num) =>
+          repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(num))
+        case None =>
+          throw new IllegalArgumentException("numPartitions is required before SPARK-32056")

Review comment:
       > It doesn't make sense to me that there is a case where we would allow passing an incorrect value here.
   
   The check is redundant for now, because we only use it in one place and make sure to call it correctly. But I think `createRepartitionByExpression` is likely to be used in other places in the future, so adding a check here makes it safer.
   
   > any version other than 3.0 will not set shuffle partitions
   
   That's no longer correct after SPARK-33779 (part of Spark 3.2); see https://github.com/apache/iceberg/pull/2512#discussion_r658040759
   
   Do we intend to reuse this code for Spark 3.2?






[GitHub] [iceberg] pan3793 commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r720625544



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None

Review comment:
       We don't need to do that; see https://github.com/apache/spark/pull/28900/files#diff-0ac2c89f8cc0d00e8fe7717b01697f36f20fe8abf2def09bcfde0ad84b30e467R968
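   
   For context, a hedged paraphrase of the fallback the linked Spark change relies on (assumed from the link above; simplified illustration, not Iceberg or Spark source): when no explicit count is supplied, the repartition still resolves to the spark.sql.shuffle.partitions default, and AQE only coalesces on top of that when it is enabled, so returning `None` stays safe even if a user disables AQE.
   
       // Hedged paraphrase of the assumed Spark 3.1 fallback (simplified, illustrative only).
       def effectivePartitionCount(optNumPartitions: Option[Int], defaultShufflePartitions: Int): Int =
         optNumPartitions.getOrElse(defaultShufflePartitions)
   
       // With no explicit count and spark.sql.shuffle.partitions = 200, the plan still
       // resolves to 200 partitions; AQE may coalesce them further only when enabled.
       val resolved = effectivePartitionCount(None, 200)  // 200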






[GitHub] [iceberg] pan3793 commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-950098038


   > @pan3793 you mentioned on the mailing list that you wanted to see this included in the 0.12.1 release. As it was merged a while ago, can you check whether this PR's concerns are still handled given that we've refactored the Spark module layout?
   > 
   > E.g., that this is still correct in the current `spark/v3.0`? I believe that it is.
   
   The refactor doesn't break this PR's change. Since the change only affects Spark 3.1, I have checked `PlanUtils` and it's correct in `spark/v3.0`. As for `spark/v3.2`, a lot of code has been removed from `PlanUtils`, so this PR's change affects nothing there.




[GitHub] [iceberg] pan3793 commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r721326924



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None
+    }
+  }
+
   def createRepartitionByExpression(
       partitionExpressions: Seq[Expression],
       child: LogicalPlan,
-      numPartitions: Int): RepartitionByExpression = {
+      numPartitions: Option[Int]): RepartitionByExpression = {
     if (Spark3VersionUtil.isSpark30) {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
+      numPartitions match {
+        case Some(num) =>
+          repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(num))
+        case None =>
+          throw new IllegalArgumentException("numPartitions is required before SPARK-32056")

Review comment:
       Updated as you suggested.






[GitHub] [iceberg] pan3793 commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r723406011



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -61,7 +61,8 @@ object PlanUtils {
     if (Spark3VersionUtil.isSpark30) {
       repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
     } else {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Some(numPartitions))
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled

Review comment:
       Updated






[GitHub] [iceberg] rdblue merged pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203


   




[GitHub] [iceberg] rdblue commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r720593233



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None
+    }
+  }
+
   def createRepartitionByExpression(
       partitionExpressions: Seq[Expression],
       child: LogicalPlan,
-      numPartitions: Int): RepartitionByExpression = {
+      numPartitions: Option[Int]): RepartitionByExpression = {
     if (Spark3VersionUtil.isSpark30) {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
+      numPartitions match {
+        case Some(num) =>
+          repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(num))
+        case None =>
+          throw new IllegalArgumentException("numPartitions is required before SPARK-32056")

Review comment:
       It doesn't make sense to me that there is a case where we would allow passing an incorrect value here. Rather than creating the `numShufflePartitions` method, checking for Spark 3.0 in both places, and then adding this check, why not just change the behavior so that the number of shuffle partitions is an int, but is ignored unless this is Spark 3.0?
   
   That seems like a more direct route to the same behavior -- any version other than 3.0 will not set shuffle partitions -- but it doesn't require coordination across methods or unnecessary error cases like this one.
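   
   A minimal sketch of that shape (an assumption about how it could look; `Spark3VersionUtil` and `repartitionByExpressionCtor` stand in for the existing helpers shown in the diff above): `numPartitions` stays a plain `Int` and is simply ignored on anything newer than Spark 3.0.
   
       def createRepartitionByExpression(
           partitionExpressions: Seq[Expression],
           child: LogicalPlan,
           numPartitions: Int): RepartitionByExpression = {
         if (Spark3VersionUtil.isSpark30) {
           // Spark 3.0 requires an explicit partition count.
           repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
         } else {
           // Do not pass numPartitions; on Spark 3.1 (SPARK-32056) AQE can coalesce the shuffle.
           repartitionByExpressionCtor.newInstance(partitionExpressions, child, None)
         }
       }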






[GitHub] [iceberg] rdblue commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r721551323



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -61,7 +61,8 @@ object PlanUtils {
     if (Spark3VersionUtil.isSpark30) {
       repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
     } else {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Some(numPartitions))
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled

Review comment:
       This comment isn't very helpful; it just describes a Spark issue. Could you remove it or update it to something that explains part of the next line? Something like "Do not pass numPartitions because it is set automatically for AQE".






[GitHub] [iceberg] rdblue commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-936538233


   Thanks, @pan3793! I merged this.




[GitHub] [iceberg] RussellSpitzer commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-930185064


   This looks good to me; I only wonder if we have a way to test it ... so I'm +1 unless @aokolnychyi had other plans. Since our new Spark support looks like we are going to be doing version-specific builds, I think we are fine with this here for now.




[GitHub] [iceberg] kbendick commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-949843554


   @pan3793 you mentioned on the mailing list that you wanted to see this included in the 0.12.1 release. As it was merged a while ago, can you check whether this PR's concerns are still handled given that we've refactored the Spark module layout?
   
   E.g., that this is still correct in the current `spark/v3.0`? I believe that it is.




[GitHub] [iceberg] rdblue commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r720853040



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None
+    }
+  }
+
   def createRepartitionByExpression(
       partitionExpressions: Seq[Expression],
       child: LogicalPlan,
-      numPartitions: Int): RepartitionByExpression = {
+      numPartitions: Option[Int]): RepartitionByExpression = {
     if (Spark3VersionUtil.isSpark30) {
-      repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(numPartitions))
+      numPartitions match {
+        case Some(num) =>
+          repartitionByExpressionCtor.newInstance(partitionExpressions, child, Integer.valueOf(num))
+        case None =>
+          throw new IllegalArgumentException("numPartitions is required before SPARK-32056")

Review comment:
       The logic I'm suggesting is a simpler way to get the same behavior here. I think we should go with that for now and update for 3.2 later, since we will need to do so anyway; that isn't changed by my suggestion.






[GitHub] [iceberg] kbendick commented on a change in pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#discussion_r719842927



##########
File path: spark3/src/main/scala/org/apache/spark/sql/catalyst/utils/PlanUtils.scala
##########
@@ -54,14 +55,28 @@ object PlanUtils {
         Integer.TYPE)
       .build()
 
+  def numShufflePartitions(conf: SQLConf): Option[Int] = {
+    if (Spark3VersionUtil.isSpark30) {
+      Some(conf.numShufflePartitions)
+    } else {
+      // SPARK-32056: Coalesce partitions for repartition by expressions when AQE is enabled
+      None

Review comment:
       Do we need to check if AQE is enabled? E.g. if a user disables AQE, should this still return `Some(conf.numShufflePartitions)`?






[GitHub] [iceberg] pan3793 commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-929866158


   cc @RussellSpitzer 




[GitHub] [iceberg] pan3793 commented on pull request #3203: Spark: Leverage SPARK-32056 in Spark 3.1

Posted by GitBox <gi...@apache.org>.
pan3793 commented on pull request #3203:
URL: https://github.com/apache/iceberg/pull/3203#issuecomment-930201196


   > I only wonder if we have a way to test it ... 
   
   Thanks, @RussellSpitzer. Having a quick look, I don't see any catalyst `Rule` test suites in the codebase.
   The change is about performance, not correctness, so I think passing the existing test suites means the change will not break correctness.



