You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/22 11:44:53 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #35609: [SPARK-38286][SQL] check Union's maxRows and maxRowsPerPartition

zhengruifeng opened a new pull request #35609:
URL: https://github.com/apache/spark/pull/35609


   ### What changes were proposed in this pull request?
   check Union's maxRows and maxRowsPerPartition
   
   
   ### Why are the changes needed?
   Union's maxRows and maxRowsPerPartition may overflow:
   ```
   scala> val df1 = spark.range(0, Long.MaxValue, 1, 1)
   df1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val df2 = spark.range(0, 100, 1, 10)
   df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val union = df1.union(df2)
   union: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> union.queryExecution.logical.maxRowsPerPartition
   res19: Option[Long] = Some(-9223372036854775799)
   
   scala> union.queryExecution.logical.maxRows
   res20: Option[Long] = Some(-9223372036854775709)
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   added testsuite


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #35609:
URL: https://github.com/apache/spark/pull/35609#discussion_r813463402



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -277,11 +277,18 @@ case class Union(
   assert(!allowMissingCol || byName, "`allowMissingCol` can be true only if `byName` is true.")
 
   override def maxRows: Option[Long] = {
-    if (children.exists(_.maxRows.isEmpty)) {

Review comment:
       since 2.0, https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L190-L197




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #35609: [SPARK-38286][SQL] check Union's maxRows and maxRowsPerPartition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #35609:
URL: https://github.com/apache/spark/pull/35609#issuecomment-1047713774


   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35609:
URL: https://github.com/apache/spark/pull/35609#discussion_r812966259



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -277,11 +277,18 @@ case class Union(
   assert(!allowMissingCol || byName, "`allowMissingCol` can be true only if `byName` is true.")
 
   override def maxRows: Option[Long] = {
-    if (children.exists(_.maxRows.isEmpty)) {

Review comment:
       do you know which Spark version we start to have `Union.maxRows`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35609:
URL: https://github.com/apache/spark/pull/35609#issuecomment-1048857734


   cc @sigmod 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35609:
URL: https://github.com/apache/spark/pull/35609


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #35609:
URL: https://github.com/apache/spark/pull/35609#issuecomment-1048504368


   multi-table join may generate large `maxRows`, and union those joins may cause `maxRows` overflow, which may result in unexpected behavior in optimizer.
   
   this also seems a correctness issue.  @cloud-fan 
   
   
   the failed doc test also exist in other PRs, should be irrelative


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35609: [SPARK-38286][SQL] Union's maxRows and maxRowsPerPartition may overflow

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35609:
URL: https://github.com/apache/spark/pull/35609#issuecomment-1049444309


   thanks, merging to master/3.2/3.1/3.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org