You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "thepinetree (via GitHub)" <gi...@apache.org> on 2023/05/06 03:49:32 UTC

[GitHub] [spark] thepinetree opened a new pull request, #41072: [WIP][SPARK-43393][SQL] Address sequence expression overflow bug.

thepinetree opened a new pull request, #41072:
URL: https://github.com/apache/spark/pull/41072

   ### What changes were proposed in this pull request?
   Spark has a (long-standing) overflow bug in the `sequence` expression.
   
   Consider the following operations:
   ```
   spark.sql("CREATE TABLE foo (l LONG);")
   spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
   spark.sql("SELECT sequence(0, l) FROM foo;").collect()
   ```
   
   The result of these operations will be:
   ```
   Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
   ```
   an unintended consequence of overflow.
   
   The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like:
   
   ```
     0111111111111111111111111111111111111111111111111111111111111111
   - 0000000000000000000000000000000000000000000000000000000000000000 
   ------------------------------------------------------------------
     0111111111111111111111111111111111111111111111111111111111111111
   / 0000000000000000000000000000000000000000000000000000000000000001
   ------------------------------------------------------------------
     0111111111111111111111111111111111111111111111111111111111111111
   + 0000000000000000000000000000000000000000000000000000000000000001
   ------------------------------------------------------------------
     1000000000000000000000000000000000000000000000000000000000000000
   ```
   
   The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`.
   
   Other overflows are similarly problematic.
   
   This PR addresses the issue by checking numeric operations in the length computation for overflow.
   
   ### Why are the changes needed?
   There is a correctness bug from overflow in the `sequence` expression.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Tests added in `CollectionExpressionsSuite.scala`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1813023418

   @dongjoon-hyun @cloud-fan 
   
   Backport PRs:
   * 3.3: https://github.com/apache/spark/pull/43821
   * 3.4: https://github.com/apache/spark/pull/43819
   * 3.5: https://github.com/apache/spark/pull/43820


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1189490554


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3448,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      require(
+        len <= MAX_ROUNDED_ARRAY_LENGTH,
+        s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+      )
+      len.toInt
+    } catch {
+      // We handle overflows in the previous try block by raising an appropriate exception.
+      case _: ArithmeticException =>
+        val safeLen =
+          BigInt(1) + (BigInt(stop.toLong) - BigInt(start.toLong)) / BigInt(estimatedStep.toLong)
+        require(
+          safeLen <= MAX_ROUNDED_ARRAY_LENGTH,
+          s"Too long sequence: $safeLen. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+        )
+        throw new RuntimeException("Unreachable code reached.")

Review Comment:
   if it's not reachable, let's throw `SparkException.internalError`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3448,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      require(
+        len <= MAX_ROUNDED_ARRAY_LENGTH,
+        s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+      )
+      len.toInt
+    } catch {
+      // We handle overflows in the previous try block by raising an appropriate exception.
+      case _: ArithmeticException =>
+        val safeLen =
+          BigInt(1) + (BigInt(stop.toLong) - BigInt(start.toLong)) / BigInt(estimatedStep.toLong)
+        require(

Review Comment:
   let's use explicit `if` to throw the error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1537619872

   cc @gengliangwang FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1392240967


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3122,6 +3122,34 @@ case class Sequence(
 }
 
 object Sequence {
+  private def prettyName: String = "sequence"
+
+  def sequenceLength(start: Long, stop: Long, step: Long): Int = {
+    try {
+      val delta = Math.subtractExact(stop, start)
+      if (delta == Long.MinValue && step == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / step))
+      if (len > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+        throw QueryExecutionErrors.createArrayWithElementsExceedLimitError(prettyName, len)
+      }
+      len.toInt
+    } catch {
+      // We handle overflows in the previous try block by raising an appropriate exception.
+      case _: ArithmeticException =>
+        val safeLen =
+          BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
+        if (safeLen > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {

Review Comment:
   maybe just use an assert? Assertion error is also treated as internal errors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1239342677


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3449,30 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {

Review Comment:
   is it possible to add a util function for the code below? then codegen can just call it. we can put the util function in `object Sequence`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1812937385

   Oh this seems to break branch-3.5.
   - https://github.com/apache/spark/actions/runs/6873765275
   
   Let me revert this from branch-3.5. Given the situation, we can start backport from branch-3.5 to branch-3.3 as three separate PRs, @thepinetree .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1189490134


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3448,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      require(
+        len <= MAX_ROUNDED_ARRAY_LENGTH,
+        s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+      )
+      len.toInt
+    } catch {
+      // We handle overflows in the previous try block by raising an appropriate exception.
+      case _: ArithmeticException =>
+        val safeLen =
+          BigInt(1) + (BigInt(stop.toLong) - BigInt(start.toLong)) / BigInt(estimatedStep.toLong)
+        require(

Review Comment:
   let's use explicit `if` to throw the error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] thepinetree commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1223723240


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3449,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      if (len > MAX_ROUNDED_ARRAY_LENGTH) {
+        throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +

Review Comment:
   Ah, I originally misunderstood your comment and with @ankurdave's help I learned about the errors defined in `QueryExecutionErrors.scala` and that they are actually used in this file. I thought `createArrayWithElementsExceedLimitError` might be the most fitting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1811877484

   thanks, merging to master/3.5!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1239343325


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3449,30 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {

Review Comment:
   The util function can just take long type inputs, and caller side is responsible to turn `start`/`stop`/`estimatedStep` to long.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.
URL: https://github.com/apache/spark/pull/41072


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.
URL: https://github.com/apache/spark/pull/41072


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] thepinetree commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1210959297


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3449,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      if (len > MAX_ROUNDED_ARRAY_LENGTH) {
+        throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +

Review Comment:
   Sorry for the delay. What is the reasoning behind this change? It seems that all errors thrown in this file prefer Java defined exceptions over their Spark wrappers in `SparkException.scala`.
   
   If the decision is to use `Spark` wrappers for this expression only, is `SparkIllegalArgumentException` the right wrapper?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1392242306


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3509,20 +3531,15 @@ object Sequence {
       step: String,
       estimatedStep: String,
       len: String): String = {
-    val longLen = ctx.freshName("longLen")
+    val calcFn = "Sequence.sequenceLength"

Review Comment:
   I'm surprised that this works. Should we do `classOf[Sequence].getName` + ".sequenceLength"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1814835824

   Hi @dongjoon-hyun! Yes, sorry, I thought they'd be a simple back port apart from a simple import conflict. I'll fix the compilation errors appropriately when I have a chance tonight.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1195144058


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3448,13 +3449,32 @@ object Sequence {
         || (estimatedStep == num.zero && start == stop),
       s"Illegal sequence boundaries: $start to $stop by $step")
 
-    val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / estimatedStep.toLong
-
-    require(
-      len <= MAX_ROUNDED_ARRAY_LENGTH,
-      s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
-
-    len.toInt
+    try {
+      val delta = Math.subtractExact(stop.toLong, start.toLong)
+      if (delta == Long.MinValue && estimatedStep.toLong == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
+      if (len > MAX_ROUNDED_ARRAY_LENGTH) {
+        throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +

Review Comment:
   If the exception is an user-facing error, let's introduce an error class, and raise a Spark exception w/ it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #41072: [SPARK-43393][SQL] Address sequence expression overflow bug.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1742252815

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1813024170

   Thank you so much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1815098822

   Thank you so much, @thepinetree !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1814818350

   Gentle ping, @thepinetree . All backporting PRs are broken at the compilation stage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1817222415

   quick update @dongjoon-hyun, looks like the 3.4/3.5 backports should be good to go after some flakiness is resolved (documentation and one spark connect suite).
   
   3.4/3.5 had an older version of the errors (same across both versions) and I made changes accordingly.
   3.3 had even older error definitions and needed some more changes on top of that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1813257544

   Could you fix the compilation of your PRs, @thepinetree ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1392991421


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3122,6 +3122,34 @@ case class Sequence(
 }
 
 object Sequence {
+  private def prettyName: String = "sequence"
+
+  def sequenceLength(start: Long, stop: Long, step: Long): Int = {
+    try {
+      val delta = Math.subtractExact(stop, start)
+      if (delta == Long.MinValue && step == -1L) {
+        // We must special-case division of Long.MinValue by -1 to catch potential unchecked
+        // overflow in next operation. Division does not have a builtin overflow check. We
+        // previously special-case div-by-zero.
+        throw new ArithmeticException("Long overflow (Long.MinValue / -1)")
+      }
+      val len = if (stop == start) 1L else Math.addExact(1L, (delta / step))
+      if (len > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+        throw QueryExecutionErrors.createArrayWithElementsExceedLimitError(prettyName, len)
+      }
+      len.toInt
+    } catch {
+      // We handle overflows in the previous try block by raising an appropriate exception.
+      case _: ArithmeticException =>
+        val safeLen =
+          BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
+        if (safeLen > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {

Review Comment:
   I personally like the current exception better since it's more descriptive of the actual problem -- trying to create too large an array (with the user's intended size) and what the limit is. If strong opinion, I can change to an assertion.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3509,20 +3531,15 @@ object Sequence {
       step: String,
       estimatedStep: String,
       len: String): String = {
-    val longLen = ctx.freshName("longLen")
+    val calcFn = "Sequence.sequenceLength"

Review Comment:
   Not sure exactly how, but my hypothesis is that these factors all play a role:
   * Sequence object ends up in the same compilation unit as this function (not sure if this is expected)
   * `sequenceLength` function is effectively static in Scala and publicly accessible
   
   I noticed some other functions in this file do this as well -- e.g. https://github.com/apache/spark/blob/128f5523194d5241c7b0f08b5be183288128ba16/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L613
   
   I do like your suggestion better though, cleaner and easier to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on code in PR #41072:
URL: https://github.com/apache/spark/pull/41072#discussion_r1392988744


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##########
@@ -3509,20 +3531,15 @@ object Sequence {
       step: String,
       estimatedStep: String,
       len: String): String = {
-    val longLen = ctx.freshName("longLen")
+    val calcFn = "Sequence.sequenceLength"

Review Comment:
   Not sure exactly how, but my hypothesis is that these factors all play a role:
   * Sequence object ends up in the same compilation unit as this function (not sure if this is expected)
   * `sequenceLength` function is effectively static in Scala and publicly accessible
   
   I noticed some other functions in this file do this as well -- e.g. https://github.com/apache/spark/blob/128f5523194d5241c7b0f08b5be183288128ba16/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L613
   
   I do like this better though, cleaner and easier to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1811916481

   Thank you, @thepinetree and @cloud-fan . Given that this is a long-standing overflow bug, do you think we can have this fix in other live release branches, `branch-3.4` and `branch-3.3`? Especially, I'm interested in `branch-3.4` as the release manager of Apache Spark 3.4.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL] Address sequence expression overflow bug. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41072:
URL: https://github.com/apache/spark/pull/41072#issuecomment-1811989510

   SGTM. @thepinetree can you help to create backport PRs? thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org