You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "thepinetree (via GitHub)" <gi...@apache.org> on 2023/11/15 18:11:12 UTC

[PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

thepinetree opened a new pull request, #43821:
URL: https://github.com/apache/spark/pull/43821

   ### What changes were proposed in this pull request?
   Spark has a (long-standing) overflow bug in the `sequence` expression.
   
   Consider the following operations:
   ```
   spark.sql("CREATE TABLE foo (l LONG);")
   spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
   spark.sql("SELECT sequence(0, l) FROM foo;").collect()
   ```
   
   The result of these operations will be:
   ```
   Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
   ```
   an unintended consequence of overflow.
   
   The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like:
   
   ```
     0111111111111111111111111111111111111111111111111111111111111111
   - 0000000000000000000000000000000000000000000000000000000000000000 
   ------------------------------------------------------------------
     0111111111111111111111111111111111111111111111111111111111111111
   / 0000000000000000000000000000000000000000000000000000000000000001
   ------------------------------------------------------------------
     0111111111111111111111111111111111111111111111111111111111111111
   + 0000000000000000000000000000000000000000000000000000000000000001
   ------------------------------------------------------------------
     1000000000000000000000000000000000000000000000000000000000000000
   ```
   
   The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`.
   
   Other overflows are similarly problematic.
   
   This PR addresses the issue by checking numeric operations in the length computation for overflow.
   
   ### Why are the changes needed?
   There is a correctness bug from overflow in the `sequence` expression.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Tests added in `CollectionExpressionsSuite.scala`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43821:
URL: https://github.com/apache/spark/pull/43821#issuecomment-1819499327

   Thank you, @thepinetree .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #43821: [SPARK-43393][SQL][3.3] Address sequence expression overflow bug.
URL: https://github.com/apache/spark/pull/43821


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43821:
URL: https://github.com/apache/spark/pull/43821#issuecomment-1819501168

   Merged to branch-3.3 for Apache Spark 3.3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

Posted by "thepinetree (via GitHub)" <gi...@apache.org>.
thepinetree commented on PR #43821:
URL: https://github.com/apache/spark/pull/43821#issuecomment-1819481652

   @dongjoon-hyun I think this PR is good to go too, not sure why github actions thinks the tests haven't finished (like the others, only the documentation build had issues)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43393][SQL][3.3] Address sequence expression overflow bug. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43821:
URL: https://github.com/apache/spark/pull/43821#issuecomment-1813032065

   Is GitHub Action triggered on this PR, @thepinetree ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org