Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2023/08/18 19:24:01 UTC

[GitHub] [spark] MaxGekk opened a new pull request, #42564: [WIP][SPARK-44840][SQL] Make array_insert 1-based for negative indexes

MaxGekk opened a new pull request, #42564:
URL: https://github.com/apache/spark/pull/42564

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #42564: [SPARK-44840][SQL] Make array_insert 1-based for negative indexes

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #42564:
URL: https://github.com/apache/spark/pull/42564#issuecomment-1686030237

   @cloud-fan @peter-toth @srielau @srowen Could you review this PR, please?




[GitHub] [spark] srielau commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300174130


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   The behavior is inconsistent with itself.
   If `array_insert()` (and other functions) were 0-based, then it would make sense that 1 is the second element and -1 the second-to-last element.
   So to fix this function we either have to make it 0-based or change the behavior for negative indexes.
   Arguments against going 0-based:
   1. 1-based was, and remains, a conscious decision.
   2. We can agree (I think) that negative indexes are less common than positive ones.
   3. Our own docs don't actually have negative-index examples, and the instructions in the docs are ambiguous.
   
   Lastly, the function is relatively new, so the blast radius is limited.
   
   Re the proposed (existing) semantic "it goes 'before' the position": I don't think we have anything like that anywhere.
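
   The two negative-index conventions under discussion can be modeled in plain Python (a hypothetical sketch of the semantics, not Spark's implementation; it ignores Spark's null-padding for out-of-range positions):

   ```python
   def array_insert(arr, pos, elem, legacy_negative=False):
       """Sketch of array_insert: 1-based positions, 0 is invalid.

       legacy_negative=True models the pre-change (0-based-from-the-back)
       behavior; False models the behavior proposed in this PR.
       """
       if pos == 0:
           raise ValueError("INVALID_INDEX_OF_ZERO")
       out = list(arr)
       if pos > 0:
           out.insert(pos - 1, elem)             # 1-based from the front
       elif legacy_negative:
           out.insert(len(out) + pos, elem)      # 0-based from the back (old)
       else:
           out.insert(len(out) + pos + 1, elem)  # 1-based from the back (new)
       return out

   # Old behavior: -1 inserts *before* the last element.
   assert array_insert([1, 2, 3], -1, 9, legacy_negative=True) == [1, 2, 9, 3]
   # New behavior: -1 appends, mirroring how 1 prepends.
   assert array_insert([1, 2, 3], 1, 9) == [9, 1, 2, 3]
   assert array_insert([1, 2, 3], -1, 9) == [1, 2, 3, 9]
   ```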






[GitHub] [spark] srowen commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300199429


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   Is this the argument: If I insert at position X, then afterwards, element_at X should be that new element?
   Then 0-based vs 1-based indices are not really relevant; as it happens, everyone agrees that -1 is the last element, not -0. If that's the argument then I agree.
   
   If the semantics are meant to be: insert before position X, then the current behavior is correct.
   
   If the semantics are meant to be: insert before X, or after X if X is negative (i.e. 'before' but in reverse) then the change is also correct. But that doesn't sound right.
   
   The docs say 'at position X' which leads me to believe the first interpretation is what was meant, but it doesn't really relate to 0- or 1-based indices.
   
   I agree the blast radius is small, but that cuts both ways.
   
   Can we really say this is a correctness fix? You'd have to argue that everyone expected the new 'correct' behavior but was getting the old one. But both are plausible, and user code using negative indexes necessarily relies on the current behavior now. 
   
   Is there some other context that makes this urgent for _3.5_ which is already at RC2?
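
   The first interpretation above — insert at position X, then `element_at(X)` returns the new element — can be checked concretely with a plain-Python model of Spark's 1-based `element_at` (an illustrative sketch, not Spark code):

   ```python
   def element_at(arr, pos):
       # Model of Spark SQL's element_at: 1-based, negatives count from the end.
       if pos == 0:
           raise ValueError("INVALID_INDEX_OF_ZERO")
       return arr[pos - 1] if pos > 0 else arr[pos]

   # Proposed semantics: array_insert([1, 2, 3], -1, 9) appends, so the
   # inserted element is what element_at(-1) retrieves afterwards.
   proposed = [1, 2, 3, 9]
   assert element_at(proposed, -1) == 9

   # Old semantics: the same call produced [1, 2, 9, 3], where element_at(-1)
   # is still the original last element, not the one just inserted.
   legacy = [1, 2, 9, 3]
   assert element_at(legacy, -1) == 3
   assert element_at(legacy, -2) == 9
   ```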





[GitHub] [spark] MaxGekk commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1302006261


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   @srowen You are ok to merge it to branch-3.5, correct?





[GitHub] [spark] cloud-fan commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300354938


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   Note: `array_insert` with position parameter 0 fails with `INVALID_INDEX_OF_ZERO`. It's nonsense that you can put an element at the beginning with position 1, but you can't put an element at the end with any negative position. This inconsistency is confusing and sometimes makes the function hard to use.
   
   The reason to fix it in 3.5 is to reduce the impact. We don't want to have one more release (3.5.0) that has the wrong behavior.
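
   The symmetry argument made concrete (a minimal hypothetical model of the fixed behavior, not Spark's implementation): with 1-based indexes on both ends, position 1 prepends, position -1 appends, and 0 is rejected.

   ```python
   def insert_1based(arr, pos, elem):
       # Fixed behavior: 1-based from either end; 0 is invalid.
       if pos == 0:
           raise ValueError("INVALID_INDEX_OF_ZERO")
       i = pos - 1 if pos > 0 else len(arr) + pos + 1
       return arr[:i] + [elem] + arr[i:]

   assert insert_1based([1, 2, 3], 1, 0) == [0, 1, 2, 3]   # 1 prepends ...
   assert insert_1based([1, 2, 3], -1, 4) == [1, 2, 3, 4]  # ... and -1 appends
   ```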





[GitHub] [spark] MaxGekk closed pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes
URL: https://github.com/apache/spark/pull/42564




[GitHub] [spark] srowen commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300359886


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   That's a good observation. Personally I buy the argument that the insert position should be where it can be retrieved subsequently. OK I can see the argument to change it in 3.5.0 to limit affected versions. 





[GitHub] [spark] cloud-fan commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300065665


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   If we treat it as a correctness bug, then it's fine to fix the result by default, but providing a legacy config to fallback.
   
   To me, it does seem like a correctness bug, as `array_insert` uses 1-based positive index, but 0-based negative index.
   
   There are other examples as well: 
   ```
   Since Spark 3.3, the `histogram_numeric` function in Spark SQL returns an output type of an array of structs (x, y), where the type of the 'x' field in the return value is propagated from the input values consumed in the aggregate function. In Spark 3.2 or earlier, 'x' always had double type. Optionally, use the configuration `spark.sql.legacy.histogramNumericPropagateInputType` since Spark 3.3 to revert back to the previous behavior.
   
   In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
   ```
   
   We don't wait for the major version bump to fix the wrong behaviors.





[GitHub] [spark] srowen commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1299996947


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   I don't think we can change behavior in a minor release. The default has to be to keep the current behavior.
   The default could change in 4.0.





[GitHub] [spark] srowen commented on a diff in pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #42564:
URL: https://github.com/apache/spark/pull/42564#discussion_r1300100758


##########
docs/sql-migration-guide.md:
##########
@@ -28,6 +28,7 @@ license: |
 - Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
 - Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
 - Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
+- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.

Review Comment:
   As an aside - I'm still not convinced this is 'incorrect'. The current behavior matches Python's, for example. Is the only other example Snowflake, where inserting at -1 means 'end of list'? If insert's semantics are "insert before position x" then the current behavior seems correct: the last element is element -1, and it ends up in front of it. 
   
   What is the problem, what can't you do, what confusion does the current behavior cause? This has been the behavior for a while and doesn't seem to have been an issue.
   
   We can debate this, but, this is a behavior change, so the bar should be high - especially if suggesting this must happen in a minor version!
   
   The reason I think this is actually more problematic to change is:
   - _anyone_ using negative indices and array_insert will be broken by this change; you _have_ to rely on the 'wrong' behavior to use it, so are expecting it, unlike other correctness bugs
   - The change in behavior is potentially subtle; your query doesn't fail, it just silently starts returning different results, with numbers or strings in a different place than you intend
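
   For reference, Python's built-in `list.insert` uses exactly the 0-based-from-the-back convention mentioned above, so the pre-change behavior did match Python:

   ```python
   xs = [1, 2, 3]
   xs.insert(-1, 9)           # Python inserts before the element at index -1
   assert xs == [1, 2, 9, 3]  # the new element lands second-to-last, not last
   ```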
   





[GitHub] [spark] MaxGekk commented on pull request #42564: [SPARK-44840][SQL] Make `array_insert()` 1-based for negative indexes

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #42564:
URL: https://github.com/apache/spark/pull/42564#issuecomment-1688673650

   Merging to master. The PR causes some conflicts in `branch-3.5`. I will open a separate PR for 3.5.
   Thank you @srielau @cloud-fan @srowen @peter-toth for review.

