You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/06 12:24:04 UTC

[GitHub] [spark] cloud-fan opened a new pull request #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

cloud-fan opened a new pull request #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834
 
 
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.
   
   ### Does this PR introduce any user-facing change?
   <!--
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If no, write 'No'.
   -->
   Yes, change the behavior of `size(null)` back to be the same as 2.4.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   N/A

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595901992
 
 
   I prefer the AS-IS status of 3.0, but this seems to fall into the same category where @marmbrus and @gatorsmile asked reverting on the behavior fix at two-parameter `TRIM` function.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747514
 
 
   **[Test build #119465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119465/testReport)** for PR 27834 at commit [`9ebf183`](https://github.com/apache/spark/commit/9ebf183e0b25a93c3c0ec2439f7005600f6957e9).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598548623
 
 
   For me, it looks better, @HeartSaVioR . Please make a PR for that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598550554
 
 
   Since it has already `legacy` in the name, `spark.sql.legacy.sizeOfNull`, `true` just means preserving the old behavior, whatever it was. So, in that sense, the current name is also okay.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595865236
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595901992
 
 
   I prefer the AS-IS status of 3.0, but this seems to fall into the same category where @marmbrus and @gatorsmile asked reverting on the behavior fix at two-parameter `TRIM` function.
   
   Technically, two-parameter `TRIM` was a correctness fix. However, we reverted that by preserving the old behavior because two-parameter `TRIM` was not a SQL standard.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-596154932
 
 
   It doesn't seem to be just a cosmetic change, as far as we hear actual use case being affected by. It would be OK to let it be -1 for non-aggregate given -1 can be still differentiated with valid values, but for aggregations the value of -1 is being handled as "valid" values silently, and provides correctness issue. That would require "pre-process" (having a column for the result of `size(col)`, but change to NULL if the value is -1) or hacky workaround.
   
   I feel we should have clear reason to have -1 for the return value, what benefits we get from having it to -1. At least they should be a kind of "trade-off" if we would like to decide and take one - if it doesn't even a trade-off, it's clearly a correctness issue we should fix.
   
   @ssimeonov 
   Could you please share the details on workaround you took?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595865236
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747942
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598639014
 
 
   Ah OK, you were faster. Thanks for the info.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747942
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595895765
 
 
   Thanks @MaxGekk. Yes, after discovering this, we had to comb from our entire codebase to refactor `size(column)` into `column.size`, adding our own implicit. 
   
   Taking a step back, a `-1` return from a function reeks of non-SQL thinking by whoever implemented this in Hive originally. As Spark grows, its audience is expanding rapidly. We need to move away from random engineer decisions made many years ago in a different product towards what makes sense for the future (much larger) audience of Spark users.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595864346
 
 
   **[Test build #119465 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119465/testReport)** for PR 27834 at commit [`9ebf183`](https://github.com/apache/spark/commit/9ebf183e0b25a93c3c0ec2439f7005600f6957e9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24197/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595751507
 
 
   > returning -1 looks reasonable as well.
   
   For the full picture,  -1 may lead to wrong query results. Here is the example I got from @ssimeonov :
   
   "A client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement and post-click user engagements were in an array of structs. The culprit was **df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...)**, which subtracted 1 for every click without post-click engagement. Luckily, the behavior led to negative engagement counts in some periods, which alerted them to the problem and this bizarre behavior."
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan edited a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
cloud-fan edited a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595745817
 
 
   cc @rxin @MaxGekk @gatorsmile @HyukjinKwon @dongjoon-hyun 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ssimeonov commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
ssimeonov commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598561408
 
 
   We made our workaround long before this made it to Spark. After all the code was changed, it was easier to keep working in our own framework, ignoring all the mess of settings.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598538807
 
 
   Btw, the configuration name `spark.sql.legacy.sizeOfNull` doesn't seem to be intuitive. It doesn't contain the information how it will behave. It's more confusing because now the default is true even the configuration is to retain legacy behavior.
   
   How about renaming it as `spark.sql.legacy.sizeOfNullAsMinusOne` and set it to true (so that no translation is needed)? Personally I prefer `spark.sql.legacy.sizeOfNullAsNull` and set it to false (so that we need to reverse in translation) so either is fine for me. Once we make configuration be intuitive we could guide to use it instead.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595745817
 
 
   cc @rxin @MaxGekk @gatorsmile @HyukjinKwon 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598549382
 
 
   Oh..

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598559988
 
 
   Oh OK. I'm not sure I love to have `legacy` configurations which would lead end users to remember the old behavior (it would be nice to have its right name, and we can still control the behavior via default value), but given it has been one of the conventions, looks OK for now.
   
   One thing we may want to have attention is that, @ssimeonov did their own workaround while the behavior can be simply changed via touching the legacy config.
   
   @ssimeonov Did you indicate the existence of legacy config? If then, could you please elaborate why you didn't leverage the config? Thanks in advance!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595865257
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119465/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-596159602
 
 
   @HeartSaVioR the workaround was simple, unpleasant and unstable:
   
   - **Simple**, because we realized that we should never use `size()` as its behavior is (a) inconsistent with SQL principles and (b) dangerous for analytics.
   
   - **Unpleasant**, because it involved replacing every single use of the function in our codebase with a `Column` implicit that we created (`size(x)` -> `x.size`). This was easy for IDE-managed code but hard for notebook code. After consideration, we decided that our implicit would return 0 for `null`, because in every single use case we could think of for our data, we would be converting null returns to 0. (I would not recommend this for Spark as the behavior is inconsistent with SQL.) This is also the reason why we went with an implicit as opposed to, say, using import order to hide `functions.size()` with our own `size()`. Caveat: if we had more people using SparkSQL, we would have gone with `size0(x)` and `sizen(x)` instead of an implicit, for version that return 0 or null for null input.
   
   - **Unstable**, because there is no way to guarantee the problem will not happen again due to user error. It's part of our DOs and DONTs of Spark not to use `size()` but discipline alone does not offer guaranteed protection. The only way to guarantee the behavior is via a change to Spark. A settings-driven change is an excellent way to have the best of both worlds: sane & consistent behavior for future users and an easy backward compatibility mode for others.
   
   As for the idea to have the `size()` behavior differ for aggregation vs. non-aggregation processing, I think this would be a bad pattern to introduce into an already very complex system such as Spark.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595865257
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119465/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747514
 
 
   **[Test build #119465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119465/testReport)** for PR 27834 at commit [`9ebf183`](https://github.com/apache/spark/commit/9ebf183e0b25a93c3c0ec2439f7005600f6957e9).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598559988
 
 
   Oh OK. I'm not sure I agree to have `legacy` configurations which would lead end users to remember the old behavior (it would be nice to have its right name, and we can still control the behavior via default value), but given it has been one of the conventions, looks OK for now.
   
   One thing we may want to have attention is that, @ssimeonov did their own workaround while the behavior can be simply changed via touching the legacy config.
   
   @ssimeonov Did you indicate the existence of legacy config? If then, could you please elaborate why you didn't leverage the config? Thanks in advance!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-598549594
 
 
   This exists at 2.4.x age. It seems that it's difficult to change.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-596161322
 
 
   > As for the idea to have the size() behavior differ for aggregation vs. non-aggregation processing
   
   No I didn't mean it as an idea. I used it to describe the issue where NULL as result wouldn't have.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595897554
 
 
   Thank you for pinging me, @cloud-fan .

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-597747132
 
 
   Thank you all for making and reviewing this PR. Since the vote finished, I'll merge this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-595747946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24197/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27834: [SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-597754180
 
 
   cc @rxin since he is a release manager for 3.0.0. (Also, cc @gatorsmile )

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org