Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2020/02/14 19:35:25 UTC

[DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Hi, All.

I'm sending this email because the Apache Spark committers should have a
consistent point of view for the upcoming PRs. A clear community policy is
the way to lead community members transparently and clearly for the
long-term good.

First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
Spark 3.0.0 is going to deliver many improvements in SQL areas.

In particular, we invested a lot of effort in improving SQL ANSI compliance
and the related test coverage. One simple example is the following.

    [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

With this support, we are able to move away from the esoteric Spark
functions `TRIM/LTRIM/RTRIM`.
(Note that the above syntax is ANSI standard, but we also have these
additional non-standard functions.)

Here is the summary of the current status.

    +------------------------+-------------+---------------+
    | SQL Processing Engine  | TRIM Syntax | TRIM Function |
    +------------------------+-------------+---------------+
    | Spark 3.0.0-preview/2  |      O      |       O       |
    | PostgreSQL             |      O      |       O       |
    | Presto                 |      O      |       O       |
    | MySQL                  |      O(*)   |       X       |
    | Oracle                 |      O      |       X       |
    | MsSQL                  |      O      |       X       |
    | Teradata               |      O      |       X       |
    | Hive                   |      X      |       X       |
    | Spark 1.6 ~ 2.2        |      X      |       X       |
    | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
    +------------------------+-------------+---------------+
    (*) MySQL doesn't allow multiple trim characters.
    (**) Spark 2.3 ~ 2.4 provide the function with a different parameter order.

Here is an illustrative example of the problem.

    postgres=# SELECT trim('yxTomxx', 'xyz');
    btrim
    -------
    Tom

    presto:default> SELECT trim('yxTomxx', 'xyz');
    _col0
    -------
    Tom

    spark-sql> SELECT trim('yxTomxx', 'xyz');
    z
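The discrepancy can be reproduced outside Spark. Below is a hedged sketch in
plain Python (an illustration, not Spark code); Python's `str.strip(chars)`
treats its argument as a set of characters to remove from both ends, which
matches the multi-character trim-set semantics of these engines.

```python
# Emulate the two competing parameter orders of a two-argument trim.
# str.strip(chars) removes any character in `chars` from both ends.

def trim_pg(src: str, trim_chars: str) -> str:
    """PostgreSQL/Presto order: trim(srcStr, trimStr)."""
    return src.strip(trim_chars)

def trim_spark2(trim_chars: str, src: str) -> str:
    """Spark 2.3 ~ 2.4 order: trim(trimStr, srcStr)."""
    return src.strip(trim_chars)

# PostgreSQL/Presto read trim('yxTomxx', 'xyz') as srcStr='yxTomxx':
print(trim_pg('yxTomxx', 'xyz'))      # Tom

# Spark 2.3 ~ 2.4 read the same call as srcStr='xyz', so the leading
# 'x' and 'y' are stripped from 'xyz', leaving only 'z':
print(trim_spark2('yxTomxx', 'xyz'))  # z
```

Same call, opposite reading of the arguments: that is why spark-sql returns
`z` where the other engines return `Tom`.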

Here is the history of our attempts to fix the above issue.

    [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
    1. https://github.com/apache/spark/pull/24902
       (Merged 2019-06-18 for v3.0.0; released in 3.0.0-preview and
3.0.0-preview2.)
    2. https://github.com/apache/spark/pull/24907
       (Merged 2019-06-20 for v2.3.4, but reverted)
    3. https://github.com/apache/spark/pull/24908
       (Merged 2019-06-21 for v2.4.4, but reverted)

(2) and (3) were reverted before the releases because we didn't want to fix
this in maintenance releases. Please see the following references for that
decision.

    https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
    https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)

Now, there are some requests to revert SPARK-28093 and to keep these
esoteric functions for backward compatibility and the following reason.

    > Reordering function parameters to match another system,
    > for a method that is otherwise working correctly,
    > sounds exactly like a cosmetic change to me.

    > How can we silently change the parameter of an existing SQL function?
    > I don't think this is a correctness issue as the SQL standard
    > doesn't say that the function signature has to be trim(srcStr, trimStr).

These concerns and points of view make sense.

My concerns are the following.

    1. This kind of esoteric difference is known, negatively, as
`vendor lock-in`.
       - It's difficult for new users to understand.
       - It's hard to migrate between Apache Spark and the other systems.
    2. Although we did our best, Apache Spark SQL has not always been
sufficient.
    3. We need to keep improving it in future releases.

In short, we can keep the 3.0.0-preview behavior here, or revert SPARK-28093
in order to keep this vendor-lock-in behavior for backward compatibility.

What I want is to build a consistent point of view in this category for the
upcoming PR reviews.

If we have a clear policy, we can save future community efforts in many
ways.

Bests,
Dongjoon.

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Michael Armbrust <mi...@databricks.com>.
This plan for evolving the TRIM function to be more standards compliant
sounds much better to me than the original change to just switch the order.
It pushes users in the right direction and cleans up our tech debt without
silently breaking existing workloads. It means that programs won't return
different results when run on Spark 2.x and Spark 3.x.

One caveat:

> If we keep this situation in 3.0.0 release (a major release), it means
> Apache Spark will be forever.

I think this ship has already sailed. There is nothing special about 3.0
here. If the API is in a released version of Spark, then the mistake is
already made.

Major releases are an opportunity to break APIs when we *have to*. We
always strive to avoid breaking APIs even if they have not been in an X.0
release.

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Dongjoon Hyun <do...@gmail.com>.
Yes, right. We need to choose the best way case by case.
The following level of `Runtime warning` also sounds reasonable to me.

> Maybe we should only log the warning once per Spark application.

Bests,
Dongjoon.

On Tue, Feb 18, 2020 at 1:13 AM Wenchen Fan <cl...@gmail.com> wrote:

> I don't know what's the best way to deprecate an SQL function. Runtime
> warning can be annoying if it keeps coming out. Maybe we should only log
> the warning once per Spark application.
>
> On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Thank you for feedback, Wenchen, Maxim, and Takeshi.
>>
>>     1. "we would also promote the SQL standard TRIM syntax"
>>     2. "we could output a warning with a notice about the order of
>> parameters"
>>     3. "it looks nice to make these (two parameters) trim functions
>> unrecommended in future releases"
>>
>> Yes, in case of reverting SPARK-28093, we had better deprecate these
>> two-parameter SQL function invocations. It's because this is really
>> esoteric and even inconsistent inside Spark (Scala API also follows
>> PostgresSQL/Presto style like the following.)
>>
>>     def trim(e: Column, trimString: String)
>>
>> If we keep this situation in 3.0.0 release (a major release), it means
>> Apache Spark will be forever.
>> And, it becomes worse and worse because it leads more users to fall into
>> this trap.
>>
>> There are a few deprecation ways. I believe (3) can be a proper next step
>> in case of reverting because (2) is infeasible and (1) is considered
>> insufficient because it's silent when we do SPARK-28093. We need non-silent
>> (noisy) one in this case. Technically, (3) can be done in
>> `Analyzer.ResolveFunctions`.
>>
>>     1. Documentation-only: Removing example and add migration guide
>>     2. Compile-time warning by annotation: This is not an option for SQL
>> function in SQL string.
>>     3. Runtime warning with a directional guide
>>        (log.warn("... USE TRIM(trimStr FROM str) INSTEAD")
>>
>> How do you think about (3)?
>>
>> Bests,
>> Dongjoon.
>>
>> On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <li...@gmail.com>
>> wrote:
>>
>>> The revert looks fine to me for keeping the compatibility.
>>> Also, I think the different orders between the systems easily lead to
>>> mistakes, so, as Wenchen suggested, it looks nice to make these
>>> (two-parameter) trim functions unrecommended in future releases:
>>> https://github.com/apache/spark/pull/27540#discussion_r377682518
>>> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <ma...@databricks.com>
>>> wrote:
>>>
>>>> Also, if we look at the possible combinations of trim parameters:
>>>> 1. foldable srcStr + foldable trimStr
>>>> 2. foldable srcStr + non-foldable trimStr
>>>> 3. non-foldable srcStr + foldable trimStr
>>>> 4. non-foldable srcStr + non-foldable trimStr
>>>>
>>>> Case #2 seems rare, and #3 is probably the most common case. Once we
>>>> see case #2, we could output a warning with a notice about the order
>>>> of parameters.
>>>>
>>>> Maxim Gekk
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks, Inc.
>>>>
>>>>
>>>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> It's unfortunate that we don't have a clear document to talk about
>>>>> breaking changes (I'm working on it BTW). I believe the general guidance
>>>>> is: *avoid breaking changes unless we have to*. For example, the
>>>>> previous result was so broken that we had to fix it, moving to Scala 2.12
>>>>> forced us to break some APIs, etc.
>>>>>
>>>>> For this particular case, do we have to switch the parameter order? Yes,
>>>>> it differs from some systems, and the order was not decided explicitly,
>>>>> but I don't think these are strong reasons. People coming from an RDBMS
>>>>> should use the SQL standard TRIM syntax more often. People using prior
>>>>> Spark versions should already have figured out the parameter order of
>>>>> Spark TRIM (there was no documentation) and adjusted their queries. There
>>>>> is no standard that defines the parameter order of the TRIM function.
>>>>>
>>>>> In the long term, we would also promote the SQL standard TRIM syntax.
>>>>> I don't see a benefit of "fixing" the parameter order that is worth a
>>>>> breaking change.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>>>> and the TRIM(trimStr FROM str) syntax.
>>>>>>
>>>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Wenchen Fan <cl...@gmail.com>.
I don't know the best way to deprecate a SQL function. A runtime warning
can be annoying if it keeps coming out. Maybe we should only log the
warning once per Spark application.

On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thank you for feedback, Wenchen, Maxim, and Takeshi.
>
>     1. "we would also promote the SQL standard TRIM syntax"
>     2. "we could output a warning with a notice about the order of
> parameters"
>     3. "it looks nice to make these (two parameters) trim functions
> unrecommended in future releases"
>
> Yes, in case of reverting SPARK-28093, we had better deprecate these
> two-parameter SQL function invocations. It's because this is really
> esoteric and even inconsistent inside Spark (Scala API also follows
> PostgresSQL/Presto style like the following.)
>
>     def trim(e: Column, trimString: String)
>
> If we keep this situation in 3.0.0 release (a major release), it means
> Apache Spark will be forever.
> And, it becomes worse and worse because it leads more users to fall into
> this trap.
>
> There are a few deprecation ways. I believe (3) can be a proper next step
> in case of reverting because (2) is infeasible and (1) is considered
> insufficient because it's silent when we do SPARK-28093. We need non-silent
> (noisy) one in this case. Technically, (3) can be done in
> `Analyzer.ResolveFunctions`.
>
>     1. Documentation-only: Removing example and add migration guide
>     2. Compile-time warning by annotation: This is not an option for SQL
> function in SQL string.
>     3. Runtime warning with a directional guide
>        (log.warn("... USE TRIM(trimStr FROM str) INSTEAD")
>
> How do you think about (3)?
>
> Bests,
> Dongjoon.
>
> On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> The revert looks fine to me for keeping the compatibility.
>> Also, I think the different orders between the systems easily lead to
>> mistakes, so, as Wenchen suggested, it looks nice to make these
>> (two-parameter) trim functions unrecommended in future releases:
>> https://github.com/apache/spark/pull/27540#discussion_r377682518
>> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>>
>> Bests,
>> Takeshi
>>
>>
>> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <ma...@databricks.com>
>> wrote:
>>
>>> Also if we look at possible combination of trim parameters:
>>> 1. foldable srcStr + foldable trimStr
>>> 2. foldable srcStr + non-foldable trimStr
>>> 3. non-foldable srcStr + foldable trimStr
>>> 4. non-foldable srcStr + non-foldable trimStr
>>>
>>> The case # 2 seems a rare case, and # 3 is probably the most common
>>> case. Once we see the second case, we could output a warning with a notice
>>> about the order of parameters.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>>
>>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> It's unfortunate that we don't have a clear document to talk about
>>>> breaking changes (I'm working on it BTW). I believe the general guidance
>>>> is: *avoid breaking changes unless we have to*. For example, the
>>>> previous result was so broken that we have to fix it, moving to Scala 2.12
>>>> makes us have to break some APIs, etc.
>>>>
>>>> For this particular case, do we have to switch the parameter order?
>>>> It's different from some systems, the order was not decided explicitly, but
>>>> I don't think they are strong reasons. People from RDBMS should use the SQL
>>>> standard TRIM syntax more often. People using prior Spark versions should
>>>> have figured out the parameter order of Spark TRIM (there was no document)
>>>> and adjusted their queries. There is no such standard that defines the
>>>> parameter order of the TRIM function.
>>>>
>>>> In the long term, we would also promote the SQL standard TRIM syntax. I
>>>> don't see much benefit of "fixing" the parameter order that worth to make a
>>>> breaking change.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>>
>>>>
>>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>>> and the TRIM(trimStr FROM str) syntax.
>>>>>
>>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for feedback, Wenchen, Maxim, and Takeshi.

    1. "we would also promote the SQL standard TRIM syntax"
    2. "we could output a warning with a notice about the order of
parameters"
    3. "it looks nice to make these (two parameters) trim functions
unrecommended in future releases"

Yes, in the case of reverting SPARK-28093, we had better deprecate these
two-parameter SQL function invocations, because they are really esoteric
and even inconsistent inside Spark (the Scala API also follows the
PostgreSQL/Presto style, like the following):

    def trim(e: Column, trimString: String)

If we keep this situation in the 3.0.0 release (a major release), it means
Apache Spark will carry it forever.
And it becomes worse and worse, because it leads more users to fall into
this trap.

There are a few ways to deprecate. I believe (3) can be a proper next step
in the case of reverting, because (2) is infeasible and (1) is considered
insufficient because it's silent. We need a non-silent (noisy) one in this
case. Technically, (3) can be done in `Analyzer.ResolveFunctions`.

    1. Documentation-only: remove the examples and add a migration guide.
    2. Compile-time warning by annotation: this is not an option for a SQL
       function in a SQL string.
    3. Runtime warning with a directional guide, e.g.
       log.warn("... USE TRIM(trimStr FROM str) INSTEAD")

What do you think about (3)?
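A minimal sketch of option (3), again in plain Python rather than Spark's
actual analyzer code: when a two-argument call is recognized at resolution
time, emit a directional message that shows the ANSI rewrite. The helper
name and the message wording are illustrative assumptions.

```python
# Build the directional guide for a deprecated two-argument call.
# Argument order follows the Spark 2.3 ~ 2.4 reading: trim(trimStr, srcStr).
_KEYWORD = {"trim": "BOTH", "ltrim": "LEADING", "rtrim": "TRAILING"}

def deprecation_hint(func: str, trim_str: str, src_str: str) -> str:
    """Map the deprecated call to its ANSI TRIM(... FROM ...) equivalent."""
    keyword = _KEYWORD[func.lower()]
    return (f"`{func}('{trim_str}', '{src_str}')` is deprecated; "
            f"USE TRIM({keyword} '{trim_str}' FROM '{src_str}') INSTEAD.")

print(deprecation_hint("trim", "xyz", "yxTomxx"))
# `trim('xyz', 'yxTomxx')` is deprecated; USE TRIM(BOTH 'xyz' FROM 'yxTomxx') INSTEAD.
```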

Bests,
Dongjoon.

On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <li...@gmail.com>
wrote:

> The revert looks fine to me for keeping the compatibility.
> Also, I think the different orders between the systems easily lead to
> mistakes, so, as Wenchen suggested, it looks nice to make these
> (two-parameter) trim functions unrecommended in future releases:
> https://github.com/apache/spark/pull/27540#discussion_r377682518
> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>
> Bests,
> Takeshi
>
>
> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <ma...@databricks.com>
> wrote:
>
>> Also if we look at possible combination of trim parameters:
>> 1. foldable srcStr + foldable trimStr
>> 2. foldable srcStr + non-foldable trimStr
>> 3. non-foldable srcStr + foldable trimStr
>> 4. non-foldable srcStr + non-foldable trimStr
>>
>> The case # 2 seems a rare case, and # 3 is probably the most common case.
>> Once we see the second case, we could output a warning with a notice about
>> the order of parameters.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> It's unfortunate that we don't have a clear document to talk about
>>> breaking changes (I'm working on it BTW). I believe the general guidance
>>> is: *avoid breaking changes unless we have to*. For example, the
>>> previous result was so broken that we have to fix it, moving to Scala 2.12
>>> makes us have to break some APIs, etc.
>>>
>>> For this particular case, do we have to switch the parameter order? It's
>>> different from some systems, the order was not decided explicitly, but I
>>> don't think they are strong reasons. People from RDBMS should use the SQL
>>> standard TRIM syntax more often. People using prior Spark versions should
>>> have figured out the parameter order of Spark TRIM (there was no document)
>>> and adjusted their queries. There is no such standard that defines the
>>> parameter order of the TRIM function.
>>>
>>> In the long term, we would also promote the SQL standard TRIM syntax. I
>>> don't see much benefit of "fixing" the parameter order that worth to make a
>>> breaking change.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>>
>>>
>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>> and the TRIM(trimStr FROM str) syntax.
>>>>
>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>
>>>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'm sending this email because the Apache Spark committers had better
>>>>> have a consistent point of views for the upcoming PRs. And, the community
>>>>> policy is the way to lead the community members transparently and clearly
>>>>> for a long term good.
>>>>>
>>>>> First of all, I want to emphasize that, like Apache Spark 2.0.0,
>>>>> Apache Spark 3.0.0 is going to achieve many improvement in SQL areas.
>>>>>
>>>>> Especially, we invested lots of efforts to improve SQL ANSI compliance
>>>>> and related test coverage for the future. One simple example is the
>>>>> following.
>>>>>
>>>>>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>>>>>
>>>>> With this support, we are able to move away from one of esoteric Spark
>>>>> function `TRIM/LTRIM/RTRIM`.
>>>>> (Note that the above syntax is ANSI standard, but we have additional
>>>>> non-standard these functions, too.)
>>>>>
>>>>> Here is the summary of the current status.
>>>>>
>>>>>     +------------------------+-------------+---------------+
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Takeshi Yamamuro <li...@gmail.com>.
The revert looks fine to me for keeping compatibility.
Also, I think the different parameter orders between systems easily lead to
mistakes, so, as Wenchen suggested, it would be good to deprecate these
two-parameter trim functions in future releases:
https://github.com/apache/spark/pull/27540#discussion_r377682518
Actually, I think the SQL TRIM syntax is enough for trim use cases.
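For readers following along, the behavior difference in the thread can be
sanity-checked with Python's built-in strip family, which implements the same
character-set trimming semantics (an analogy only, not Spark code):

```python
# PostgreSQL/Presto semantics: trim(srcStr, trimStr) removes any of the
# trim characters from both ends of the source string.
print('yxTomxx'.strip('xyz'))   # prints: Tom

# Spark 2.3/2.4 interpreted the SAME call with the arguments swapped,
# i.e. it trimmed 'yxTomxx' characters from the ends of 'xyz':
print('xyz'.strip('yxTomxx'))   # prints: z
```

This reproduces exactly the 'Tom' vs 'z' results quoted earlier in the thread.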

Bests,
Takeshi



-- 
---
Takeshi Yamamuro

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Maxim Gekk <ma...@databricks.com>.
Also, if we look at the possible combinations of trim parameters:
1. foldable srcStr + foldable trimStr
2. foldable srcStr + non-foldable trimStr
3. non-foldable srcStr + foldable trimStr
4. non-foldable srcStr + non-foldable trimStr

Case #2 seems rare, and case #3 is probably the most common. Whenever we see
the second case, we could output a warning with a notice about the order of
parameters.
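A minimal sketch of that warning rule, assuming the analyzer can tell whether
each argument is foldable (the function name and return convention here are
hypothetical, not actual Spark internals):

```python
def check_trim_arg_order(src_is_foldable, trim_is_foldable):
    """Return a warning for the suspicious case #2, or None.

    A literal (foldable) srcStr combined with a non-literal trimStr is
    rare in practice and usually means the arguments were swapped.
    """
    if src_is_foldable and not trim_is_foldable:
        return ("TRIM: the source string is a literal but the trim "
                "characters are not; check the parameter order")
    return None

# Case #3 (column srcStr, literal trimStr) is the common case: no warning.
print(check_trim_arg_order(False, True))   # prints: None
# Case #2 triggers the notice.
print(check_trim_arg_order(True, False))
```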

Maxim Gekk

Software Engineer

Databricks, Inc.



Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Wenchen Fan <cl...@gmail.com>.
It's unfortunate that we don't have a clear document about breaking
changes (I'm working on one, BTW). I believe the general guidance is: *avoid
breaking changes unless we have to*. For example, the previous result was
so broken that we had to fix it, moving to Scala 2.12 made us break some
APIs, etc.

For this particular case, do we have to switch the parameter order? It's
different from some systems, and the order was not decided explicitly, but I
don't think these are strong reasons. People coming from an RDBMS should use
the SQL standard TRIM syntax more often. People using prior Spark versions
should have figured out the parameter order of Spark TRIM (there was no
documentation) and adjusted their queries. There is no standard that defines
the parameter order of the TRIM function.

In the long term, we would also promote the SQL standard TRIM syntax. I
don't see much benefit in "fixing" the parameter order that is worth making a
breaking change.

Thanks,
Wenchen



On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Please note that the context if TRIM/LTRIM/RTRIM with two-parameters and
> TRIM(trimStr FROM str) syntax.
>
> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>
> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I'm sending this email because the Apache Spark committers had better
>> have a consistent point of views for the upcoming PRs. And, the community
>> policy is the way to lead the community members transparently and clearly
>> for a long term good.
>>
>> First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
>> Spark 3.0.0 is going to achieve many improvement in SQL areas.
>>
>> Especially, we invested lots of efforts to improve SQL ANSI compliance
>> and related test coverage for the future. One simple example is the
>> following.
>>
>>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>>
>> With this support, we are able to move away from one of esoteric Spark
>> function `TRIM/LTRIM/RTRIM`.
>> (Note that the above syntax is ANSI standard, but we have additional
>> non-standard these functions, too.)
>>
>> Here is the summary of the current status.
>>
>>     +------------------------+-------------+---------------+
>>     | SQL Progressing Engine | TRIM Syntax | TRIM Function |
>>     +------------------------------------------------------+
>>     | Spark 3.0.0-preview/2  |      O      |       O       |
>>     | PostgreSQL             |      O      |       O       |
>>     | Presto                 |      O      |       O       |
>>     | MySQL                  |      O(*)   |       X       |
>>     | Oracle                 |      O      |       X       |
>>     | MsSQL                  |      O      |       X       |
>>     | Terradata              |      O      |       X       |
>>     | Hive                   |      X      |       X       |
>>     | Spark 1.6 ~ 2.2        |      X      |       X       |
>>     | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
>>     +------------------------+-------------+---------------+
>>     * MySQL doesn't allow multiple trim character
>>     * Spark 2.3 ~ 2.4 have the function in a different way.
>>
>> Here is the illustrative example of the problem.
>>
>>     postgres=# SELECT trim('yxTomxx', 'xyz');
>>     btrim
>>     -------
>>     Tom
>>
>>     presto:default> SELECT trim('yxTomxx', 'xyz');
>>     _col0
>>     -------
>>     Tom
>>
>>     spark-sql> SELECT trim('yxTomxx', 'xyz');
>>     z
>>
>> Here is our history to fix the above issue.
>>
>>     [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
>>     1. https://github.com/apache/spark/pull/24902
>>        (Merged 2019-06-18 for v3.0.0, 3.0.0-preview and 3.0.0-preview2
>> released.)
>>     2. https://github.com/apache/spark/pull/24907
>>        (Merged 2019-06-20 for v2.3.4, but reverted)
>>     3. https://github.com/apache/spark/pull/24908
>>        (Merged 2019-06-21 for v2.4.4, but reverted)
>>
>> (2) and (3) were reverted before releases because we didn't want to fix
>> that in the maintenance releases. Please see the following references of
>> the decision.
>>
>>     https://github.com/apache/spark/pull/24908#issuecomment-504799028
>> (2.3)
>>     https://github.com/apache/spark/pull/24907#issuecomment-504799021
>> (2.4)
>>
>> Now, there are some requests to revert SPARK-28093 and to keep these
>> esoteric functions for backward compatibility and the following reason.
>>
>>     > Reordering function parameters to match another system,
>>     > for a method that is otherwise working correctly,
>>     > sounds exactly like a cosmetic change to me.
>>
>>     > How can we silently change the parameter of an existing SQL
>> function?
>>     > I don't think this is a correctness issue as the SQL standard
>>     > doesn't say that the function signature have to be trim(srcStr,
>> trimStr).
>>
>> The concern and the point of views make sense.
>>
>> My concerns are the followings.
>>
>>     1. This kind of esoteric differences are called `vendor-lock-in`
>> stuffs in a negative way.
>>        - It's difficult for new users to understand.
>>        - It's hard to migrate between Apache Spark and the others.
>>     2. Although we did our bests, Apache Spark SQL has not been enough
>> always.
>>     3. We need to do improvement in the future releases.
>>
>> In short, we can keep 3.0.0-preview behaviors here or revert SPARK-28093
>> in order to keep this vendor-lock-in stuffs for backward-compatibility.
>>
>> What I want is building a consistent point of views in this category for
>> the upcoming PR reviews.
>>
>> If we have a clear policy, we can save future community efforts in many
>> ways.
>>
>> Bests,
>> Dongjoon.
>>
>

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

Posted by Dongjoon Hyun <do...@gmail.com>.
Please note that the context is TRIM/LTRIM/RTRIM with two parameters and
the TRIM(trimStr FROM str) syntax.

This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
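To make that scope concrete: the one-parameter form only strips whitespace
and has no ordering question, while the two-parameter form is where the
systems disagree (again in Python-analogy form, not Spark code):

```python
# One-parameter form: trims whitespace; there is no second argument to
# misorder, so it is out of scope for this discussion.
print('  Tom  '.strip())        # prints: Tom

# Two-parameter form: the argument order is what differs between
# Spark 2.3/2.4 and engines like PostgreSQL or Presto.
print('yxTomxx'.strip('xyz'))   # prints: Tom
```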
