Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2024/04/12 02:54:44 UTC

[DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Hi, All.

Thanks to you, we've achieved many things and have ongoing SPIPs.
I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
by asking your opinions about Apache Spark's ANSI SQL mode.

    https://issues.apache.org/jira/browse/SPARK-44111
    Prepare Apache Spark 4.0.0

SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
items for 4.0.0 because it is a big behavior change.

    https://issues.apache.org/jira/browse/SPARK-44444
    Use ANSI SQL mode by default

Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0 and has
aimed to provide better Spark SQL compatibility in a standard way.
We also have a daily CI job to protect this behavior.

    https://github.com/apache/spark/actions/workflows/build_ansi.yml

However, it is still disabled by default behind this configuration, and there
are several known issues, e.g.,

    SPARK-41794 Reenable ANSI mode in test_connect_column
    SPARK-41547 Reenable ANSI mode in test_connect_functions
    SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard

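As a concrete illustration of the last item, here is a minimal PySpark sketch
of the indexing discrepancy behind SPARK-46374; the example values and aliases
are illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark's subscript operator counts from 0 today ...
    spark.sql("SELECT array(10, 20, 30)[0] AS first").show()  # first = 10
    # ... while element_at follows the 1-based ANSI convention.
    spark.sql("SELECT element_at(array(10, 20, 30), 1) AS first").show()  # first = 10
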
To be clear, we know that many DBMSes have their own implementations of
the SQL standard, and they are not identical. Like them, SPARK-44444 aims only
to enable the existing Spark configuration, `spark.sql.ansi.enabled=true`.
There is nothing more to it than that.

In other words, the current Spark ANSI SQL implementation becomes the first
behavior that Spark SQL users encounter, while `spark.sql.ansi.enabled=false`
remains available in the same way, so no capability is lost.
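
For users who prefer the legacy semantics, a minimal sketch of the opt-out
(shown at the session level; the same flag can also be set in
spark-defaults.conf or via --conf):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # SPARK-44444 only flips the default of this existing runtime flag;
    # setting it back restores the pre-4.0.0 behavior.
    spark.conf.set("spark.sql.ansi.enabled", "false")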

If we don't want this change for some reason, we can simply exclude
SPARK-44444 from SPARK-44111 as part of the Apache Spark 4.0.0 preparation.
It's simply time to make a go/no-go decision on this item as part of the
overall planning for the Apache Spark 4.0.0 release. After 4.0.0, it's
unlikely that we will aim for this again for the next four years, until 2028.

WDYT?

Bests,
Dongjoon

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by "serge rielau.com" <se...@rielau.com>.
+1. It's the wrapping on math overflows that does it for me.

Sent from my iPhone
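
For context, a minimal sketch of the overflow wrapping in question; the
session setup is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "false")

    # Today integer arithmetic silently wraps around on overflow ...
    spark.sql("SELECT 2147483647 + 1 AS wrapped").show()  # wrapped = -2147483648

    # ... whereas with ANSI mode the same query raises ARITHMETIC_OVERFLOW:
    spark.conf.set("spark.sql.ansi.enabled", "true")
    # spark.sql("SELECT 2147483647 + 1").show()  # raises instead of wrapping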



Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by huaxin gao <hu...@gmail.com>.
+1


Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by Wenchen Fan <cl...@gmail.com>.
+1, the existing "NULL on error" behavior is terrible for data quality.

I have one concern about error reporting with the DataFrame APIs: query
execution is lazy, so the place where an error surfaces can be far from where
the dataframe/column was created. We are improving this (PR
<https://github.com/apache/spark/pull/45377>), but it's not fully done yet.
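
A minimal sketch of this concern, with hypothetical column names; the invalid
cast is defined in one place but only executed at the action:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    df = spark.createDataFrame([("abc",)], ["s"])
    bad = df.select(F.col("s").cast("int"))  # lazily defined; no error here

    # The invalid cast only executes at the action, far from its definition:
    bad.collect()  # raises a CAST_INVALID_INPUT error under ANSI mode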


Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by "L. C. Hsieh" <vi...@gmail.com>.
+1

I believe ANSI mode is well developed after many releases; no doubt it
can be used.
Since it is very easy to disable it and restore the current behavior, I
guess the impact should be limited.
Do we know the possible impacts, such as what the major changes are
(e.g., what kinds of queries/expressions will fail)? We can describe
them in the release notes.
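
As a partial answer, a sketch of the most commonly cited classes of changes
(not an exhaustive list; the error names are the current error classes):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Each of these returns NULL (or silently wraps) without ANSI mode, but
    # raises an error once it is on; uncomment one at a time to see:
    # spark.sql("SELECT 1/0").show()                 # DIVIDE_BY_ZERO
    # spark.sql("SELECT CAST('abc' AS INT)").show()  # CAST_INVALID_INPUT
    # spark.sql("SELECT 2147483647 + 1").show()      # ARITHMETIC_OVERFLOW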



Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by Dongjoon Hyun <do...@apache.org>.
Thank you for your opinions, Gengliang, Liang-Chi, Wenchen, Huaxin, Serge, and Nicholas.

To Nicholas: the Apache Spark community already decided not to pursue the PostgreSQL dialect.

>  I’m flagging this since Spark’s behavior differs in these cases from Postgres,
> as described in the ticket.

Please see the following thread (November 26, 2019).

https://lists.apache.org/thread/v1fx1wkxh5sp6odjcyohppr5x67cyrov
[DISCUSS] PostgreSQL dialect

Given the consensus as it stands, I'll proceed to start a vote on this topic.

Thanks,
Dongjoon.



Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by Nicholas Chammas <ni...@gmail.com>.
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. 

Cases 2, 3, and 4 described in that ticket are still problems today on master (I just rechecked) even with ANSI mode enabled.

Well, maybe not problems, but I’m flagging this since Spark’s behavior differs in these cases from Postgres, as described in the ticket.




Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

Posted by Gengliang Wang <lt...@gmail.com>.
+1, enabling Spark's ANSI SQL mode in version 4.0 will significantly
enhance data quality and integrity. I fully support this initiative.

> In other words, the current Spark ANSI SQL implementation becomes the
> first behavior that Spark SQL users encounter, while
> `spark.sql.ansi.enabled=false` remains available in the same way, so no
> capability is lost.

BTW, the try_*
<https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#useful-functions-for-ansi-mode>
functions and SQL Error Attribution Framework
<https://issues.apache.org/jira/browse/SPARK-38615> will also be beneficial
in migrating to ANSI SQL mode.
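
A minimal sketch of how those helpers behave under ANSI mode; the column
aliases are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Plain 1/0 would raise DIVIDE_BY_ZERO under ANSI mode, but the try_*
    # variants opt back into the NULL-on-error behavior per expression.
    spark.sql("SELECT try_divide(1, 0) AS q").show()        # q = NULL
    spark.sql("SELECT try_cast('abc' AS INT) AS v").show()  # v = NULL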


Gengliang

