Posted to dev@spark.apache.org by "Driesprong, Fokko" <fo...@driesprong.frl> on 2019/12/01 11:24:17 UTC

Re: [DISCUSS] PostgreSQL dialect

+1 (non-binding)

Cheers, Fokko

On Thu, 28 Nov 2019 at 03:47, Dongjoon Hyun <do...@gmail.com> wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> from the codebase before it's too late. Currently we only have 3 features
>> under PostgreSQL dialect:
>> I personally think we could at least stop work on the dialect until
>> 3.0 is released.
>>
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>> To me, the major concern is that the code base becomes complicated,
>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>> one big flag `spark.sql.dialect` and isolating related code in #25697
>>> <https://github.com/apache/spark/pull/25697>, but it seems hard to keep
>>> the result clean.
>>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>>> mode, which can be confusing sometimes.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>>> One particular negative effect has been that new postgresql tests add
>>>>> well over an hour to tests,
>>>>
>>>>
>>>> Adding PostgreSQL tests improves the test coverage of Spark
>>>> SQL. We should continue to do this by importing more test cases. The
>>>> quality of Spark highly depends on the test coverage. We can further
>>>> parallelize the test execution to reduce the test time.
>>>>
>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>
>>>>
>>>> This should not be our current focus. In the near future, it is
>>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>>> adding features that are useful to the Spark community. PostgreSQL is a
>>>> good reference, but we do not need to follow it blindly. We have already
>>>> closed multiple related JIRAs that tried to add PostgreSQL features that
>>>> are not commonly used.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>>> mszymkiewicz@gmail.com> wrote:
>>>>
>>>>> I think it is important to distinguish between two different concepts:
>>>>>
>>>>>    - Adherence to standards and their well-established
>>>>>    implementations.
>>>>>    - Enabling migrations from some product X to Spark.
>>>>>
>>>>> While these two problems are related, they are independent, and one
>>>>> can be achieved without the other.
>>>>>
>>>>>    - The former approach doesn't imply that all features of the SQL
>>>>>    standard (or its specific implementation) are provided. It is sufficient
>>>>>    that the commonly used features that are implemented are standard
>>>>>    compliant. Therefore, if an end user applies some well-known pattern,
>>>>>    things will work as expected.
>>>>>
>>>>>    In my personal opinion, that's something that is worth the required
>>>>>    development resources and, in general, should happen within the project.
>>>>>
>>>>>
>>>>>    - The latter one is more complicated. First of all, the premise
>>>>>    that one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>>>>>    While both Spark and PostgreSQL evolve, and probably have more in common
>>>>>    today than a few years ago, they're not close enough to pretend that
>>>>>    one can be a replacement for the other. In contrast, existing
>>>>>    compatibility layers between major vendors make sense, because feature
>>>>>    disparity (at least when it comes to core functionality) is usually
>>>>>    minimal. And that doesn't even touch the problem that PostgreSQL provides
>>>>>    extensively used extension points that enable a broad and evolving
>>>>>    ecosystem (what should we do about continuous queries? Should Structured
>>>>>    Streaming provide some compatibility layer as well?).
>>>>>
>>>>>    More realistically, Spark could provide a compatibility layer with
>>>>>    some analytical tools that themselves provide some PostgreSQL
>>>>>    compatibility, but these are not always fully compatible with upstream
>>>>>    PostgreSQL, nor do they necessarily follow the latest PostgreSQL
>>>>>    development.
>>>>>
>>>>>    Furthermore, a compatibility layer can be, within certain limits
>>>>>    (i.e. the availability of required primitives), maintained as a separate
>>>>>    project, without putting more strain on existing resources. Effectively,
>>>>>    what we care about here is whether we can translate a given SQL string
>>>>>    into a logical or physical plan (see the sketch below).
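
For illustration only, a minimal sketch of what such an external compatibility
layer could look like: a string-level rewriter that translates a couple of
PostgreSQL-flavoured constructs into Spark SQL before handing the query to an
ordinary SparkSession. The `PgCompat` object, its rewrite rules, and the
example query are hypothetical, not an existing project or API.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, maintained outside the Spark codebase: rewrite a few
// PostgreSQL-specific function names into their Spark SQL equivalents, then
// delegate to the normal SparkSession. A sketch, not an existing library.
object PgCompat {
  private val rewrites: Seq[(String, String)] = Seq(
    "now\\(\\)"    -> "current_timestamp()", // PostgreSQL now() -> Spark current_timestamp()
    "random\\(\\)" -> "rand()"               // PostgreSQL random() -> Spark rand()
  )

  // Translate a PostgreSQL-style SQL string into (hopefully) valid Spark SQL.
  def translate(pgSql: String): String =
    rewrites.foldLeft(pgSql) { case (sql, (from, to)) => sql.replaceAll(from, to) }

  def sql(spark: SparkSession, pgSql: String): DataFrame =
    spark.sql(translate(pgSql))
}

// Usage (assuming an active SparkSession named `spark` and a hypothetical
// `events` table):
//   PgCompat.sql(spark, "SELECT id, random(), now() FROM events")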
>>>>>
>>>>>
>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently we started an effort to achieve feature parity between Spark
>>>>> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>
>>>>> This goes very well. We've added many missing features (parser rules,
>>>>> built-in functions, etc.) to Spark, and also corrected several
>>>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>>>> Many thanks to all the people who contributed to it!
>>>>>
>>>>> There are several cases when adding a PostgreSQL feature:
>>>>> 1. Spark doesn't have this feature: just add it.
>>>>> 2. Spark has this feature, but the behavior is different:
>>>>>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>>> standard and PostgreSQL, with a legacy config to restore the old behavior.
>>>>>     2.2 Spark's behavior makes sense but violates the SQL standard: change
>>>>> the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
>>>>> enabled (default false; see the example after this list).
>>>>>     2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
>>>>> add the PostgreSQL behavior under the PostgreSQL dialect (the default is
>>>>> the Spark native dialect).
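
As a concrete illustration of case 2.2, a minimal sketch of how the switch
looks from a user's point of view, assuming the `spark.sql.ansi.enabled` flag
introduced by this effort (default false) and an active SparkSession named
`spark`, as in spark-shell; exact error behavior and messages may differ
between Spark versions.

// Sketch only: cast behavior with and without ANSI mode.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()   // Spark-native behavior: NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()   // ANSI behavior: the invalid cast
                                                // fails at runtime instead of
                                                // silently returning NULL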
>>>>>
>>>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>>>> too. For example, DB2 provides an Oracle dialect
>>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>
>>>>> .
>>>>>
>>>>> However, there are so many differences between Spark and PostgreSQL,
>>>>> including SQL parsing, type coercion, function/operator behavior, data
>>>>> types, etc. I'm afraid that we may spend a lot of effort on it, make the
>>>>> Spark codebase pretty complicated, and still not be able to provide a
>>>>> usable PostgreSQL dialect.
>>>>>
>>>>> Furthermore, it's not clear to me how many users have the requirement
>>>>> of migrating PostgreSQL workloads. I think it's much more important to make
>>>>> Spark ANSI-compliant first, which doesn't need that much work.
>>>>>
>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>>> while our own cast function is not ANSI-compliant yet. This makes me think
>>>>> that we should do something to properly prioritize ANSI mode over other
>>>>> dialects.
>>>>>
>>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>>>> from the codebase before it's too late. Currently we only have 3 features
>>>>> under the PostgreSQL dialect (illustrated below):
>>>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, ... are
>>>>> also accepted as true strings.
>>>>> 2. `date - date` returns an interval in Spark (SQL standard behavior),
>>>>> but returns an int in PostgreSQL.
>>>>> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL
>>>>> (there is no standard here).
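
For reference, a short sketch of how these three cases look from Spark SQL
under the native dialect, assuming an active SparkSession named `spark` as in
spark-shell; the expected results are those described above, and the exact
output formatting may vary by Spark version.

// 1. Cast string to boolean: prefixes such as 'tru' are only recognized
//    under the PostgreSQL dialect.
spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()   // NULL under the native dialect

// 2. date - date: an interval in Spark (SQL standard), an int in PostgreSQL.
spark.sql("SELECT DATE'2019-12-01' - DATE'2019-11-26'").show()

// 3. int / int: a double in Spark, an int in PostgreSQL.
spark.sql("SELECT 1 / 2").show()                    // 0.5 in Spark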
>>>>>
>>>>> We should still add PostgreSQL features that Spark doesn't have, or where
>>>>> Spark's behavior violates the SQL standard. But for the others, let's just
>>>>> update the answer files of the PostgreSQL tests.
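
For context on what updating the answer files means: the imported PostgreSQL
tests are golden-file tests, where each entry records a query, its result
schema, and its output, roughly as sketched below. The marker spelling and
schema strings here are approximate and may differ between Spark versions.

-- !query
select 1
-- !query schema
struct<1:int>
-- !query output
1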
>>>>>
>>>>> Any comments are welcome!
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej
>>>>>
>>>>>
>>>>
>>>> --
>>>> [image: Databricks Summit - Watch the talks]
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: [DISCUSS] PostgreSQL dialect

Posted by Yuanjian Li <xy...@gmail.com>.
Thanks, all of you, for joining the discussion.
The PR is available at https://github.com/apache/spark/pull/26763; all the
PostgreSQL dialect related PRs are linked in the description.
I'm hoping the authors can help with reviewing.

Best,
Yuanjian
