Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2020/03/16 02:54:48 UTC

[DISCUSS] Resolve ambiguous parser rule between two "create table"s

Hi devs,

I'd like to initiate a discussion and hear your voices on resolving the
ambiguous parser rules between the two "create table"s brought about by
SPARK-30098 [1].

Previously, the two "create table" parser rules were clearly distinguished via
"USING provider", which was very intuitive and deterministic: a DDL
query creates a "Hive" table unless "USING provider" is specified.
(Please refer to the parser rule in branch-2.4 [2].)

After SPARK-30098, the "create table" parser rules became ambiguous (please
refer to the parser rule in branch-3.0 [3]) - the only factors differentiating
the two rules are "ROW FORMAT" and "STORED AS", which are both defined as
"optional". The outcome now relies on the "order" of the parser rules, which
end users have no way to reason about, and is very unintuitive.

Furthermore, the undocumented handling of EXTERNAL (added to the first rule to
provide a better error message) brought more confusion (I've described the
broken existing query in SPARK-30436 [4]).
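
To make the ambiguity concrete, here is a hedged sketch (table and column
names are made up; behavior as I understand the branch-3.0 parser rules):

```sql
-- Spark 2.4: no "USING provider", so this created a Hive table.
-- Spark 3.0: it now matches the native rule first and uses the
-- default datasource (spark.sql.sources.default) instead.
CREATE TABLE t1 (id INT, name STRING);

-- Only a Hive-specific clause such as "STORED AS" (or "ROW FORMAT")
-- pushes the statement into the Hive rule in 3.0.
CREATE TABLE t2 (id INT, name STRING) STORED AS ORC;

-- Broken case from SPARK-30436: this worked in 2.4 but now fails,
-- because EXTERNAL was added to the first rule only to produce an error.
CREATE EXTERNAL TABLE t3 (id INT) LOCATION '/tmp/t3';
```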

Personally, I'd like to see the two rules made mutually exclusive, instead of
trying to document the difference and telling end users to be careful about
their queries. I see two ways to make the rules mutually exclusive:

1. Add some identifier in create Hive table rule, like `CREATE ... "HIVE"
TABLE ...`.

pros. This is the simplest way to distinguish between the two rules.
cons. This would require end users to change their queries if they intend to
create a Hive table. (Given that we will also provide a legacy option, I feel
this is acceptable.)

2. Define "ROW FORMAT" or "STORED AS" as mandatory.

pros. Less invasive for existing queries.
cons. Less intuitive, because these clauses have been optional and would now
become mandatory to make a query fall into the second rule.
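
A hedged sketch of how the two options might look (the exact marker keyword
in option 1 is undecided; names are illustrative):

```sql
-- Option 1: an explicit marker keyword selects the Hive rule
CREATE HIVE TABLE t (id INT) STORED AS ORC;    -- hypothetical syntax

-- Option 2: a mandatory Hive clause selects the Hive rule
CREATE TABLE t (id INT) STORED AS ORC;  -- Hive rule: "STORED AS" present
CREATE TABLE t (id INT);                -- native rule: no Hive clause
```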

Would like to hear everyone's voices; better ideas are welcome!

Thanks,
Jungtaek Lim (HeartSaVioR)

1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
https://issues.apache.org/jira/browse/SPARK-30098
2.
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
3.
https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
4. https://issues.apache.org/jira/browse/SPARK-30436

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Anything would be OK as long as the create table DDL provides a "clear way" to
determine the table provider "before" running the query. Great news that it
doesn't require major rework - looking forward to the PR.

Thanks again for jumping in and sorting this out.

- Jungtaek Lim (HeartSaVioR)

On Fri, Mar 20, 2020 at 9:36 AM Ryan Blue <rb...@netflix.com> wrote:

> I have an update to the parser that unifies the CREATE TABLE rules. It
> took surprisingly little work to get the parser updated to produce
> CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
> And the only fields I need to add to those statements were serde:
> SerdeInfo and external: Boolean.
>
> From there, we can use the conversion rules to re-create the same Hive
> command for v1 or pass the data as properties for v2. I’ll work on getting
> this cleaned up and open a PR hopefully tomorrow.
>
> For the questions about how this gets converted to either a Spark or Hive
> create table command, that is really up to analyzer rules and
> configuration. With my changes, it is no longer determined by the parser:
> the parser just produces a node that includes all of the user options and
> Spark decides what to do with that in the analyzer. Also, there's already
> an option to convert Hive syntax to a Spark
> command, spark.sql.hive.convertCTAS.
>
> rb
>
> On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Big +1 to have one single unified CREATE TABLE syntax.
>>
>> In general, we can say there are 2 ways to specify the table provider:
>> USING clause and ROW FORMAT/STORED AS clause. These 2 ways are mutually
>> exclusive. If none is specified, it implicitly indicates USING
>> defaultSource.
>>
>> I'm fine with a few special cases which can indicate the table provider,
>> like EXTERNAL indicating a Hive Serde table. A few thoughts:
>> 1. SKEWED BY ...: We support it in Hive syntax just to fail it with a
>> nice error message. We can support it in the unified syntax as well, and
>> fail it.
>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
>> syntax. Just make sure it doesn't appear together with PARTITIONED BY
>> transformList.
>> 3. OPTIONS: We can either map it to Hive Serde properties, or let it
>> indicate non-Hive tables.
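>>
>> As a hedged sketch of the provider selection described above (table names
>> are made up):
>>
>> ```sql
>> CREATE TABLE t1 (id INT);                -- neither clause: USING defaultSource
>> CREATE TABLE t2 (id INT) USING parquet;  -- explicit provider
>> CREATE TABLE t3 (id INT) STORED AS ORC;  -- Hive Serde table
>> ```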
>>
>> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
>>> TABLE USING HIVE FORMAT" as CREATE TABLE using the "HIVE FORMAT", but yes,
>>> it may add to the confusion.
>>>
>>> Ryan, thanks for the detailed analysis and proposal. That's what I would
>>> like to see in a discussion thread.
>>>
>>> I'm open to solutions which enable end users to specify their intention
>>> properly - my main concern with SPARK-30098 is that it becomes unclear which
>>> provider the query will use in create table unless USING provider is
>>> explicitly specified. If the new proposal makes this clear, that should
>>> be better than the status quo.
>>>
>>> Replying inline:
>>>
>>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> Side comment: The current docs for CREATE TABLE
>>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>>> actually part of the syntax
>>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
>>>> .
>>>>
>>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>>> syntaxes with their own consistency rules. For example, if I am using the
>>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>>> columns and requires types for those columns; if I’m using the Spark syntax
>>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>>> and cannot include types.
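>>>>>
>>>>> A hedged illustration of that divergence (names are made up):
>>>>>
>>>>> ```sql
>>>>> -- Hive rule: PARTITIONED BY defines new partition columns, types required
>>>>> CREATE TABLE logs (msg STRING)
>>>>> PARTITIONED BY (dt STRING);
>>>>>
>>>>> -- Spark rule: PARTITIONED BY references existing columns, no types allowed
>>>>> CREATE TABLE logs (msg STRING, dt STRING)
>>>>> USING parquet
>>>>> PARTITIONED BY (dt);
>>>>> ```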
>>>>>
>>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>>> think the right solution is to continue to have two rules with divergent
>>>>> syntax.
>>>>>
>>>>> This is confusing to users because they don’t know anything about
>>>>> separate parser rules. All the user sees is that sometimes PARTITION
>>>>> BY requires types and sometimes it doesn’t. Yes, we could add a
>>>>> keyword, HIVE, to signal that the syntax is borrowed from Hive for
>>>>> that case, but that actually breaks queries that run in Hive.
>>>>>
>>> That might matter less, because SPARK-30098 (and, I guess, your proposal
>>> as well) forces end users to add "USING HIVE" to their queries to enable
>>> the Hive provider in any case, even when the query matches rule 1 only
>>> conditionally. Once they decide to create a Hive table, the query might have
>>> to be changed, or they have to change the default provider, or they have to
>>> enable the legacy config.
>>>
>>>
>>>> I think the right solution is to unify the two syntaxes. I don’t think
>>>>> they are so different that it isn’t possible. Here are the differences I
>>>>> see:
>>>>>
>>>>>    - Only in Hive:
>>>>>       - EXTERNAL
>>>>>       - skewSpec: SKEWED BY ...
>>>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>>       - createFileFormat: STORED AS ...
>>>>>    - Only in Spark:
>>>>>       - OPTIONS property list
>>>>>    - Different syntax/interpretation:
>>>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>>>
>>> The ":" after a column name is another feature only supported in Hive,
>>> though it's relatively minor to support in a unified syntax.
>>>
>>>>
>>>>>    -
>>>>>
>>>>> For the clauses that are supported in one but not the other, we can
>>>>> add them to a unified rule as optional clauses. The AST builder would then
>>>>> validate what makes sense or not (e.g., stored as with using or row format
>>>>> delimited) and finally pass the remaining data on using the
>>>>> CreateTableStatement. That statement would be handled like we do for
>>>>> the Spark rule today, but with extra metadata to pass along. This is also a
>>>>> step toward being able to put Hive behind the DSv2 API because we’d be able
>>>>> to pass all of the Hive metadata clauses to the v2 catalog.
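>>>>>
>>>>> A sketch of the combinations such a builder might accept or reject
>>>>> (illustrative, not the final validation rules):
>>>>>
>>>>> ```sql
>>>>> -- ok: Spark clauses only
>>>>> CREATE TABLE t (id INT) USING parquet OPTIONS (k1=v1);
>>>>>
>>>>> -- ok: Hive clauses only
>>>>> CREATE TABLE t (id INT) STORED AS TEXTFILE;
>>>>>
>>>>> -- rejected by the AST builder: mixes USING with STORED AS
>>>>> CREATE TABLE t (id INT) USING parquet STORED AS ORC;
>>>>> ```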
>>>>>
>>>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>>>> in the AST builder to normalize to a single representation.
>>>>>
>>> The proposal looks promising - it may add some complexity, but it sounds
>>> worth adding.
>>>
>>> One thing to make clear - in the unified syntax we only rely on the
>>> explicit provider, or the default provider, right? I would be concerned if
>>> the proposal automatically used the Hive provider whenever Hive-specific
>>> clauses are used. Yes, as I said earlier, it may require end users' queries
>>> to be changed, but that's better than uncertainty.
>>>
>>> Btw, if the main purpose of adding the native syntax and changing the
>>> default is to discontinue supporting the Hive create table rule sooner,
>>> simply dropping rule 2 while providing a legacy config is still a valid
>>> option, I think.
>>>
>>>
>>>> What do you both think? This would make the behavior more clear and
>>>>> take a step toward getting rid of Hive-specific code.
>>>>>
>>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> I'm trying to understand the reason you have been suggesting to keep
>>>>>> the real behavior unchanged but change the doc instead. Could you please
>>>>>> elaborate why? End users would blame us when they hit the case where their
>>>>>> query doesn't work as intended (1), find that it's undocumented (2), and
>>>>>> find it hard to understand even from the Spark codebase (3).
>>>>>>
>>>>>> For me, addressing the root issue while adopting your suggestion would be
>>>>>> "dropping rule 2" and only supporting it with the legacy config on. We
>>>>>> would say to end users: you need to enable the legacy config to leverage
>>>>>> the Hive create table syntax, or just use beeline with Hive connected.
>>>>>>
>>>>>> But since we are even thinking about the native syntax as first class
>>>>>> and dropping the Hive one implicitly (hiding it in the doc) or explicitly,
>>>>>> does it really matter if we require a marker (like "HIVE") in rule 2 and
>>>>>> isolate it? It would cause even less confusion than Spark 2.x, since we
>>>>>> would require end users to add the marker when creating a Hive table,
>>>>>> which is easier to classify than "USING provider".
>>>>>>
>>>>>> If we think the native syntax would cover many of the cases where end
>>>>>> users have been creating Hive tables in Spark (say, USING hive would
>>>>>> simply work for them), I'm OK with dropping rule 2 and leading end users
>>>>>> to enable the legacy config if really needed. If not, let's continue
>>>>>> "fixing" the issue.
>>>>>>
>>>>>> (Another valid approach would be consolidating the two rules into one,
>>>>>> and defining the supported parameters per provider, e.g. EXTERNAL, STORED
>>>>>> AS, ROW FORMAT, etc. are only supported by the Hive provider.)
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing
>>>>>>> many users. Shall we only document the native syntax? Then users wouldn't
>>>>>>> need to worry about which rule their query fits, and they wouldn't need to
>>>>>>> spend a lot of time understanding the subtle differences between these 2
>>>>>>> syntaxes.
>>>>>>>
>>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> A small correction: the example I provided for vice versa is not
>>>>>>>> really a correct case of vice versa. It's actually the same case
>>>>>>>> (intending to use rule 2, which is not the default) but with a different
>>>>>>>> result.
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My concern is that although we think of the change as simply making
>>>>>>>>> "USING provider" optional in rule 1, in reality the change most likely
>>>>>>>>> swaps the default rule for CREATE TABLE, which was "rule 2" and is now
>>>>>>>>> "rule 1" (the happy case described in the migration doc assumes the swap
>>>>>>>>> happens as intended), and there are still a couple of things which make a
>>>>>>>>> query fall into rule 2, which is non-trivial to reason about and not easy
>>>>>>>>> to explain.
>>>>>>>>>
>>>>>>>>> I only mentioned ROW FORMAT and STORED AS as the clauses that make a
>>>>>>>>> query fall into rule 2 to simplify the problem statement, but they're not
>>>>>>>>> the only ones - using col_name1:col_type1 would make the query fall into
>>>>>>>>> rule 2 regardless of any other properties.
>>>>>>>>>
>>>>>>>>> Why does this matter? In Spark 2.x, if end users wanted to use rule 1
>>>>>>>>> (which was not the default) and specify the parameters only available in
>>>>>>>>> rule 1, it clearly required "USING provider" - the parser would throw an
>>>>>>>>> error if there were any mistakes in the parameters. End users could state
>>>>>>>>> clearly which rule the query should be bound to. If the query failed to
>>>>>>>>> bind to the intended rule, it was simply denied.
>>>>>>>>>
>>>>>>>>> In Spark 3.x, the parser may not help in the case where end users
>>>>>>>>> intend to use rule 2 (which is not the default) but make a mistake in the
>>>>>>>>> parameters - the query could just "silently" bind to rule 1 and may even
>>>>>>>>> be executed without any error. The reverse happens too, but in an odd way
>>>>>>>>> - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message
>>>>>>>>> that CREATE EXTERNAL TABLE is not supported.
>>>>>>>>>
>>>>>>>>> It's deterministic for end users only if they fully understand the
>>>>>>>>> difference between the two and also understand how Spark applies the
>>>>>>>>> rules to make the query fall into one of them. I'm not sure how we can
>>>>>>>>> improve only the documentation to make things clear; if the approach is
>>>>>>>>> to explain the difference between the rules and give tips on making a
>>>>>>>>> query bind to a specific rule, the same could be applied to the parser
>>>>>>>>> rules to address the root cause.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits
>>>>>>>>>> both native and Hive syntax. I'm fine with some changes to make it less
>>>>>>>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>>>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>>>>>>>> is false.
>>>>>>>>>>
>>>>>>>>>> I still don't get your point about what the real problem is for end
>>>>>>>>>> users. There is no ambiguity, as the behavior is deterministic, although
>>>>>>>>>> we rely on optional fields and rule order, which is bad. It's hard to
>>>>>>>>>> document, but I don't think that's a big problem for end users.
>>>>>>>>>>
>>>>>>>>>> For the legacy config, it does make the implementation more
>>>>>>>>>> complicated, but it's invisible to most end users (we don't document it)
>>>>>>>>>> and can be super useful to some users who want the queries to keep working
>>>>>>>>>> in 3.0 without rewriting.
>>>>>>>>>>
>>>>>>>>>> If your only concern is documentation, I totally agree that we
>>>>>>>>>> should improve it.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you that it's good for Spark to promote Spark's own
>>>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE
>>>>>>>>>>> syntax unchanged - it's described as a "convenience", but I'm not sure I
>>>>>>>>>>> can agree with that.
>>>>>>>>>>>
>>>>>>>>>>> I'll quote my comments from SPARK-31136 here again to make the
>>>>>>>>>>> problem statement clearer:
>>>>>>>>>>>
>>>>>>>>>>> I think the parser implementation around CREATE TABLE brings
>>>>>>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous before
>>>>>>>>>>> because we forced users to specify USING provider if it wasn't a Hive
>>>>>>>>>>> table. Now it's either the default provider or Hive according to which
>>>>>>>>>>> options are provided, which seems non-trivial to reason about. (End users
>>>>>>>>>>> would never know, as it comes entirely from the parser rules.)
>>>>>>>>>>>
>>>>>>>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>>>>>>>> rule gets quite complicated due to supporting the legacy config. Not
>>>>>>>>>>> breaking anything will make us get stuck eventually.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>>>>>
>>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will
>>>>>>>>>>> use the value of spark.sql.sources.default as its provider. In Spark
>>>>>>>>>>> version 2.4 and earlier, it was hive. To restore the behavior before Spark
>>>>>>>>>>> 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is not true if "ROW FORMAT" / "STORED AS" are provided, and
>>>>>>>>>>> we didn't describe this anywhere.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>   [ USING data_source ]
>>>>>>>>>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>>>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>>>>>     INTO num_buckets BUCKETS ]
>>>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>>>>>
>>>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>>>>>     | ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>   [ ROW FORMAT row_format ]
>>>>>>>>>>>   [ STORED AS file_format ]
>>>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> At least we should describe that the parser will try to match the
>>>>>>>>>>> first case (create table ~ using data source) and fall back to the
>>>>>>>>>>> second case; even if we describe this, it's not intuitive to reason
>>>>>>>>>>> about which rule a DDL query will fall into. As I commented earlier,
>>>>>>>>>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall
>>>>>>>>>>> into the second case, but they're described as "optional", so it's hard
>>>>>>>>>>> to catch the gotcha.
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>>>>>>>> allow "EXTERNAL" in the first rule (and throw an error), which ends up
>>>>>>>>>>> breaking the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even
>>>>>>>>>>> if we add "USING hive", the parser will throw an error. It now requires
>>>>>>>>>>> "ROW FORMAT" or "STORED AS".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Simply put, do we really think end users should stop and try to
>>>>>>>>>>> match their query against the parser rules (or their order, when we
>>>>>>>>>>> explain it in the doc) by themselves to understand which provider the
>>>>>>>>>>> table will use? I'm sorry, but I think we are making a bad assumption
>>>>>>>>>>> about end users, which is a serious problem.
>>>>>>>>>>>
>>>>>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would
>>>>>>>>>>> it really matter to treat the Hive CREATE TABLE as the "exceptional" one
>>>>>>>>>>> and isolate the two from each other? What's the point of providing a
>>>>>>>>>>> legacy config to go back to the old behavior if we fear breaking
>>>>>>>>>>> something by making it better or clearer? We do think the table provider
>>>>>>>>>>> is important (hence the change was made), so is it still a trivial
>>>>>>>>>>> problem that the provider is affected by specifying "optional" fields?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules were
>>>>>>>>>>>> mutually exclusive because the native syntax required the USING clause
>>>>>>>>>>>> while the Hive syntax makes the ROW FORMAT or STORED AS clauses
>>>>>>>>>>>> optional.
>>>>>>>>>>>>
>>>>>>>>>>>> It's a good move to make the USING clause optional, which makes
>>>>>>>>>>>> it easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>>>>>>>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>>>>>>>>>> serious problem here. If a user just writes CREATE TABLE without USING
>>>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter what table we create? Internally
>>>>>>>>>>>> the parser rules conflict and we pick the native syntax depending on the
>>>>>>>>>>>> rule order. But the user-facing behavior looks fine.
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not
>>>>>>>>>>>> in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE
>>>>>>>>>>>> syntax? Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
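>>>>>>>>>>>>
>>>>>>>>>>>> A hedged sketch of that idea (assuming EXTERNAL is removed from
>>>>>>>>>>>> the native rule; names are made up):
>>>>>>>>>>>>
>>>>>>>>>>>> ```sql
>>>>>>>>>>>> -- would match only the Hive rule and create a Hive table, as in 2.4
>>>>>>>>>>>> CREATE EXTERNAL TABLE t (id INT) LOCATION '/tmp/t';
>>>>>>>>>>>> ```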
>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I have an update to the parser that unifies the CREATE TABLE rules. It took
surprisingly little work to get the parser updated to produce
CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
And the only fields I need to add to those statements were serde: SerdeInfo
and external: Boolean.

From there, we can use the conversion rules to re-create the same Hive
command for v1 or pass the data as properties for v2. I’ll work on getting
this cleaned up and open a PR hopefully tomorrow.

For the questions about how this gets converted to either a Spark or Hive
create table command, that is really up to analyzer rules and
configuration. With my changes, it is no longer determined by the parser:
the parser just produces a node that includes all of the user options and
Spark decides what to do with that in the analyzer. Also, there's already
an option to convert Hive syntax to a Spark
command, spark.sql.hive.convertCTAS.

rb

On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <cl...@gmail.com> wrote:

> Big +1 to have one single unified CREATE TABLE syntax.
>
> In general, we can say there are 2 ways to specify the table provider:
> USING clause and ROW FORMAT/STORED AS clause. These 2 ways are mutually
> exclusive. If none is specified, it implicitly indicates USING
> defaultSource.
>
> I'm fine with a few special cases which can indicate the table provider,
> like EXTERNAL indicates Hive Serde table. A few thoughts:
> 1. SKEWED BY ...: We support it in Hive syntax just to fail it with a
> nice error message. We can support it in the unified syntax as well, and
> fail it.
> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
> syntax. Just make sure it doesn't appear together with PARTITIONED BY
> transformList.
> 3. OPTIONS: We can either map it to Hive Serde properties, or let it
> indicate non-Hive tables.
>
> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
>> TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes it
>> may add the confusion.
>>
>> Ryan, thanks for the detailed analysis and proposal. That's what I would
>> like to see in discussion thread.
>>
>> I'm open to solutions which enable end users to specify their intention
>> properly - my main concern of SPARK-30098 is that it becomes unclear which
>> provider the query will use in create table unless USING provider is
>> explicitly specified. If the new proposal makes clear on this, that should
>> be better than now.
>>
>> Replying inline:
>>
>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> Side comment: The current docs for CREATE TABLE
>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>> actually part of the syntax
>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
>>> .
>>>
>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>> syntaxes with their own consistency rules. For example, if I am using the
>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>> columns and requires types for those columns; if I’m using the Spark syntax
>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>> and cannot include types.
>>>>
>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>> think the right solution is to continue to have two rules with divergent
>>>> syntax.
>>>>
>>>> This is confusing to users because they don’t know anything about
>>>> separate parser rules. All the user sees is that sometimes PARTITION BY
>>>> requires types and sometimes it doesn’t. Yes, we could add a keyword,
>>>> HIVE, to signal that the syntax is borrowed from Hive for that case,
>>>> but that actually breaks queries that run in Hive.
>>>>
>>> That might less matter, because SPARK-30098 (and I guess your proposal
>> as well) enforces end users to add "USING HIVE" for their queries to enable
>> Hive provider in any way, even only when the query matches with rule 1
>> (conditional). Once they decide to create Hive table, the query might have
>> to be changed, or they have to change the default provider, or they have to
>> enable legacy config.
>>
>>
>>> I think the right solution is to unify the two syntaxes. I don’t think
>>>> they are so different that it isn’t possible. Here are the differences I
>>>> see:
>>>>
>>>>    - Only in Hive:
>>>>       - EXTERNAL
>>>>       - skewSpec: SKEWED BY ...
>>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>       - createFileFormat: STORED AS ...
>>>>    - Only in Spark:
>>>>       - OPTIONS property list
>>>>    - Different syntax/interpretation:
>>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>>
>>>> ":" after column name is another one only supported in Hive, though
>> that's relatively minor to support it in unified syntax.
>>
>>>
>>>>    -
>>>>
>>>> For the clauses that are supported in one but not the other, we can add
>>>> them to a unified rule as optional clauses. The AST builder would then
>>>> validate what makes sense or not (e.g., stored as with using or row format
>>>> delimited) and finally pass the remaining data on using the
>>>> CreateTableStatement. That statement would be handled like we do for
>>>> the Spark rule today, but with extra metadata to pass along. This is also a
>>>> step toward being able to put Hive behind the DSv2 API because we’d be able
>>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>>
>>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>>> in the AST builder to normalize to a single representation.
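For reference, the two PARTITIONED BY shapes that a unified rule would have to accept can be sketched as follows (illustrative DDL only; the table and column names are made up):

```sql
-- Hive-style: partition columns carry types and are NOT repeated
-- in the main column list.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING, region STRING)
STORED AS PARQUET;

-- Native-style: partition columns reference declared columns
-- (or transforms over them).
CREATE TABLE logs (msg STRING, dt STRING, region STRING)
USING parquet
PARTITIONED BY (dt, region);
```

As the proposal notes, the AST builder could detect whether types are present and normalize both forms to a single internal representation.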
>>>>
>>> The proposal looks promising - it may add some complexity, but it
>> sounds worth doing.
>>
>> One thing to make clear - in the unified syntax we rely only on the
>> explicit provider or the default provider, right? I would be concerned if
>> the proposal automatically used the Hive provider whenever Hive-specific
>> clauses are present. Yes, as I said earlier, it may force end users'
>> queries to change, but that's better than being uncertain.
>>
>> Btw, if the main purpose of adding the native syntax and making it the
>> default is to discontinue the Hive create table rule sooner, simply
>> dropping rule 2 and providing a legacy config is still a valid option, I
>> think.
>>
>>
>>> What do you both think? This would make the behavior more clear and take
>>>> a step toward getting rid of Hive-specific code.
>>>>
>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> I'm trying to understand the reason you have been suggesting to keep
>>>>> the real thing unchanged but change the doc instead. Could you please
>>>>> elaborate why? End users would blame us when they hit a case where
>>>>> their query doesn't work as intended (1), find that the behavior is
>>>>> undocumented (2), and find it hard to understand even from the Spark
>>>>> codebase (3).
>>>>>
>>>>> For me, addressing the root issue while adopting your suggestion would
>>>>> mean "dropping rule 2" and only supporting it with the legacy config
>>>>> on. We would tell end users: enable the legacy config to use the Hive
>>>>> create table syntax, or just use beeline connected to Hive.
>>>>>
>>>>> But since we are even considering the native syntax as first-class and
>>>>> dropping the Hive one implicitly (hiding it in the doc) or explicitly,
>>>>> does it really matter if we require a marker (like "HIVE") in rule 2 to
>>>>> isolate it? It would cause even less confusion than Spark 2.x, since we
>>>>> would require end users to add the Hive marker when creating a Hive
>>>>> table, which is easier to classify than "USING provider".
>>>>>
>>>>> If we think the native syntax covers most cases where end users have
>>>>> been creating Hive tables in Spark (say, USING hive would simply work
>>>>> for them), I'm OK with dropping rule 2 and leading end users to enable
>>>>> the legacy config if really needed. If not, let's continue "fixing" the
>>>>> issue.
>>>>>
>>>>> (Another valid approach would be consolidating the two rules into one
>>>>> and defining parameter support per provider, e.g. EXTERNAL, STORED AS,
>>>>> ROW FORMAT, etc. are only supported by the Hive provider.)
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The fact that we have 2 CREATE TABLE syntax is already confusing many
>>>>>> users. Shall we only document the native syntax? Then users don't need to
>>>>>> worry about which rule their query fits and they don't need to spend a lot
>>>>>> of time understanding the subtle difference between these 2 syntaxes.
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> A small correction: the example I provided for the vice-versa case is
>>>>>>> not really a vice-versa case. It's actually the same case (intending
>>>>>>> to use rule 2, which is not the default) but with a different result.
>>>>>>>
>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> My concern is that although we think of the change simply as making
>>>>>>>> "USING provider" optional in rule 1, in reality the change most
>>>>>>>> likely swaps the default rule for CREATE TABLE: it was "rule 2" and
>>>>>>>> is now "rule 1" (that would be the happy case from the migration doc,
>>>>>>>> if the swap happens as intended). And there are still a couple of
>>>>>>>> things that make a query fall into rule 2, which is non-trivial to
>>>>>>>> reason about and not easy to explain.
>>>>>>>>
>>>>>>>> I only mentioned ROW FORMAT and STORED AS as triggers for rule 2 to
>>>>>>>> simplify the problem statement, but they're not the only ones
>>>>>>>> - using col_name1:col_type1 makes the query fall into rule 2
>>>>>>>> regardless of any other properties.
>>>>>>>>
>>>>>>>> Why does this matter? In Spark 2.x, if end users want to use rule 1
>>>>>>>> (which is not the default) and specify parameters only available in
>>>>>>>> rule 1, the query clearly requires "USING provider" - the parser
>>>>>>>> throws an error if there is any mistake in the parameters. End users
>>>>>>>> can state clearly which rule the query should bind to. If the query
>>>>>>>> fails to bind to the rule as intended, it's simply rejected.
>>>>>>>>
>>>>>>>> In Spark 3.x, the parser may not help in the case where end users
>>>>>>>> intend to use rule 2 (which is not the default) but make a mistake
>>>>>>>> in the parameters - the query could just "silently" bind to rule 1
>>>>>>>> and may even execute without any error. The reverse also happens,
>>>>>>>> but in an odd way - e.g. CREATE EXTERNAL TABLE ... LOCATION fails
>>>>>>>> with the weird message that CREATE EXTERNAL TABLE is not supported.
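As a concrete sketch of the two failure modes described above (hypothetical queries; behavior as described in this thread for Spark 3.0 with default config):

```sql
-- Intended as a Hive table, but a typo drops STORED AS:
-- the query silently binds to rule 1 (native) and may run without error.
CREATE TABLE t1 (a INT) TBLPROPERTIES ('k'='v');

-- Matches rule 1 because EXTERNAL was added there to improve the error
-- message, so it fails with "CREATE EXTERNAL TABLE is not supported"
-- instead of creating a Hive external table as in Spark 2.4.
CREATE EXTERNAL TABLE t2 (a INT) LOCATION '/tmp/t2';
```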
>>>>>>>>
>>>>>>>> It's deterministic for end users only if they fully understand the
>>>>>>>> difference between the two and also understand how Spark applies the
>>>>>>>> rules to make a query fall into one of them. I'm not sure
>>>>>>>> documentation alone can make things clear; if the approach is to
>>>>>>>> explain the difference between the rules and give tips on binding a
>>>>>>>> query to a specific rule, the same effort could be applied to the
>>>>>>>> parser rules to address the root cause.
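The "match by rule order with optional clauses" problem can be illustrated with a toy classifier. This is a hypothetical sketch, not Spark's actual parser; it mirrors only the differentiators named in this thread (ROW FORMAT, STORED AS, and the Hive-only `name: type` column form):

```python
import re

def classify(ddl: str) -> str:
    """Toy model of the Spark 3.0 rule ordering discussed above.

    A query falls into the Hive rule ("rule 2") only when one of the
    Hive-only differentiators is present; otherwise the native rule
    ("rule 1") wins purely by rule order.
    """
    text = ddl.upper()
    if "ROW FORMAT" in text or "STORED AS" in text:
        return "rule 2 (Hive)"
    if re.search(r"\(\s*\w+\s*:", ddl):  # col_name1:col_type1 column syntax
        return "rule 2 (Hive)"
    return "rule 1 (native, default provider)"

# The same intent ("create a Hive table") lands on different rules
# depending on which optional clauses happen to be present:
print(classify("CREATE TABLE t (a INT) STORED AS PARQUET"))  # Hive rule
print(classify("CREATE TABLE t (a INT)"))                    # native rule
print(classify("CREATE TABLE t (a: INT)"))                   # Hive rule
```

The point of the toy is that the clauses steering the outcome are all syntactically optional, so nothing in the query forces the user to state which rule they meant.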
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits
>>>>>>>>> both native and Hive syntax. I'm fine with some changes to make it less
>>>>>>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>>>>>>> is false.
>>>>>>>>>
>>>>>>>>> I still don't see what the real problem is for end users. There is
>>>>>>>>> no ambiguity since the behavior is deterministic, although relying
>>>>>>>>> on optional fields and rule order is bad. It's hard to document, but
>>>>>>>>> I don't think that's a big problem for end users.
>>>>>>>>>
>>>>>>>>> For the legacy config, it does make the implementation more
>>>>>>>>> complicated, but it's invisible to most end users (we don't document it)
>>>>>>>>> and can be super useful to some users who want the queries to keep working
>>>>>>>>> in 3.0 without rewriting.
>>>>>>>>>
>>>>>>>>> If your only concern is documentation, I totally agree that we
>>>>>>>>> should improve it.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>>
>>>>>>>>>> I agree with you that it's good for Spark to promote its own
>>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>>>>>>>> TABLE syntax unchanged - it's described as a "convenience", but I'm
>>>>>>>>>> not sure I can agree with that.
>>>>>>>>>>
>>>>>>>>>> I'll quote my comments from SPARK-31136 here again to make the
>>>>>>>>>> problem statement clearer:
>>>>>>>>>>
>>>>>>>>>> I think the parser implementation around CREATE TABLE introduces
>>>>>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>>>>>>>> before, because we forced users to specify USING provider if it
>>>>>>>>>> wasn't a Hive table. Now the table uses either the default provider
>>>>>>>>>> or Hive depending on which options are provided, which is
>>>>>>>>>> non-trivial to reason about. (End users would never know, as it
>>>>>>>>>> comes entirely from the parser rules.)
>>>>>>>>>>
>>>>>>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>>>>>>> rules got quite complicated to support the legacy config. Not
>>>>>>>>>> breaking anything will make us stuck eventually.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>>>>
>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will
>>>>>>>>>> use the value of spark.sql.sources.default as its provider. In Spark
>>>>>>>>>> version 2.4 and earlier, it was hive. To restore the behavior before Spark
>>>>>>>>>> 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is not true if "ROW FORMAT" / "STORED AS" is provided, and we
>>>>>>>>>> don't describe that anywhere.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>   [ USING data_source ]
>>>>>>>>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>>>>     INTO num_buckets BUCKETS ]
>>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>>>>
>>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>>>>     | ( col_name1, col_name2, ... ) ]
>>>>>>>>>>   [ ROW FORMAT row_format ]
>>>>>>>>>>   [ STORED AS file_format ]
>>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> At the very least, we should describe that the parser will try to
>>>>>>>>>> match the first case (create table ~ using data source) and fall
>>>>>>>>>> back to the second case; even if we describe this, it's not
>>>>>>>>>> intuitive to reason about which rule a DDL query will fall into. As
>>>>>>>>>> I commented earlier, "ROW FORMAT" and "STORED AS" are the options
>>>>>>>>>> that make a DDL query fall into the second case, but they're
>>>>>>>>>> described as "optional", so the gotcha is hard to catch.
>>>>>>>>>>
>>>>>>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>>>>>>> allow "EXTERNAL" in the first rule (and throw an error), which ends
>>>>>>>>>> up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>>>>>>>> LOCATION". Even if we add "USING hive", the parser throws an error;
>>>>>>>>>> it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Simply put, do we really think end users should stop and match
>>>>>>>>>> their query against the parser rules (or the rule order, if we
>>>>>>>>>> explain it in the doc) by themselves to understand which provider
>>>>>>>>>> the table will use? I'm sorry, but I think we are making a bad
>>>>>>>>>> assumption about end users, which is a serious problem.
>>>>>>>>>>
>>>>>>>>>> If we really want to promote Spark's own CREATE TABLE, would it
>>>>>>>>>> really matter to treat the Hive CREATE TABLE as the "exceptional"
>>>>>>>>>> one and isolate the two from each other? What's the point of
>>>>>>>>>> providing a legacy config to go back to the old behavior if we fear
>>>>>>>>>> breaking something while making it better or clearer? We do think
>>>>>>>>>> the table provider is important (hence the change was made), so is
>>>>>>>>>> it really a trivial problem that the provider is affected by
>>>>>>>>>> specifying "optional" fields?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules
>>>>>>>>>>> were mutually exclusive, because the native syntax required the
>>>>>>>>>>> USING clause while the Hive syntax makes the ROW FORMAT and STORED
>>>>>>>>>>> AS clauses optional.
>>>>>>>>>>>
>>>>>>>>>>> It's a good move to make the USING clause optional, which makes
>>>>>>>>>>> it easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>>>>>>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>>>>>>>>> serious problem here. If a user just writes CREATE TABLE without USING
>>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter what table we create? Internally
>>>>>>>>>>> the parser rules conflict and we pick the native syntax depending on the
>>>>>>>>>>> rule order. But the user-facing behavior looks fine.
>>>>>>>>>>>
>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in
>>>>>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE
>>>>>>>>>>> syntax? Then CREATE EXTERNAL TABLE would create a Hive table, as
>>>>>>>>>>> in 2.4.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to initiate discussion and hear the voices for
>>>>>>>>>>>> resolving ambiguous parser rule between two "create table"s being brought
>>>>>>>>>>>> by SPARK-30098 [1].
>>>>>>>>>>>>
>>>>>>>>>>>> Previously, the "create table" parser rules were clearly
>>>>>>>>>>>> distinguished via "USING provider", which was very intuitive and
>>>>>>>>>>>> deterministic: a DDL query creates a "Hive" table unless "USING
>>>>>>>>>>>> provider" is specified.
>>>>>>>>>>>> (Please refer to the parser rule in branch-2.4 [2].)
>>>>>>>>>>>>
>>>>>>>>>>>> After SPARK-30098, the "create table" parser rules became
>>>>>>>>>>>> ambiguous (please refer to the parser rule in branch-3.0 [3]) -
>>>>>>>>>>>> the only factors differentiating the two rules are "ROW FORMAT"
>>>>>>>>>>>> and "STORED AS", which are both defined as "optional". Now the
>>>>>>>>>>>> outcome relies on the "order" of the parser rules, which end
>>>>>>>>>>>> users have no way to reason about, and which is very unintuitive.
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>>>>>>>>> first rule to provide a better error message) brought more
>>>>>>>>>>>> confusion (I've described an existing query that breaks in
>>>>>>>>>>>> SPARK-30436 [4]).
>>>>>>>>>>>>
>>>>>>>>>>>> Personally, I'd like to see the two rules made mutually
>>>>>>>>>>>> exclusive, instead of trying to document the difference and
>>>>>>>>>>>> telling end users to be careful with their queries. I see two
>>>>>>>>>>>> ways to make the rules mutually exclusive:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE
>>>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>>>
>>>>>>>>>>>> pros. This is the simplest way to distinguish between the two
>>>>>>>>>>>> rules.
>>>>>>>>>>>> cons. This would require end users to change their queries if
>>>>>>>>>>>> they intend to create a Hive table. (Given that we will also
>>>>>>>>>>>> provide a legacy option, I feel this is acceptable.)
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>>>>
>>>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>>>> cons. Less intuitive, because these clauses have been optional
>>>>>>>>>>>> and would now become mandatory to fall into the second rule.
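As a rough sketch of what each option could look like (hypothetical syntax; neither form existed at the time of this discussion):

```sql
-- Option 1: an explicit marker isolates the Hive rule.
CREATE HIVE TABLE t (a INT) STORED AS ORC;

-- Option 2: STORED AS (or ROW FORMAT) becomes mandatory in the Hive rule,
-- so a bare CREATE TABLE can only ever match the native rule.
CREATE TABLE t (a INT);                -- always native
CREATE TABLE t (a INT) STORED AS ORC;  -- always Hive
```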
>>>>>>>>>>>>
>>>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>
>>>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE
>>>>>>>>>>>> TABLE syntax
>>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>>>> 2.
>>>>>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>> 3.
>>>>>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Never mind, forget about the dead code. It turned out that reverting
"manually" can be done very easily - remove the config and update the
tests -> remove the field -> remove the changes to the docs. It was
considered non-trivial because we only thought about "git revert", but
there's no strict rule saying we must do it that way.

Please take a look at this PR, https://github.com/apache/spark/pull/28517

On Wed, May 13, 2020 at 3:10 AM Russell Spitzer <ru...@gmail.com>
wrote:

> I think that the dead code approach, while a bit unpalatable and worse
> than reverting, is probably better than leaving the parameter (even if it
> is hidden)
>
> On Tue, May 12, 2020 at 12:46 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> +1 for the approach Jungtaek suggests. That will avoid needing to support
>> behavior that is not well understood with minimal changes.
>>
>> On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Before I forget: we'd better also change the doc, as the create table
>>> doc describes the current syntax, which will become incorrect later.
>>>
>>> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> It's not only for end users, but also for us. Spark itself uses both
>>>> "true" and "false" for the config in tests, and that still brings
>>>> confusion. We still have to deal with both situations.
>>>>
>>>> I wonder how many days it would take to revert it cleanly, but if we
>>>> worry about the amount of code change right around the new RC, what
>>>> about making the code dirty (to be fixed soon) but with less headache,
>>>> by applying a traditional (and bad) approach?
>>>>
>>>> Let's just remove the config so that it cannot be used in any way (even
>>>> in the Spark codebase), and set the corresponding field in the parser to
>>>> a constant value so that no one can modify it. This creates dead code by
>>>> intention, which should be cleaned up later, so let's add a FIXME
>>>> comment there so that anyone can take up cleaning the code later. (If no
>>>> one volunteers then I'll probably pick it up.)
>>>>
>>>> That is a bad pattern, but still better, as we prevent end users (even
>>>> early adopters) from going through the undocumented path, and the code
>>>> will be explicitly marked as "should be fixed". This is different from
>>>> retaining the config - I don't expect the unified create table syntax to
>>>> land in a bugfix version, so even if it lands in 3.1.0 (which is also
>>>> not guaranteed), the config would live on in 3.0.x. If we temporarily go
>>>> the dirty way, we can clean up the code in any version, even a bugfix
>>>> version, maybe within a couple of weeks after 3.0.0 is released.
>>>>
>>>> Does it sound valid?
>>>>
>>>> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> SPARK-30098 was merged about 6 months ago. It's not a clean revert and
>>>>> we may need to spend quite a bit of time to resolve conflicts and fix tests.
>>>>>
>>>>> I don't see why it's still a problem if a feature is disabled and
>>>>> hidden from end-users (it's undocumented, the config is internal). The
>>>>> related code will be replaced in the master branch sooner or later, when we
>>>>> unify the syntaxes.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 12, 2020 at 6:16 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I'm all for getting the unified syntax into master. The only issue
>>>>>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>>>>>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>>>>>> that issue so we're not stuck for another 6 weeks on it.
>>>>>>
>>>>>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> Btw, another question here: is it good to retain the flag on
>>>>>>> master as an intermediate step? Wouldn't it be better for us to
>>>>>>> start the "unified create table syntax" from scratch?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm sorry, but I have to agree with Ryan and Russell. I chose
>>>>>>>> option 1 because it's less bad than option 2, but that doesn't mean
>>>>>>>> I fully agree with option 1.
>>>>>>>>
>>>>>>>> Let's make the things below clear if we really go with option 1;
>>>>>>>> otherwise, please consider reverting it.
>>>>>>>>
>>>>>>>> * Have you identified "all" the paths where the second create table
>>>>>>>> syntax is taken?
>>>>>>>> * Could you explain "why" to end users without any confusion? Do
>>>>>>>> you think end users will understand it easily?
>>>>>>>> * Do you have actual end users to guide in turning this on? Or do
>>>>>>>> you have a plan to turn this on for your team/customers and deal
>>>>>>>> with the ambiguity?
>>>>>>>> * Could you please document how things will change if the flag is
>>>>>>>> turned on?
>>>>>>>>
>>>>>>>> I guess option 1 means leaving the flag "undocumented" and
>>>>>>>> forgetting about the path it enables, but I think that would turn
>>>>>>>> the feature into a "broken window" we are not able to touch.
>>>>>>>>
>>>>>>>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think reverting SPARK-30098 is the right decision here if we want
>>>>>>>>> to unblock 3.0. We shouldn't ship features which we know do not
>>>>>>>>> function in the way we intend, regardless of how little exposure
>>>>>>>>> most users have to them. Even if it's off by default, we should
>>>>>>>>> probably work to avoid switches that cause things to behave
>>>>>>>>> unpredictably or require a flow chart to actually determine what
>>>>>>>>> will happen.
>>>>>>>>>
>>>>>>>>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue
>>>>>>>>> <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I'm all for fixing behavior in master by turning this off as an
>>>>>>>>>> intermediate step, but I don't think that Spark 3.0 can safely include
>>>>>>>>>> SPARK-30098.
>>>>>>>>>>
>>>>>>>>>> The problem is that SPARK-30098 introduces strange behavior, as
>>>>>>>>>> Jungtaek pointed out. And that behavior is not fully understood. While
>>>>>>>>>> working on a unified CREATE TABLE syntax, I hit additional test
>>>>>>>>>> failures
>>>>>>>>>> <https://github.com/apache/spark/pull/28026#issuecomment-606967363>
>>>>>>>>>> where the wrong create path was being used.
>>>>>>>>>>
>>>>>>>>>> Unless we plan to NOT support the behavior
>>>>>>>>>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>>>>>>>>>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>>>>>>>>>> with this problem for years to come.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 11, 2020 at 1:06 AM JackyLee <qc...@163.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1. Agree with Xiao Li and Jungtaek Lim.
>>>>>>>>>>>
>>>>>>>>>>> This seems to be controversial and cannot be done in a short
>>>>>>>>>>> time. It is necessary to choose option 1 to unblock Spark 3.0 and
>>>>>>>>>>> support this in 3.1.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Sent from:
>>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Russell Spitzer <ru...@gmail.com>.
I think that the dead code approach, while a bit unpalatable and worse than
reverting, is probably better than leaving the parameter (even if it is
hidden)

On Tue, May 12, 2020 at 12:46 PM Ryan Blue <rb...@netflix.com> wrote:

> +1 for the approach Jungtaek suggests. That will avoid needing to support
> behavior that is not well understood with minimal changes.
>
> On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> Before I forget, we'd better not forget to change the doc, as create
>> table doc looks to represent current syntax which will be incorrect later.
>>
>> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> It's not only for end users, but also for us. Spark itself uses the
>>> config "true" and "false" in tests and it still brings confusion. We still
>>> have to deal with both situations.
>>>
>>> I'm wondering how long days it would be needed to revert it cleanly, but
>>> if we worry about the amount of code change just around the new RC, what
>>> about make the code dirty (should be fixed soon) but less headache via
>>> applying traditional (and bad) way?
>>>
>>> Let's just remove the config so that the config cannot be used in any
>>> way (even in Spark codebase), and set corresponding field in parser to the
>>> constant value so that no one can modify in any way. This would make the
>>> dead code by intention which should be cleaned it up later, so let's add
>>> FIXME comment there so that anyone can take it up for cleaning up the code
>>> later. (If no one volunteers then I'll probably pick up.)
>>>
>>> That is a bad pattern, but still better as we prevent end users (even
>>> early adopters) go through the undocumented path in any way, and that will
>>> be explicitly marked as "should be fixed". This is different from retaining
>>> config - I don't expect unified create table syntax will be landed in
>>> bugfix version, so even unified create table syntax can be landed in 3.1.0
>>> (this is also not guaranteed) the config will live in 3.0.x in any way. If
>>> we temporarily go dirty way then we can clean up the code in any version,
>>> even from bugfix version, maybe within a couple of weeks just after 3.0.0
>>> is released.
>>>
>>> Does it sound valid?
>>>
>>> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> SPARK-30098 was merged about 6 months ago. It's not a clean revert and
>>>> we may need to spend quite a bit of time to resolve conflicts and fix tests.
>>>>
>>>> I don't see why it's still a problem if a feature is disabled and
>>>> hidden from end-users (it's undocumented, the config is internal). The
>>>> related code will be replaced in the master branch sooner or later, when we
>>>> unify the syntaxes.
>>>>
>>>>
>>>>
>>>> On Tue, May 12, 2020 at 6:16 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I'm all for getting the unified syntax into master. The only issue
>>>>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>>>>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>>>>> that issue so we're not stuck for another 6 weeks on it.
>>>>>
>>>>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Btw another wondering here is, is it good to retain the flag on
>>>>>> master as an intermediate step? Wouldn't it be better for us to
>>>>>> start "unified create table syntax" from scratch?
>>>>>>
>>>>>>
>>>>>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> I'm sorry, but I have to agree with Ryan and Russell. I chose the
>>>>>>> option 1 because it's less worse than option 2, but it doesn't mean I fully
>>>>>>> agree with option 1.
>>>>>>>
>>>>>>> Let's make below things clear if we really go with option 1,
>>>>>>> otherwise please consider reverting it.
>>>>>>>
>>>>>>> * Do you fully indicate about "all" the paths where the second
>>>>>>> create table syntax is taken?
>>>>>>> * Could you explain "why" to end users without any confusion? Do you
>>>>>>> think end users will understand it easily?
>>>>>>> * Do you have an actual end users to guide to turn this on? Or do
>>>>>>> you have a plan to turn this on for your team/customers and deal with
>>>>>>> the ambiguity?
>>>>>>> * Could you please document about how things will change if the flag
>>>>>>> is turned on?
>>>>>>>
>>>>>>> I guess the option 1 is to leave a flag as "undocumented" one and
>>>>>>> forget about the path to turn on, but I think that would lead to make the
>>>>>>> feature be "broken window" even we are not able to touch.
>>>>>>>
>>>>>>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think reverting 30098 is the right decision here if we want to
>>>>>>>> unblock 3.0. We shouldn't ship with features which we know do not function
>>>>>>>> in the way we intend, regardless of how little exposure most users have to
>>>>>>>> them. Even if it's off my default, we should probably work to avoid
>>>>>>>> switches that cause things to behave unpredictably or require a flow chart
>>>>>>>> to actually determine what will happen.
>>>>>>>>
>>>>>>>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm all for fixing behavior in master by turning this off as an
>>>>>>>>> intermediate step, but I don't think that Spark 3.0 can safely include
>>>>>>>>> SPARK-30098.
>>>>>>>>>
>>>>>>>>> The problem is that SPARK-30098 introduces strange behavior, as
>>>>>>>>> Jungtaek pointed out. And that behavior is not fully understood. While
>>>>>>>>> working on a unified CREATE TABLE syntax, I hit additional test
>>>>>>>>> failures
>>>>>>>>> <https://github.com/apache/spark/pull/28026#issuecomment-606967363>
>>>>>>>>> where the wrong create path was being used.
>>>>>>>>>
>>>>>>>>> Unless we plan to NOT support the behavior
>>>>>>>>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>>>>>>>>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>>>>>>>>> with this problem for years to come.
>>>>>>>>>
>>>>>>>>> On Mon, May 11, 2020 at 1:06 AM JackyLee <qc...@163.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1. Agree with Xiao Li and Jungtaek Lim.
>>>>>>>>>>
>>>>>>>>>> This seems to be controversial, and can not be done in a short
>>>>>>>>>> time. It is
>>>>>>>>>> necessary to choose option 1 to unblock Spark 3.0 and support it
>>>>>>>>>> in 3.1.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sent from:
>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1 for the approach Jungtaek suggests. That will avoid needing to support
behavior that is not well understood with minimal changes.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <ka...@gmail.com>
wrote:


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Before I forget: we should remember to update the docs, as the create
table doc appears to describe the current syntax, which will become
incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <ka...@gmail.com>
wrote:


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
It's not only for end users, but also for us. Spark itself uses both the
"true" and "false" values of the config in tests, and it still causes
confusion. We still have to deal with both situations.

I'm wondering how many days it would take to revert it cleanly, but if
we worry about the amount of code change right before the new RC, what
about making the code dirty (to be fixed soon) but with less headache,
via the traditional (and bad) way?

Let's just remove the config so that it cannot be used in any way (even
in the Spark codebase), and set the corresponding field in the parser to
a constant value so that no one can modify it. This would intentionally
create dead code that should be cleaned up later, so let's add a FIXME
comment there so that anyone can pick it up for cleanup later. (If no
one volunteers, then I'll probably pick it up.)

That is a bad pattern, but still better, as we prevent end users (even
early adopters) from going through the undocumented path in any way, and
it will be explicitly marked as "should be fixed". This is different
from retaining the config - I don't expect the unified create table
syntax to land in a bugfix version, so even if it lands in 3.1.0 (which
is also not guaranteed), the config would still live on in 3.0.x. If we
temporarily go the dirty way, we can clean up the code in any version,
even a bugfix version, maybe within a couple of weeks after 3.0.0
is released.

Does it sound valid?
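For concreteness, the flag in question is the SQL config toggled below
(config name as it appears earlier in this thread). This is a
hypothetical illustration of the proposal, not the actual patch: with
the config removed and the parser field pinned to a constant, neither
statement would change CREATE TABLE semantics anymore.

```sql
-- The legacy flag that currently switches which CREATE TABLE rule wins.
-- Under the proposal above it would be removed outright, so these SETs
-- would no longer have any effect on parsing.
SET spark.sql.legacy.createHiveTableByDefault.enabled = true;
SET spark.sql.legacy.createHiveTableByDefault.enabled = false;
```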

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan <cl...@gmail.com> wrote:


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we
may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden
from end-users (it's undocumented, the config is internal). The related
code will be replaced in the master branch sooner or later, when we unify
the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue <rb...@netflix.com.invalid> wrote:


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm all for getting the unified syntax into master. The only issue appears
to be whether or not to pass the presence of the EXTERNAL keyword through
to a catalog in v2. Maybe it's time to start a discuss thread for that
issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <ka...@gmail.com>
wrote:


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Btw, another thing I'm wondering: is it good to retain the flag on
master as an intermediate step? Wouldn't it be better for us to start
the "unified create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <ka...@gmail.com>
wrote:


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
because it's less bad than option 2, but that doesn't mean I fully agree
with option 1.

Let's make the things below clear if we really go with option 1;
otherwise, please consider reverting it.

* Have you fully identified "all" the paths where the second create
table syntax is taken?
* Could you explain "why" to end users without any confusion? Do you
think end users will understand it easily?
* Do you have actual end users to guide in turning this on? Or do you
have a plan to turn this on for your team/customers and deal with the
ambiguity?
* Could you please document how things will change if the flag is
turned on?

I guess option 1 is to leave the flag as an "undocumented" one and
forget about the path to turn it on, but I think that would turn the
feature into a "broken window" that we are not even able to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <ru...@gmail.com>
wrote:


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Russell Spitzer <ru...@gmail.com>.
I think reverting 30098 is the right decision here if we want to unblock
3.0. We shouldn't ship with features which we know do not function in the
way we intend, regardless of how little exposure most users have to them.
Even if it's off by default, we should probably work to avoid switches that
cause things to behave unpredictably or require a flow chart to actually
determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I'm all for fixing behavior in master by turning this off as an
> intermediate step, but I don't think that Spark 3.0 can safely include
> SPARK-30098.
>
> The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
> pointed out. And that behavior is not fully understood. While working on a
> unified CREATE TABLE syntax, I hit additional test failures
> <https://github.com/apache/spark/pull/28026#issuecomment-606967363> where
> the wrong create path was being used.
>
> Unless we plan to NOT support the behavior
> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
> with this problem for years to come.
>
> On Mon, May 11, 2020 at 1:06 AM JackyLee <qc...@163.com> wrote:
>
>> +1. Agree with Xiao Li and Jungtaek Lim.
>>
>> This seems to be controversial, and can not be done in a short time. It is
>> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm all for fixing behavior in master by turning this off as an
intermediate step, but I don't think that Spark 3.0 can safely include
SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
pointed out. And that behavior is not fully understood. While working on a
unified CREATE TABLE syntax, I hit additional test failures
<https://github.com/apache/spark/pull/28026#issuecomment-606967363> where
the wrong create path was being used.

Unless we plan to NOT support the behavior
when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
with this problem for years to come.

On Mon, May 11, 2020 at 1:06 AM JackyLee <qc...@163.com> wrote:

> +1. Agree with Xiao Li and Jungtaek Lim.
>
> This seems to be controversial, and can not be done in a short time. It is
> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by JackyLee <qc...@163.com>.
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial, and cannot be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Here's a WIP PR with the basic changes:
https://github.com/apache/spark/pull/28026

I still need to update tests in that branch and add the conversions to the
old Hive plans. But at least you can see how the parser part works and how
I'm converting the extra clauses for DSv2. This also enables us to support
Hive create syntax in DSv2.
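One way to picture "converting the extra clauses for DSv2" is a flat property map handed to a catalog plugin. This is purely illustrative — the property keys and function name below are assumptions for this sketch, not the reserved properties the actual PR defines:

```python
def hive_clauses_to_properties(serde=None, stored_as=None, external=False):
    """Illustrative mapping of Hive-style CREATE TABLE clauses onto a flat
    property map, roughly as a DSv2 catalog might receive them.
    The keys here are made up for this sketch, not Spark's actual
    reserved table properties."""
    props = {"provider": "hive"}
    if stored_as:
        # STORED AS <format>, e.g. parquet, orc, textfile
        props["hive.stored-as"] = stored_as
    if serde:
        # ROW FORMAT SERDE '<class>'
        props["hive.serde"] = serde
    if external:
        # EXTERNAL keyword becomes a boolean-ish property
        props["external"] = "true"
    return props

assert hive_clauses_to_properties(stored_as="parquet", external=True) == {
    "provider": "hive", "hive.stored-as": "parquet", "external": "true",
}
```

The design point is that once the clauses are plain properties, a single unified parser rule can feed both the built-in Hive path and any v2 catalog.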

On Wed, Mar 25, 2020 at 3:59 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> Would it be better to prioritize this to make sure the change is included
> in Spark 3.0? (Maybe filing an issue and set as a blocker)
>
> Looks like there's consensus that SPARK-30098 brought ambiguous issue
> which should be fixed (though the consideration of severity seems to be
> different), and once we notice the issue it would be really odd if we
> publish it as it is, and try to fix it later (the fix may not be even
> included in 3.0.x as it might bring behavioral change).
>
> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> It's great to hear that you are cleaning up this long-standing mess.
>> Please let me know if you hit any problems that I can help with.
>>
>> Thanks,
>> Wenchen
>>
>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>> BY transformList.
>>>>
>>>
>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>> TABLE syntax, we can also update Catalog.createTable() to support
>>> creating partitioned tables
>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Xiao Li <li...@databricks.com>.
>
> 1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
> which effectively reverts SPARK-30098. The CREATE TABLE syntax is still
> confusing, but it's the same as 2.4.
> 2. Do not support the v2 CreateTable command if STORED AS/BY or EXTERNAL is
> specified. This gives us more time to think about how to do it in 3.1.
>

I prefer to first turn on *spark.sql.legacy.createHiveTableByDefault.enabled*
by default and then start RC2.

We can still continue trying option 2 if we can finish it within 10
days. BTW, we still have multiple ongoing discussions about data source v2
APIs. To be honest, most Spark users will not hit these cases in Spark 3.0.
Thus, temporarily blocking a few cases in DSv2 looks reasonable to me. We
can support them in Spark 3.1.

Xiao





On Sun, May 10, 2020 at 9:32 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> Let's focus on how to unblock Spark 3.0.0 for now, as other blockers are
> getting resolved.
>
> I'm in favor of option 1 to avoid bring multiple backward incompatible
> changes. Unifying create table would bring backward incompatibility (I'd
> rather say the new syntax should be cleared up ignoring the backward
> compatibility) and we'd be better to not force end users to adopt the
> changes twice.
>
> On Fri, May 8, 2020 at 11:22 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to bring this up again to share the status and get more
>> feedback. Currently, we all agree to unify the CREATE TABLE syntax by
>> merging the native and Hive-style syntaxes.
>>
>> The unified CREATE TABLE syntax will become the native syntax and there
>> is no Hive-style syntax anymore. This brings several changes:
>> 1. support PARTITION BY (col type, ...). This can't co-exist with PARTITION
>> BY (col, ...), and simply adds partition columns to the end.
>> 2. support SKEWED BY, which just fails
>> 3. support STORE AS/BY, which can't co-exist with USING provider
>> 4. support EXTERNAL as well
>>
>> All the behaviors will remain the same as before, for the builtin
>> catalog. However, the native CREATE TABLE syntax needs to support the v2
>> CreateTable command and we need to translate the new syntax changes to
>> catalog plugin API calls, and we are still working on reaching an agreement
>> about how to do it.
>>
>> To unblock 3.0, I think there are two choices:
>> 1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
>> which effectively revert SPARK-30098. The CREATE TABLE syntax is still
>> confusing but it's the same as 2.4
>> 2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is
>> specified. This gives us more time to think about how to do it in 3.1.
>>
>> If you have other ideas, please reply to this thread.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Thanks, filed SPARK-31257
>>> <https://issues.apache.org/jira/browse/SPARK-31257>. Thanks again for
>>> looking into this - I'll take a look whenever I get time sooner.
>>>
>>> On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Feel free to open another issue, I just used that one since it
>>>> describes this and doesn't appear to be done.
>>>>
>>>> On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> UPDATE: Sorry I just missed the PR (
>>>>> https://github.com/apache/spark/pull/28026). I still think it'd be
>>>>> nice to avoid recycling the JIRA issue which was resolved before. Shall we
>>>>> have a new JIRA issue with linking to SPARK-30098, and set proper priority?
>>>>>
>>>>> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Would it be better to prioritize this to make sure the change is
>>>>>> included in Spark 3.0? (Maybe filing an issue and set as a blocker)
>>>>>>
>>>>>> Looks like there's consensus that SPARK-30098 brought ambiguous issue
>>>>>> which should be fixed (though the consideration of severity seems to be
>>>>>> different), and once we notice the issue it would be really odd if we
>>>>>> publish it as it is, and try to fix it later (the fix may not be even
>>>>>> included in 3.0.x as it might bring behavioral change).
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> It's great to hear that you are cleaning up this long-standing mess.
>>>>>>> Please let me know if you hit any problems that I can help with.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>>>>>> BY transformList.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Another side note: Perhaps as part of (or after) unifying the
>>>>>>>> CREATE TABLE syntax, we can also update Catalog.createTable() to
>>>>>>>> support creating partitioned tables
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
<https://databricks.com/sparkaisummit/north-america>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Let's focus on how to unblock Spark 3.0.0 for now, as other blockers are
getting resolved.

I'm in favor of option 1 to avoid bringing multiple backward-incompatible
changes. Unifying create table would bring backward incompatibility (I'd
rather say the new syntax should be cleaned up ignoring backward
compatibility), and we'd better not force end users to adopt the changes
twice.

On Fri, May 8, 2020 at 11:22 PM Wenchen Fan <cl...@gmail.com> wrote:

> Hi all,
>
> I'd like to bring this up again to share the status and get more feedback.
> Currently, we all agree to unify the CREATE TABLE syntax by merging the
> native and Hive-style syntaxes.
>
> The unified CREATE TABLE syntax will become the native syntax and there is
> no Hive-style syntax anymore. This brings several changes:
> 1. support PARTITION BY (col type, ...). This can't co-exist with PARTITION
> BY (col, ...), and simply adds partition columns to the end.
> 2. support SKEWED BY, which just fails
> 3. support STORE AS/BY, which can't co-exist with USING provider
> 4. support EXTERNAL as well
>
> All the behaviors will remain the same as before, for the builtin catalog.
> However, the native CREATE TABLE syntax needs to support the v2 CreateTable
> command and we need to translate the new syntax changes to catalog plugin
> API calls, and we are still working on reaching an agreement about how to
> do it.
>
> To unblock 3.0, I think there are two choices:
> 1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
> which effectively revert SPARK-30098. The CREATE TABLE syntax is still
> confusing but it's the same as 2.4
> 2. Do not support the v2 CreateTable command if STORE AS/BY or EXTERNAL is
> specified. This gives us more time to think about how to do it in 3.1.
>
> If you have other ideas, please reply to this thread.
>
> Thanks,
> Wenchen
>
> On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> Thanks, filed SPARK-31257
>> <https://issues.apache.org/jira/browse/SPARK-31257>. Thanks again for
>> looking into this - I'll take a look whenever I get time sooner.
>>
>> On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Feel free to open another issue, I just used that one since it describes
>>> this and doesn't appear to be done.
>>>
>>> On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> UPDATE: Sorry I just missed the PR (
>>>> https://github.com/apache/spark/pull/28026). I still think it'd be
>>>> nice to avoid recycling the JIRA issue which was resolved before. Shall we
>>>> have a new JIRA issue with linking to SPARK-30098, and set proper priority?
>>>>
>>>> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> Would it be better to prioritize this to make sure the change is
>>>>> included in Spark 3.0? (Maybe filing an issue and set as a blocker)
>>>>>
>>>>> Looks like there's consensus that SPARK-30098 brought ambiguous issue
>>>>> which should be fixed (though the consideration of severity seems to be
>>>>> different), and once we notice the issue it would be really odd if we
>>>>> publish it as it is, and try to fix it later (the fix may not be even
>>>>> included in 3.0.x as it might bring behavioral change).
>>>>>
>>>>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> It's great to hear that you are cleaning up this long-standing mess.
>>>>>> Please let me know if you hit any problems that I can help with.
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>
>>>>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>>>>> BY transformList.
>>>>>>>>
>>>>>>>
>>>>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>>>>> creating partitioned tables
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
Hi all,

I'd like to bring this up again to share the status and get more feedback.
Currently, we all agree to unify the CREATE TABLE syntax by merging the
native and Hive-style syntaxes.

The unified CREATE TABLE syntax will become the native syntax and there is
no Hive-style syntax anymore. This brings several changes:
1. support PARTITIONED BY (col type, ...). This can't co-exist with
PARTITIONED BY (col, ...), and simply adds partition columns to the end.
2. support SKEWED BY, which just fails
3. support STORED AS/BY, which can't co-exist with USING provider
4. support EXTERNAL as well

All the behaviors will remain the same as before, for the builtin catalog.
However, the native CREATE TABLE syntax needs to support the v2 CreateTable
command and we need to translate the new syntax changes to catalog plugin
API calls, and we are still working on reaching an agreement about how to
do it.

To unblock 3.0, I think there are two choices:
1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
which effectively reverts SPARK-30098. The CREATE TABLE syntax is still
confusing, but it's the same as 2.4.
2. Do not support the v2 CreateTable command if STORED AS/BY or EXTERNAL is
specified. This gives us more time to think about how to do it in 3.1.

If you have other ideas, please reply to this thread.

Thanks,
Wenchen
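The ambiguity the two choices are working around can be sketched as a toy decision function. This is not Spark's actual parser logic (which lives in the ANTLR grammar and AstBuilder); the function name and flags are hypothetical, purely to show why the branch taken becomes invisible when the distinguishing clauses are optional:

```python
def resolve_create_table(has_using, has_row_format, has_stored_as,
                         legacy_hive_by_default):
    """Toy model of which CREATE TABLE path a statement takes.

    Only the optional ROW FORMAT / STORED AS clauses (and a config flag)
    pick the branch when a USING clause is absent -- which is exactly
    the ambiguity discussed in this thread.
    """
    if has_using:
        return "datasource"   # native syntax: CREATE TABLE ... USING provider
    if has_row_format or has_stored_as:
        return "hive"         # Hive-style syntax is unambiguous here
    # The ambiguous case: a bare CREATE TABLE with no distinguishing clause.
    return "hive" if legacy_hive_by_default else "datasource"

# With the legacy flag on (choice 1), a bare CREATE TABLE behaves like 2.4:
assert resolve_create_table(False, False, False, True) == "hive"
# With the flag off (SPARK-30098 behavior), the same statement changes meaning:
assert resolve_create_table(False, False, False, False) == "datasource"
```

End users never see the config, only that the same statement silently creates a different kind of table, which is why both choices aim to take the flag out of the decision for 3.0.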

On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <ka...@gmail.com>
wrote:

> Thanks, filed SPARK-31257
> <https://issues.apache.org/jira/browse/SPARK-31257>. Thanks again for
> looking into this - I'll take a look whenever I get time sooner.
>
> On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Feel free to open another issue, I just used that one since it describes
>> this and doesn't appear to be done.
>>
>> On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> UPDATE: Sorry I just missed the PR (
>>> https://github.com/apache/spark/pull/28026). I still think it'd be nice
>>> to avoid recycling the JIRA issue which was resolved before. Shall we have
>>> a new JIRA issue with linking to SPARK-30098, and set proper priority?
>>>
>>> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> Would it be better to prioritize this to make sure the change is
>>>> included in Spark 3.0? (Maybe filing an issue and set as a blocker)
>>>>
>>>> Looks like there's consensus that SPARK-30098 brought ambiguous issue
>>>> which should be fixed (though the consideration of severity seems to be
>>>> different), and once we notice the issue it would be really odd if we
>>>> publish it as it is, and try to fix it later (the fix may not be even
>>>> included in 3.0.x as it might bring behavioral change).
>>>>
>>>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> It's great to hear that you are cleaning up this long-standing mess.
>>>>> Please let me know if you hit any problems that I can help with.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>>>> BY transformList.
>>>>>>>
>>>>>>
>>>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>>>> creating partitioned tables
>>>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>>>
>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Thanks, filed SPARK-31257
<https://issues.apache.org/jira/browse/SPARK-31257>. Thanks again for
looking into this - I'll take a look as soon as I get time.

On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue <rb...@netflix.com> wrote:

> Feel free to open another issue, I just used that one since it describes
> this and doesn't appear to be done.
>
> On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> UPDATE: Sorry I just missed the PR (
>> https://github.com/apache/spark/pull/28026). I still think it'd be nice
>> to avoid recycling the JIRA issue which was resolved before. Shall we have
>> a new JIRA issue with linking to SPARK-30098, and set proper priority?
>>
>> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Would it be better to prioritize this to make sure the change is
>>> included in Spark 3.0? (Maybe filing an issue and set as a blocker)
>>>
>>> Looks like there's consensus that SPARK-30098 brought ambiguous issue
>>> which should be fixed (though the consideration of severity seems to be
>>> different), and once we notice the issue it would be really odd if we
>>> publish it as it is, and try to fix it later (the fix may not be even
>>> included in 3.0.x as it might bring behavioral change).
>>>
>>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> It's great to hear that you are cleaning up this long-standing mess.
>>>> Please let me know if you hit any problems that I can help with.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>>> nicholas.chammas@gmail.com> wrote:
>>>>
>>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>>> BY transformList.
>>>>>>
>>>>>
>>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>>> creating partitioned tables
>>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Feel free to open another issue; I just used that one since it describes
this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> UPDATE: Sorry I just missed the PR (
> https://github.com/apache/spark/pull/28026). I still think it'd be nice
> to avoid recycling the JIRA issue which was resolved before. Shall we have
> a new JIRA issue with linking to SPARK-30098, and set proper priority?
>
> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> Would it be better to prioritize this to make sure the change is included
>> in Spark 3.0? (Maybe filing an issue and set as a blocker)
>>
>> Looks like there's consensus that SPARK-30098 brought ambiguous issue
>> which should be fixed (though the consideration of severity seems to be
>> different), and once we notice the issue it would be really odd if we
>> publish it as it is, and try to fix it later (the fix may not be even
>> included in 3.0.x as it might bring behavioral change).
>>
>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Hi Ryan,
>>>
>>> It's great to hear that you are cleaning up this long-standing mess.
>>> Please let me know if you hit any problems that I can help with.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>> BY transformList.
>>>>>
>>>>
>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>> creating partitioned tables
>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
UPDATE: Sorry, I just missed the PR (
https://github.com/apache/spark/pull/28026). I still think it'd be nice to
avoid recycling a JIRA issue that was resolved before. Shall we file a new
JIRA issue linking to SPARK-30098, and set a proper priority?

On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <ka...@gmail.com>
wrote:

> Would it be better to prioritize this to make sure the change is included
> in Spark 3.0? (Maybe filing an issue and set as a blocker)
>
> Looks like there's consensus that SPARK-30098 brought ambiguous issue
> which should be fixed (though the consideration of severity seems to be
> different), and once we notice the issue it would be really odd if we
> publish it as it is, and try to fix it later (the fix may not be even
> included in 3.0.x as it might bring behavioral change).
>
> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> It's great to hear that you are cleaning up this long-standing mess.
>> Please let me know if you hit any problems that I can help with.
>>
>> Thanks,
>> Wenchen
>>
>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>> BY transformList.
>>>>
>>>
>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>> TABLE syntax, we can also update Catalog.createTable() to support
>>> creating partitioned tables
>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>
>>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Would it be better to prioritize this to make sure the change is included
in Spark 3.0? (Maybe by filing an issue and setting it as a blocker.)

Looks like there's consensus that SPARK-30098 brought an ambiguity issue
that should be fixed (though opinions on its severity seem to differ), and
now that we've noticed the issue it would be really odd to publish it as it
is and try to fix it later (the fix may not even be included in 3.0.x as it
might bring a behavioral change).

On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan <cl...@gmail.com> wrote:

> Hi Ryan,
>
> It's great to hear that you are cleaning up this long-standing mess.
> Please let me know if you hit any problems that I can help with.
>
> Thanks,
> Wenchen
>
> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
>>> syntax. Just make sure it doesn't appear together with PARTITIONED BY
>>> transformList.
>>>
>>
>> Another side note: Perhaps as part of (or after) unifying the CREATE
>> TABLE syntax, we can also update Catalog.createTable() to support
>> creating partitioned tables
>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
Hi Ryan,

It's great to hear that you are cleaning up this long-standing mess. Please
let me know if you hit any problems that I can help with.

Thanks,
Wenchen

On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <ni...@gmail.com>
wrote:

> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
>> syntax. Just make sure it doesn't appear together with PARTITIONED BY
>> transformList.
>>
>
> Another side note: Perhaps as part of (or after) unifying the CREATE TABLE
> syntax, we can also update Catalog.createTable() to support creating
> partitioned tables <https://issues.apache.org/jira/browse/SPARK-31001>.
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Nicholas Chammas <ni...@gmail.com>.
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan <cl...@gmail.com> wrote:

> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
> syntax. Just make sure it doesn't appear together with PARTITIONED BY
> transformList.
>

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE
syntax, we can also update Catalog.createTable() to support creating
partitioned tables <https://issues.apache.org/jira/browse/SPARK-31001>.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
Big +1 to have one single unified CREATE TABLE syntax.

In general, we can say there are two ways to specify the table provider:
the USING clause and the ROW FORMAT/STORED AS clauses. These two ways are
mutually exclusive. If neither is specified, it implicitly indicates USING
defaultSource.

I'm fine with a few special cases which can indicate the table provider,
like EXTERNAL indicating a Hive SerDe table. A few thoughts:
1. SKEWED BY ...: We support it in Hive syntax just to fail it with a nice
error message. We can support it in the unified syntax as well, and fail it.
2. PARTITIONED BY colTypeList: I think we can support it in the unified
syntax. Just make sure it doesn't appear together with PARTITIONED BY
transformList.
3. OPTIONS: We can either map it to Hive Serde properties, or let it
indicate non-Hive tables.
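The mutual-exclusion rules above could be sketched as a small validator over the set of clauses the parser saw. This is a hypothetical sketch (names invented here); in Spark the real checks would live in the grammar and AstBuilder:

```python
def validate_create_table(clauses):
    """Toy validation of a unified CREATE TABLE clause set.

    `clauses` is a set of clause names seen by the parser. Enforces the
    rules sketched above: USING vs ROW FORMAT/STORED AS are mutually
    exclusive, the two PARTITIONED BY forms cannot co-exist, and
    SKEWED BY is parsed only to fail with a clear message.
    Returns which provider path the statement implies.
    """
    hive_provider = {"ROW FORMAT", "STORED AS"} & clauses
    if "USING" in clauses and hive_provider:
        raise ValueError(
            "USING cannot be combined with " + ", ".join(sorted(hive_provider)))
    if {"PARTITIONED BY (cols)", "PARTITIONED BY (col types)"} <= clauses:
        raise ValueError("only one PARTITIONED BY form is allowed")
    if "SKEWED BY" in clauses:
        raise ValueError("SKEWED BY is not supported")
    return "hive" if hive_provider else "datasource"

assert validate_create_table({"USING"}) == "datasource"
assert validate_create_table({"STORED AS"}) == "hive"
```

A single rule plus explicit validation like this gives deterministic behavior and a readable error, instead of silently falling through to whichever parser rule happens to match first.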

On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
> TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes it
> may add the confusion.
>
> Ryan, thanks for the detailed analysis and proposal. That's what I would
> like to see in discussion thread.
>
> I'm open to solutions which enable end users to specify their intention
> properly - my main concern of SPARK-30098 is that it becomes unclear which
> provider the query will use in create table unless USING provider is
> explicitly specified. If the new proposal makes clear on this, that should
> be better than now.
>
> Replying inline:
>
> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> Side comment: The current docs for CREATE TABLE
>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>> add to the confusion by describing the Hive-compatible command as "CREATE
>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>> actually part of the syntax
>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
>> .
>>
>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Jungtaek, it sounds like you consider the two rules to be separate
>>> syntaxes with their own consistency rules. For example, if I am using the
>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>> columns and requires types for those columns; if I’m using the Spark syntax
>>> rule with USING then PARTITIONED BY must reference existing columns and
>>> cannot include types.
>>>
>>> I agree that this is confusing to users! We should fix it, but I don’t
>>> think the right solution is to continue to have two rules with divergent
>>> syntax.
>>>
>>> This is confusing to users because they don’t know anything about
>>> separate parser rules. All the user sees is that sometimes PARTITION BY
>>> requires types and sometimes it doesn’t. Yes, we could add a keyword,
>>> HIVE, to signal that the syntax is borrowed from Hive for that case,
>>> but that actually breaks queries that run in Hive.
>>>
> That might matter less, because SPARK-30098 (and I guess your proposal as
> well) forces end users to add "USING HIVE" to their queries to enable the
> Hive provider in any case, even when the query matches rule 1
> (conditionally). Once they decide to create a Hive table, the query might
> have to be changed, or they have to change the default provider, or they
> have to enable the legacy config.
>
>
>> I think the right solution is to unify the two syntaxes. I don’t think
>>> they are so different that it isn’t possible. Here are the differences I
>>> see:
>>>
>>>    - Only in Hive:
>>>       - EXTERNAL
>>>       - skewSpec: SKEWED BY ...
>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>       - createFileFormat: STORED AS ...
>>>    - Only in Spark:
>>>       - OPTIONS property list
>>>    - Different syntax/interpretation:
>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>
> ":" after the column name is another thing only supported in Hive, though
> it's relatively minor to support in the unified syntax.
>
>>
>>>    -
>>>
>>> For the clauses that are supported in one but not the other, we can add
>>> them to a unified rule as optional clauses. The AST builder would then
>>> validate what makes sense or not (e.g., stored as with using or row format
>>> delimited) and finally pass the remaining data on using the
>>> CreateTableStatement. That statement would be handled like we do for
>>> the Spark rule today, but with extra metadata to pass along. This is also a
>>> step toward being able to put Hive behind the DSv2 API because we’d be able
>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>
>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>> in the AST builder to normalize to a single representation.
>>>
> The proposal looks promising - it may add some complexity, but it sounds
> worth adding.
>
> One thing to make clear - in the unified syntax we only rely on the
> explicit provider or the default provider, right? I would be concerned if
> the proposal automatically used the Hive provider whenever Hive-specific
> clauses are used. Yes, as I said earlier, it may require end users'
> queries to be changed, but that's better than uncertainty.
>
> Btw, if the main purpose of adding the native syntax and making it the
> default is to discontinue supporting the Hive create table rule sooner,
> simply dropping rule 2 while providing a legacy config is still a valid
> option, I think.
>
>
>> What do you both think? This would make the behavior more clear and take
>>> a step toward getting rid of Hive-specific code.
>>>
>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> I'm trying to understand why you have been suggesting to keep the
>>>> actual behavior unchanged but change the doc instead. Could you please
>>>> elaborate? End users would blame us when they hit a case where their
>>>> query doesn't work as intended (1), find that it's undocumented (2),
>>>> and find it hard to understand even from the Spark codebase (3).
>>>>
>>>> For me, addressing the root issue while adopting your suggestion would
>>>> be "dropping rule 2" and only supporting it with the legacy config on.
>>>> We would tell end users: you need to enable the legacy config to use
>>>> the Hive create table syntax, or just use beeline connected to Hive.
>>>>
>>>> But since we are even thinking about treating the native syntax as
>>>> first class and dropping the Hive one implicitly (hiding it in the
>>>> doc) or explicitly, does it really matter if we require a marker (like
>>>> "HIVE") in rule 2 and isolate it? It would cause even less confusion
>>>> than Spark 2.x, since we would require end users to add the marker
>>>> when creating a Hive table, which is easier to classify than "USING
>>>> provider".
>>>>
>>>> If we think the native syntax would cover most cases where end users
>>>> have been creating Hive tables in Spark (say, USING hive would simply
>>>> work for them), I'm OK with dropping rule 2 and leading end users to
>>>> enable the legacy config if really needed. If not, let's continue
>>>> "fixing" the issue.
>>>>
>>>> (Another valid approach would be consolidating two rules into one, and
>>>> defining support of parameters per provider, e.g. EXTERNAL, STORED AS, ROW
>>>> FORMAT, etc. are only supported in Hive provider.)
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing
>>>>> many users. Shall we only document the native syntax? Then users
>>>>> don't need to worry about which rule their query fits, and they don't
>>>>> need to spend a lot of time understanding the subtle difference
>>>>> between these 2 syntaxes.
>>>>>
>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> A small correction: the example I provided for "vice versa" is not
>>>>>> really a vice versa case. It's actually the same case (intended to
>>>>>> use rule 2, which is not the default) but with a different result.
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> My concern is that although we think of the change as simply
>>>>>>> marking "USING provider" optional in rule 1, in reality the change
>>>>>>> most likely swaps the default rule for CREATE TABLE, which was
>>>>>>> "rule 2" and is now "rule 1" (it would be the happy case in the
>>>>>>> migration doc if the swap happened as intended), and there are
>>>>>>> still a couple of things which make a query fall into rule 2, which
>>>>>>> is non-trivial to reason about and not easy to explain.
>>>>>>>
>>>>>>> I only mentioned ROW FORMAT and STORED AS as cases that fall into
>>>>>>> rule 2 to simplify the problem statement, but they're not the only
>>>>>>> ones - using col_name1:col_type1 would make the query fall into
>>>>>>> rule 2 regardless of any other properties.
>>>>>>>
>>>>>>> Why does this matter? In Spark 2.x, if end users wanted to use
>>>>>>> rule 1 (which was not the default) and specify the parameters only
>>>>>>> available in rule 1, the query clearly required "USING provider" -
>>>>>>> the parser would throw an error on any mistake in the parameters.
>>>>>>> End users could clearly state which rule the query should be bound
>>>>>>> to. If the query failed to bind to the rule as intended, it was
>>>>>>> simply denied.
>>>>>>>
>>>>>>> In Spark 3.x, the parser may not help catch the case where end
>>>>>>> users intend to use rule 2 (which is not the default) but make a
>>>>>>> mistake in specifying parameters - the query could just be
>>>>>>> "silently" bound to rule 1 and may even execute without any error.
>>>>>>> The reverse also happens, but in an odd way - e.g. CREATE EXTERNAL
>>>>>>> TABLE ... LOCATION fails with the weird message that CREATE
>>>>>>> EXTERNAL TABLE is not supported.
>>>>>>>
>>>>>>> It's deterministic for end users only if they fully understand the
>>>>>>> difference between the two and also understand how Spark applies
>>>>>>> the rules to make a query fall into one of them. I'm not sure how
>>>>>>> improving only the documentation can make things clear, but if the
>>>>>>> approach is to explain the difference between the rules and give
>>>>>>> tips on binding a query to a specific rule, the same could be
>>>>>>> applied to the parser rules themselves to address the root cause.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits
>>>>>>>> both native and Hive syntax. I'm fine with some changes to make it less
>>>>>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>>>>>> is false.
>>>>>>>>
>>>>>>>> I still don't get your point about what the real problem is for
>>>>>>>> end users. There is no ambiguity since the behavior is
>>>>>>>> deterministic, although we rely on optional fields and rule order,
>>>>>>>> which is bad. It's hard to document, but I don't think that's a
>>>>>>>> big problem for end users.
>>>>>>>>
>>>>>>>> For the legacy config, it does make the implementation more
>>>>>>>> complicated, but it's invisible to most end users (we don't document it)
>>>>>>>> and can be super useful to some users who want the queries to keep working
>>>>>>>> in 3.0 without rewriting.
>>>>>>>>
>>>>>>>> If your only concern is documentation, I totally agree that we
>>>>>>>> should improve it.
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>
>>>>>>>>> I agree with you that it's good for Spark to promote its own
>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>>>>>>> TABLE syntax unchanged - it's described as a "convenience", but
>>>>>>>>> I'm not sure I can agree with that.
>>>>>>>>>
>>>>>>>>> I'll quote my comments in SPARK-31136 here again to make the
>>>>>>>>> problem statement be clearer:
>>>>>>>>>
>>>>>>>>> I think the parser implementation around CREATE TABLE brings
>>>>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>>>>>>> before because we forced users to specify USING provider if it's
>>>>>>>>> not a Hive table. Now it's either the default provider or Hive
>>>>>>>>> depending on which options are provided, which seems non-trivial
>>>>>>>>> to reason about. (End users would never know, as it comes
>>>>>>>>> entirely from the parser rules.)
>>>>>>>>>
>>>>>>>>> I see this as an instance of the "not breaking old behavior"
>>>>>>>>> issue. The parser rules get quite complicated due to supporting
>>>>>>>>> the legacy config. Never breaking anything will leave us stuck
>>>>>>>>> eventually.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>>>
>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use
>>>>>>>>> the value of spark.sql.sources.default as its provider. In Spark version
>>>>>>>>> 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you
>>>>>>>>> can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided, and we
>>>>>>>>> didn't describe anything for this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>>>
>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1
>>>>>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>>>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>>>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>>>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>>>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>>>>>>> AS select_statement ]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>>>
>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>>>>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>>>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>>>>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>>>>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>>>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At least we should describe that the parser will try to match
>>>>>>>>> the first case (create table ~ using data source) and fall back
>>>>>>>>> to the second case; even if we describe this, it's not intuitive
>>>>>>>>> to reason about which rule a DDL query will fall into. As I
>>>>>>>>> commented earlier, "ROW FORMAT" and "STORED AS" are the options
>>>>>>>>> which make a DDL query fall into the second case, but they're
>>>>>>>>> described as "optional", so it's hard to catch the gotcha.
>>>>>>>>>
>>>>>>>>> Furthermore, while we document the syntax as above, in reality
>>>>>>>>> we allow "EXTERNAL" in the first rule (and throw an error), which
>>>>>>>>> ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>>>>>>> LOCATION". Even if we add "USING hive", the parser will throw an
>>>>>>>>> error; it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Simply put, do we really expect end users to stop and match
>>>>>>>>> their query against the parser rules (or their order, when we
>>>>>>>>> explain it in the doc) by themselves to understand which provider
>>>>>>>>> the table will use? I'm sorry, but I think we are making a bad
>>>>>>>>> assumption about end users, which is a serious problem.
>>>>>>>>>
>>>>>>>>> If we really want to promote Spark's own syntax for CREATE
>>>>>>>>> TABLE, then would it really hurt to treat the Hive CREATE TABLE
>>>>>>>>> as the "exceptional" one and isolate the two from each other?
>>>>>>>>> What's the point of providing a legacy config to go back to the
>>>>>>>>> old behavior if we fear breaking something to make it better or
>>>>>>>>> clearer? We do think the table provider is important (hence the
>>>>>>>>> change was made), so is it still a trivial problem that the
>>>>>>>>> provider is affected by specifying "optional" fields?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules
>>>>>>>>>> were mutually exclusive because the native syntax required the
>>>>>>>>>> USING clause while the Hive syntax made the ROW FORMAT and
>>>>>>>>>> STORED AS clauses optional.
>>>>>>>>>>
>>>>>>>>>> It's a good move to make the USING clause optional, which makes
>>>>>>>>>> it easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>>>>>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>>>>>>>> serious problem here. If a user just writes CREATE TABLE without USING
>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter what table we create? Internally
>>>>>>>>>> the parser rules conflict and we pick the native syntax depending on the
>>>>>>>>>> rule order. But the user-facing behavior looks fine.
>>>>>>>>>>
>>>>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi devs,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to initiate discussion and hear the voices for
>>>>>>>>>>> resolving ambiguous parser rule between two "create table"s being brought
>>>>>>>>>>> by SPARK-30098 [1].
>>>>>>>>>>>
>>>>>>>>>>> Previously, the "create table" parser rules were clearly
>>>>>>>>>>> distinguished via "USING provider", which was very intuitive
>>>>>>>>>>> and deterministic: a DDL query creates a "Hive" table unless
>>>>>>>>>>> "USING provider" is specified.
>>>>>>>>>>> (Please refer to the parser rule in branch-2.4 [2].)
>>>>>>>>>>>
>>>>>>>>>>> After SPARK-30098, the "create table" parser rules became
>>>>>>>>>>> ambiguous (please refer to the parser rule in branch-3.0 [3]) -
>>>>>>>>>>> the only factors differentiating the two rules are "ROW FORMAT"
>>>>>>>>>>> and "STORED AS", which are both defined as "optional". Now it
>>>>>>>>>>> relies on the "order" of the parser rules, which end users have
>>>>>>>>>>> no way to reason about, and is very unintuitive.
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to
>>>>>>>>>>> the first rule to provide a better message) brought more
>>>>>>>>>>> confusion (I've described a broken existing query in
>>>>>>>>>>> SPARK-30436 [4]).
>>>>>>>>>>>
>>>>>>>>>>> Personally, I'd like to see the two rules be mutually
>>>>>>>>>>> exclusive, instead of trying to document the difference and
>>>>>>>>>>> telling end users to be careful about their queries. I see two
>>>>>>>>>>> ways to make the rules mutually exclusive:
>>>>>>>>>>>
>>>>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE
>>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>>
>>>>>>>>>>> pros. This is the simplest way to distinguish between the two
>>>>>>>>>>> rules.
>>>>>>>>>>> cons. This would force end users to change their query if they
>>>>>>>>>>> intend to create a Hive table. (Given we will also provide a
>>>>>>>>>>> legacy option, I feel this is acceptable.)
>>>>>>>>>>>
>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>>>
>>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>>> cons. Less intuitive, because they have been optional and would
>>>>>>>>>>> now become mandatory to fall into the second rule.
>>>>>>>>>>>
>>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>
>>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE
>>>>>>>>>>> TABLE syntax
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>>> 2.
>>>>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> 3.
>>>>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

>>>>>>>> change was done), then is it still a trivial problem whether the provider
>>>>>>>> is affected by specifying the "optional" fields?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules are
>>>>>>>>> mutually exclusive because the native syntax requires the USING clause
>>>>>>>>> while the Hive syntax makes ROW FORMAT or STORED AS clause optional.
>>>>>>>>>
>>>>>>>>> It's a good move to make the USING clause optional, which makes it
>>>>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>>>
>>>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>>>>
>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi devs,
>>>>>>>>>>
>>>>>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>>>>>> SPARK-30098 [1].
>>>>>>>>>>
>>>>>>>>>> Previously, "create table" parser rules were clearly
>>>>>>>>>> distinguished via "USING provider", which was very intuitive and
>>>>>>>>>> deterministic. Say, DDL query creates "Hive" table unless "USING provider"
>>>>>>>>>> is specified,
>>>>>>>>>> (Please refer the parser rule in branch-2.4 [2])
>>>>>>>>>>
>>>>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>>>>>> (please refer the parser rule in branch-3.0 [3]) - the factors
>>>>>>>>>> differentiating two rules are only "ROW FORMAT" and "STORED AS" which are
>>>>>>>>>> all defined as "optional". Now it relies on the "order" of parser rule
>>>>>>>>>> which end users would have no idea to reason about, and very unintuitive.
>>>>>>>>>>
>>>>>>>>>> Furthermore, undocumented rule of EXTERNAL (added in the first
>>>>>>>>>> rule to provide better message) brought more confusion (I've described the
>>>>>>>>>> broken existing query via SPARK-30436 [4]).
>>>>>>>>>>
>>>>>>>>>> Personally I'd like to see two rules mutually exclusive, instead
>>>>>>>>>> of trying to document the difference and talk end users to be careful about
>>>>>>>>>> their query. I'm seeing two ways to make rules be mutually exclusive:
>>>>>>>>>>
>>>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE
>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>
>>>>>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>>>>>> cons. This would lead end users to change their query if they
>>>>>>>>>> intend to create Hive table. (Given we will also provide legacy option I'm
>>>>>>>>>> feeling this is acceptable.)
>>>>>>>>>>
>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory one.
>>>>>>>>>>
>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>> cons. Less intuitive, because they have been optional and now
>>>>>>>>>> become mandatory to fall into the second rule.
>>>>>>>>>>
>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>
>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE
>>>>>>>>>> TABLE syntax
>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>> 2.
>>>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>> 3.
>>>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>>>
>>>>>>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Nicholas Chammas <ni...@gmail.com>.
Side comment: The current docs for CREATE TABLE
<https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
add to the confusion by describing the Hive-compatible command as "CREATE
TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is
actually part of the syntax
<https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
.

On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Jungtaek, it sounds like you consider the two rules to be separate
> syntaxes with their own consistency rules. For example, if I am using the
> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
> columns and requires types for those columns; if I’m using the Spark syntax
> rule with USING then PARTITIONED BY must reference existing columns and
> cannot include types.
>
> I agree that this is confusing to users! We should fix it, but I don’t
> think the right solution is to continue to have two rules with divergent
> syntax.
>
> This is confusing to users because they don’t know anything about separate
> parser rules. All the user sees is that sometimes PARTITIONED BY requires
> types and sometimes it doesn’t. Yes, we could add a keyword, HIVE, to
> signal that the syntax is borrowed from Hive for that case, but that
> actually breaks queries that run in Hive.
>
> I think the right solution is to unify the two syntaxes. I don’t think
> they are so different that it isn’t possible. Here are the differences I
> see:
>
>    - Only in Hive:
>       - EXTERNAL
>       - skewSpec: SKEWED BY ...
>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>       - createFileFormat: STORED AS ...
>    - Only in Spark:
>       - OPTIONS property list
>    - Different syntax/interpretation:
>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>
> For the clauses that are supported in one but not the other, we can add
> them to a unified rule as optional clauses. The AST builder would then
> validate what makes sense or not (e.g., stored as with using or row format
> delimited) and finally pass the remaining data on using the
> CreateTableStatement. That statement would be handled like we do for the
> Spark rule today, but with extra metadata to pass along. This is also a
> step toward being able to put Hive behind the DSv2 API because we’d be able
> to pass all of the Hive metadata clauses to the v2 catalog.
>
> The only difficult part is handling PARTITIONED BY. But in that case, we
> can use two different syntaxes from the same CREATE TABLE rule. If types
> are included, we use the Hive PARTITIONED BY syntax and convert in the
> AST builder to normalize to a single representation.
>
> What do you both think? This would make the behavior clearer and take a
> step toward getting rid of Hive-specific code.
>
> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> I'm trying to understand why you have been suggesting to keep the
>> actual behavior unchanged and only change the docs. Could you please
>> elaborate? End users will blame us when they hit a case where their query
>> doesn't work as intended (1), find that the behavior is undocumented (2),
>> and find it hard to understand even from the Spark codebase (3).
>>
>> For me, addressing the root issue adopting your suggestion would be
>> "dropping the rule 2" and only supporting it with legacy config on. We
>> would say to end users, you need to enable the legacy config to leverage
>> Hive create table syntax, or just use beeline with Hive connected.
>>
>> But since we are treating the native syntax as first class and dropping
>> the Hive one either implicitly (hiding it in the docs) or explicitly,
>> would it really matter to require a marker (like "HIVE") in rule 2 and
>> isolate it? That would cause even less confusion than Spark 2.x, since
>> end users would have to spell out the marker when creating a Hive table,
>> which is easier to classify than "USING provider".
>>
>> If we think the native syntax covers most cases where end users have
>> been creating Hive tables in Spark (say, USING hive would simply work for
>> them), I'm OK to drop rule 2 and lead end users to enable the legacy
>> config if really needed. If not, let's continue "fixing" the issue.
>>
>> (Another valid approach would be consolidating two rules into one, and
>> defining support of parameters per provider, e.g. EXTERNAL, STORED AS, ROW
>> FORMAT, etc. are only supported in Hive provider.)
>>
>>
>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing many
>>> users. Shall we only document the native syntax? Then users don't need to
>>> worry about which rule their query fits and they don't need to spend a lot
>>> of time understanding the subtle difference between these 2 syntaxes.
>>>
>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> A small correction: the example I provided is not really a case of the
>>>> vice versa. It's actually the same case (intending to use rule 2, which
>>>> is not the default) but with a different result.
>>>>
>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> My concern is that while we think of the change as simply making
>>>>> "USING provider" optional in rule 1, in reality the change most likely
>>>>> swaps the default rule for CREATE TABLE, which was "rule 2" and is now
>>>>> "rule 1" (the happy case described in the migration doc, if the swap
>>>>> happens as intended), and there are still a couple of things which make
>>>>> a query fall into rule 2 anyway, which is non-trivial to reason about
>>>>> and not easy to explain.
>>>>>
>>>>> I only mentioned ROW FORMAT and STORED AS as causes of falling into
>>>>> rule 2 to simplify the problem statement, but they're not the only ones
>>>>> - using col_name1:col_type1 would make the query fall into rule 2
>>>>> regardless of any other properties.
>>>>>
>>>>> Why does this matter? In Spark 2.x, if end users want to use rule 1
>>>>> (which is not the default) and specify parameters which are only
>>>>> available in rule 1, the query clearly requires "USING provider" - the
>>>>> parser will throw an error if there is any mistake in the parameters.
>>>>> End users could state clearly which rule the query should bind to. If
>>>>> the query fails to bind to the rule as intended, it is simply rejected.
>>>>>
>>>>> In Spark 3.x, the parser may not help catch the case where end users
>>>>> intend to use rule 2 (which is not the default) but make a mistake
>>>>> specifying parameters - the query could just "silently" bind to rule 1
>>>>> and may even execute without any error. The reverse also happens, but
>>>>> in an odd way - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the
>>>>> weird message that CREATE EXTERNAL TABLE is not supported.
>>>>>
>>>>> It's deterministic for end users only if they fully understand the
>>>>> difference between the two rules and also understand how Spark applies
>>>>> the rules to bind a query to one of them. I'm not sure documentation
>>>>> alone can make things clear, but if the approach is to explain the
>>>>> difference between the rules and give tips for binding a query to a
>>>>> specific rule, the same could be applied to the parser rules themselves
>>>>> to address the root cause.
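[To make the rule-binding behavior described above concrete, here is a toy
sketch - not Spark's actual parser, where the resolution happens via rule
order in the ANTLR grammar (SqlBase.g4); clause detection is naively
simplified to substring checks - of how the same CREATE TABLE statement
binds in 2.4 versus 3.0:]

```python
# Toy model of which CREATE TABLE rule a statement binds to.
# Illustration only: the real decision is made by parser rule order
# in Spark's ANTLR grammar, not by keyword scanning like this.

def bound_rule(ddl: str, spark3: bool) -> str:
    s = " " + ddl.upper() + " "
    has_using = " USING " in s
    hive_only = " ROW FORMAT " in s or " STORED AS " in s
    if not spark3:
        # 2.4: the native rule requires USING; everything else is Hive.
        return "native" if has_using else "hive"
    # 3.0: the native rule matches first; only Hive-specific clauses
    # push the statement into the Hive rule.
    return "hive" if hive_only and not has_using else "native"

ddl = "CREATE TABLE t (a INT)"
print(bound_rule(ddl, spark3=False))  # hive
print(bound_rule(ddl, spark3=True))   # native - the provider silently changed
```

[The same statement silently changes provider between versions, which is
exactly the reasoning burden being discussed.]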
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
>>>>>> native and Hive syntax. I'm fine with some changes to make it less
>>>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>>>> is false.
>>>>>>
>>>>>> I still don't get your point about what the real problem is for end
>>>>>> users. There is no ambiguity as the behavior is deterministic, although
>>>>>> we rely on optional fields and rule order, which is bad. It's hard to
>>>>>> document, but I don't think that's a big problem for end users.
>>>>>>
>>>>>> For the legacy config, it does make the implementation more
>>>>>> complicated, but it's invisible to most end users (we don't document it)
>>>>>> and can be super useful to some users who want the queries to keep working
>>>>>> in 3.0 without rewriting.
>>>>>>
>>>>>> If your only concern is documentation, I totally agree that we should
>>>>>> improve it.
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing your view.
>>>>>>>
>>>>>>> I agree with you that it's good for Spark to promote its own CREATE
>>>>>>> TABLE syntax. The thing is, we still leave the Hive CREATE TABLE
>>>>>>> syntax unchanged - it's described as a "convenience", but I'm not sure
>>>>>>> I can agree with that.
>>>>>>>
>>>>>>> I'll quote my comments from SPARK-31136 here again to make the
>>>>>>> problem statement clearer:
>>>>>>>
>>>>>>> I think the parser implementation around CREATE TABLE brings
>>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous before
>>>>>>> because we forced users to specify USING provider if it's not a Hive
>>>>>>> table. Now it's either the default provider or Hive depending on which
>>>>>>> options are provided, which is non-trivial to reason about. (End users
>>>>>>> would never know, as it's driven entirely by the parser rules.)
>>>>>>>
>>>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>>>> rules get quite complicated due to supporting the legacy config. Never
>>>>>>> breaking anything will make us stuck eventually.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>
>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use
>>>>>>> the value of spark.sql.sources.default as its provider. In Spark version
>>>>>>> 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you
>>>>>>> can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>
>>>>>>>
>>>>>>> This is not true if "ROW FORMAT" / "STORED AS" is provided, and we
>>>>>>> don't describe that anywhere.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>
>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1
>>>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>>>>> AS select_statement ]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>
>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>>>>
>>>>>>>
>>>>>>> At least we should describe that the parser will try to match the
>>>>>>> first case (create table ~ using data source), and fall back to the
>>>>>>> second; even if we document this, it's not intuitive to reason about
>>>>>>> which rule a DDL query will fall into. As I commented earlier, "ROW
>>>>>>> FORMAT" and "STORED AS" are the options which make a DDL query fall
>>>>>>> into the second case, but they're described as "optional" so it's hard
>>>>>>> to catch the gotcha.
>>>>>>>
>>>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>>>> allow "EXTERNAL" in the first rule (and then throw an error), which
>>>>>>> ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>>>>> LOCATION". Even if we add "USING hive", the parser will throw an
>>>>>>> error; it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>
>>>>>>>
>>>>>>> Simply put, do we really think end users should stop and try to
>>>>>>> match their query against the parser rules (or their order, when we
>>>>>>> explain it in the doc) by themselves to understand which provider the
>>>>>>> table will use? I'm sorry, but I think we are making a bad assumption
>>>>>>> about end users, which is a serious problem.
>>>>>>>
>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would
>>>>>>> it really matter to treat the Hive CREATE TABLE as the "exceptional"
>>>>>>> one and isolate the two from each other? What's the point of providing
>>>>>>> a legacy config to go back to the old behavior if we fear breaking
>>>>>>> things so much that we can't make it better or clearer? If we do think
>>>>>>> the table provider is important (hence the change was made), is it
>>>>>>> still a trivial problem that the provider is affected by specifying
>>>>>>> "optional" fields?
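[The EXTERNAL gotcha above can be sketched as a toy model - hypothetical
code, not the real grammar: without a Hive-only clause the statement binds
to the first (native) rule, which accepts the EXTERNAL keyword only to
reject it, so a query that was valid Hive syntax in 2.4 now fails:]

```python
# Toy model of the SPARK-30436 breakage: without ROW FORMAT / STORED AS
# the statement binds to the first (native) rule, which accepts the
# EXTERNAL keyword syntactically and then throws, so it never reaches
# the Hive rule that handled it in 2.4.

def parse(ddl: str) -> str:
    s = " " + ddl.upper() + " "
    hive_only = " ROW FORMAT " in s or " STORED AS " in s
    if not hive_only:
        # statement binds to the native rule
        if s.lstrip().startswith("CREATE EXTERNAL TABLE"):
            raise ValueError("Operation not allowed: CREATE EXTERNAL TABLE")
        return "native table"
    return "hive table"

print(parse("CREATE EXTERNAL TABLE t (a INT) STORED AS PARQUET"))  # hive table
try:
    parse("CREATE EXTERNAL TABLE t (a INT) LOCATION '/tmp/t'")     # valid in 2.4
except ValueError as e:
    print(e)  # fails in 3.0 with a confusing error
```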
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules were
>>>>>>>> mutually exclusive because the native syntax required the USING clause
>>>>>>>> while the Hive syntax made the ROW FORMAT and STORED AS clauses optional.
>>>>>>>>
>>>>>>>> It's a good move to make the USING clause optional, which makes it
>>>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>>
>>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>>>
>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi devs,
>>>>>>>>>
>>>>>>>>> I'd like to initiate a discussion and hear your voices on resolving
>>>>>>>>> the ambiguous parser rules for the two "create table"s introduced by
>>>>>>>>> SPARK-30098 [1].
>>>>>>>>>
>>>>>>>>> Previously, the two "create table" parser rules were clearly
>>>>>>>>> distinguished via "USING provider", which was intuitive and
>>>>>>>>> deterministic: a DDL query creates a "Hive" table unless "USING
>>>>>>>>> provider" is specified. (Please refer to the parser rules in
>>>>>>>>> branch-2.4 [2].)
>>>>>>>>>
>>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous
>>>>>>>>> (please refer to the parser rules in branch-3.0 [3]) - the only
>>>>>>>>> factors differentiating the two rules are "ROW FORMAT" and "STORED
>>>>>>>>> AS", which are both defined as "optional". The outcome now relies on
>>>>>>>>> the "order" of the parser rules, which end users have no way to
>>>>>>>>> reason about, and is very unintuitive.
>>>>>>>>>
>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>>>>>> first rule to provide a better error message) brought more confusion
>>>>>>>>> (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>>>
>>>>>>>>> Personally I'd like to see the two rules made mutually exclusive,
>>>>>>>>> instead of trying to document the difference and asking end users to
>>>>>>>>> be careful about their queries. I see two ways to make the rules
>>>>>>>>> mutually exclusive:
>>>>>>>>>
>>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>>>>> "HIVE" TABLE ...`.
>>>>>>>>>
>>>>>>>>> pros. This is the simplest way to distinguish between the two rules.
>>>>>>>>> cons. This would require end users to change their query if they
>>>>>>>>> intend to create a Hive table. (Given we will also provide a legacy
>>>>>>>>> option, I feel this is acceptable.)
>>>>>>>>>
>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>
>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>> cons. Less intuitive, because these clauses have been optional and
>>>>>>>>> would now become mandatory to fall into the second rule.
>>>>>>>>>
>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>> syntax
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>> 2.
>>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 3.
>>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>>
>>>>>>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Jungtaek, it sounds like you consider the two rules to be separate syntaxes
with their own consistency rules. For example, if I am using the Hive
syntax rule, then the PARTITIONED BY clause adds new (partition) columns
and requires types for those columns; if I’m using the Spark syntax rule
with USING then PARTITIONED BY must reference existing columns and cannot
include types.

I agree that this is confusing to users! We should fix it, but I don’t
think the right solution is to continue to have two rules with divergent
syntax.

This is confusing to users because they don’t know anything about separate
parser rules. All the user sees is that sometimes PARTITIONED BY requires
types and sometimes it doesn’t. Yes, we could add a keyword, HIVE, to
signal that the syntax is borrowed from Hive for that case, but that
actually breaks queries that run in Hive.

I think the right solution is to unify the two syntaxes. I don’t think they
are so different that it isn’t possible. Here are the differences I see:

   - Only in Hive:
      - EXTERNAL
      - skewSpec: SKEWED BY ...
      - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
      - createFileFormat: STORED AS ...
   - Only in Spark:
      - OPTIONS property list
   - Different syntax/interpretation:
      - PARTITIONED BY transformList / PARTITIONED BY colTypeList

For the clauses that are supported in one but not the other, we can add
them to a unified rule as optional clauses. The AST builder would then
validate what makes sense or not (e.g., stored as with using or row format
delimited) and finally pass the remaining data on using the
CreateTableStatement. That statement would be handled like we do for the
Spark rule today, but with extra metadata to pass along. This is also a
step toward being able to put Hive behind the DSv2 API because we’d be able
to pass all of the Hive metadata clauses to the v2 catalog.
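[A rough sketch of what that unified rule could produce after parsing - the
statement shape, field names, and error wording here are hypothetical, not
Spark's actual CreateTableStatement:]

```python
# Hypothetical sketch of a single CREATE TABLE rule: every clause is
# collected into one statement, and post-parse validation rejects
# combinations that make no sense, instead of relying on rule order.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CreateTableStatement:          # simplified stand-in
    name: str
    using: Optional[str] = None      # Spark provider clause
    stored_as: Optional[str] = None  # Hive file format clause
    row_format: Optional[str] = None # Hive SerDe clause
    options: dict = field(default_factory=dict)  # Spark OPTIONS list

def validate(stmt: CreateTableStatement) -> CreateTableStatement:
    # Hive-only clauses cannot be mixed with a non-Hive provider.
    hive_clauses = stmt.stored_as is not None or stmt.row_format is not None
    if stmt.using and stmt.using.lower() != "hive" and hive_clauses:
        raise ValueError(
            f"STORED AS / ROW FORMAT cannot be combined with USING {stmt.using}")
    return stmt

validate(CreateTableStatement("t", stored_as="PARQUET"))  # ok: Hive-style
validate(CreateTableStatement("t", using="parquet"))      # ok: native
```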

The only difficult part is handling PARTITIONED BY. But in that case, we
can use two different syntaxes from the same CREATE TABLE rule. If types
are included, we use the Hive PARTITIONED BY syntax and convert in the AST
builder to normalize to a single representation.
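[The PARTITIONED BY normalization might look roughly like this - illustrative
only: Hive-style entries carry a type and introduce new columns, Spark-style
entries reference existing columns, and both collapse to one partition-column
list:]

```python
# Illustrative normalization of the two PARTITIONED BY forms into a
# single representation: (all columns, partition column names).

def normalize_partitioning(columns, partition_spec):
    cols = list(columns)   # (name, type) pairs from the column list
    part_names = []
    for entry in partition_spec:
        if isinstance(entry, tuple):   # Hive form: (name, type) adds a column
            cols.append(entry)
            part_names.append(entry[0])
        else:                          # Spark form: must be an existing column
            if entry not in [name for name, _ in cols]:
                raise ValueError(f"unknown partition column: {entry}")
            part_names.append(entry)
    return cols, part_names

# Hive style: PARTITIONED BY (p STRING) adds column p
print(normalize_partitioning([("a", "INT")], [("p", "STRING")]))
# Spark style: PARTITIONED BY (a) references existing column a
print(normalize_partitioning([("a", "INT")], ["a"]))
```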

What do you both think? This would make the behavior clearer and take a
step toward getting rid of Hive-specific code.

On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> I'm trying to understand the reason you have been suggesting to keep the
> real thing unchanged but change doc instead. Could you please elaborate
> why? End users would blame us when they hit the case their query doesn't
> work as intended (1) and found the fact it's undocumented (2) and hard to
> understand even from the Spark codebase (3).
>
> For me, addressing the root issue adopting your suggestion would be
> "dropping the rule 2" and only supporting it with legacy config on. We
> would say to end users, you need to enable the legacy config to leverage
> Hive create table syntax, or just use beeline with Hive connected.
>
> But since we are even thinking about native syntax as a first class and
> dropping Hive one implicitly (hide in doc) or explicitly, does it really
> matter we require a marker (like "HIVE") in rule 2 and isolate it? It would
> have even less confusion than Spark 2.x, since we will require end users to
> fill the marker regarding Hive when creating Hive table, easier to classify
> than "USING provider".
>
> If we think native syntax would cover many cases end users have been
> creating Hive table in Spark (say, USING hive would simply work for them),
> I'm OK to drop the rule 2 and lead end users to enable the legacy config if
> really needed. If not, let's continue "fixing" the issue.
>
> (Another valid approach would be consolidating two rules into one, and
> defining support of parameters per provider, e.g. EXTERNAL, STORED AS, ROW
> FORMAT, etc. are only supported in Hive provider.)
>
>
> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> The fact that we have 2 CREATE TABLE syntax is already confusing many
>> users. Shall we only document the native syntax? Then users don't need to
>> worry about which rule their query fits and they don't need to spend a lot
>> of time understanding the subtle difference between these 2 syntaxes.
>>
>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> A bit correction: the example I provided for vice versa is not really a
>>> correct case for vice versa. It's actually same case (intended to use rule
>>> 2 which is not default) but different result.
>>>
>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> My concern is that although we simply think about the change to
>>>> mark "USING provider" to be optional in rule 1, but in reality the change
>>>> is most likely swapping the default rule for CREATE TABLE, which was "rule
>>>> 2", and now "rule 1" (it would be the happy case of migration doc if the
>>>> swap happens as intended), and there're still couple of things which make
>>>> the query still fall into rule 2 which is non-trivial to reason about and
>>>> also not easy to explain.
>>>>
>>>> I only mentioned ROW FORMAT and STORED AS as the case to fall into rule
>>>> 2 to simplify the problem statement, but they're not only the case - using
>>>> col_name1:col_type1 would make the query fall into rule 2 regardless of any
>>>> other properties.
>>>>
>>>> What's the matter? In Spark 2.x, if end users want to use rule 1 (which
>>>> is not default) and specify the parameters which are only available in rule
>>>> 1, it clearly requires "USING provider" - parser will throw error if
>>>> there're any mistakes on specifying parameters. End users could set
>>>> intention clearly which rule the query should be bound. If the query fails
>>>> to bind the rule as intended, it's simply denied.
>>>>
>>>> In Spark 3.x, parser may not help figuring out the case where end users
>>>> intend to use rule 2 (which is not default) but some mistake on specifying
>>>> parameters - it could just "silently" bound into rule 1 and it may be even
>>>> executed without any error. Vice versa happens, but in odd way - e.g.
>>>> CREATE EXTERNAL TABLE ... LOCATION fails with weird message CREATE EXTERNAL
>>>> TABLE is not supported.
>>>>
>>>> It's deterministic for end users only if they're fully understand the
>>>> difference between the twos and also understand how Spark would apply the
>>>> rules to make the query fall into one of the rule. I'm not sure how we can
>>>> only improve documentation to make things be clear, but if the approach
>>>> would be explaining the difference of rules and guide the tip to make the
>>>> query be bound to the specific rule, the same could be applied to parser
>>>> rule to address the root cause.
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
>>>>> native and Hive syntax. I'm fine with some changes to make it less
>>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>>> is false.
>>>>>
>>>>> I still don't get your point about what's the real problem to end
>>>>> users. There is no ambiguity as the behavior is deterministic, although we
>>>>> rely on optional fields and rule order which is bad. It's hard to document,
>>>>> but I don't think that's a big problem to end users.
>>>>>
>>>>> For the legacy config, it does make the implementation more
>>>>> complicated, but it's invisible to most end users (we don't document it)
>>>>> and can be super useful to some users who want the queries to keep working
>>>>> in 3.0 without rewriting.
>>>>>
>>>>> If your only concern is documentation, I totally agree that we should
>>>>> improve it.
>>>>>
>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Thanks for sharing your view.
>>>>>>
>>>>>> I agree with you that it's good for Spark to promote Spark's own
>>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>>>> TABLE syntax unchanged - it's described as a "convenience", but I'm
>>>>>> not sure I can agree.
>>>>>>
>>>>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>>>>> statement clearer:
>>>>>>
>>>>>> I think the parser implementation around CREATE TABLE brings
>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>>>> before because we forced users to specify "USING provider" if it's
>>>>>> not a Hive table. Now it's either the default provider or Hive
>>>>>> depending on which options are provided, which seems non-trivial to
>>>>>> reason about. (End users would never know, as it comes entirely from
>>>>>> the parser rules.)
>>>>>>
>>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>>> rule gets quite complicated due to supporting the legacy config. Not
>>>>>> breaking anything would make us stuck eventually.
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>
>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use
>>>>>> the value of spark.sql.sources.default as its provider. In Spark version
>>>>>> 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you
>>>>>> can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>
>>>>>>
>>>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided, and we
>>>>>> didn't describe anything for this.
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>
>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1
>>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>>>> AS select_statement ]
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>
>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>>>
>>>>>>
>>>>>> At least we should describe that the parser will try to match the
>>>>>> first case (create table ~ using data source) and fall back to the
>>>>>> second case; even if we describe this, it's not intuitive to reason
>>>>>> about which rule a DDL query will fall into. As I commented earlier,
>>>>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query
>>>>>> fall into the second case, but they're described as "optional", so
>>>>>> it's hard to catch the gotcha.
>>>>>>
>>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>>> allow "EXTERNAL" in the first rule (and throw an error), which ends
>>>>>> up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>>>> LOCATION". Even if we add "USING hive", the parser will throw an
>>>>>> error. It now requires "ROW FORMAT" or "STORED AS".
>>>>>>
>>>>>>
>>>>>> Simply put, do we really think end users should stop and try to
>>>>>> match their query against the parser rules (or their order, when we
>>>>>> explain it in the doc) by themselves to understand which provider the
>>>>>> table will use? I'm sorry, but I think we are making a bad assumption
>>>>>> about end users, which is a serious problem.
>>>>>>
>>>>>> If we really want to promote Spark's own syntax for CREATE TABLE,
>>>>>> would it really matter to treat the Hive CREATE TABLE as the
>>>>>> "exceptional" one and isolate the two from each other? What's the
>>>>>> point of providing a legacy config to go back to the old behavior if
>>>>>> we fear breaking something even to make it better or clearer? If we
>>>>>> do think the table provider is important (hence the change was made),
>>>>>> is it still a trivial problem that the provider is affected by
>>>>>> specifying "optional" fields?
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>>>>> syntax instead of the Hive one. Previously these two rules were mutually
>>>>>>> exclusive because the native syntax required the USING clause while the
>>>>>>> Hive syntax made the ROW FORMAT or STORED AS clause optional.
>>>>>>>
>>>>>>> It's a good move to make the USING clause optional, which makes it
>>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>>>> SPARK-30098 [1].
>>>>>>>>
>>>>>>>> Previously, "create table" parser rules were clearly distinguished
>>>>>>>> via "USING provider", which was very intuitive and deterministic:
>>>>>>>> a DDL query creates a "Hive" table unless "USING provider" is
>>>>>>>> specified. (Please refer to the parser rule in branch-2.4 [2].)
>>>>>>>>
>>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>>>> (please refer to the parser rule in branch-3.0 [3]) - the only
>>>>>>>> factors differentiating the two rules are "ROW FORMAT" and "STORED
>>>>>>>> AS", both defined as "optional". Now it relies on the "order" of
>>>>>>>> the parser rules, which end users have no way to reason about, and
>>>>>>>> is very unintuitive.
>>>>>>>>
>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>>>>> first rule to provide a better message) brought more confusion
>>>>>>>> (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>>
>>>>>>>> Personally I'd like to see the two rules made mutually exclusive,
>>>>>>>> instead of trying to document the difference and telling end users
>>>>>>>> to be careful about their queries. I see two ways to make the rules
>>>>>>>> mutually exclusive:
>>>>>>>>
>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>>>> "HIVE" TABLE ...`.
>>>>>>>>
>>>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>>>> cons. This would lead end users to change their query if they
>>>>>>>> intend to create Hive table. (Given we will also provide legacy option I'm
>>>>>>>> feeling this is acceptable.)
>>>>>>>>
>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>
>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>> cons. Less intuitive, because they have been optional and now
>>>>>>>> become mandatory to fall into the second rule.
>>>>>>>>
>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> syntax
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>> 2.
>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 3.
>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>
>>>>>>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
I'm trying to understand why you have been suggesting to keep the actual
behavior unchanged but change the doc instead. Could you please elaborate?
End users would blame us when they hit a case where their query doesn't
work as intended (1), find that it's undocumented (2), and find it hard to
understand even from the Spark codebase (3).

For me, addressing the root issue while adopting your suggestion would be
dropping rule 2 and only supporting it with the legacy config on. We would
tell end users: you need to enable the legacy config to leverage the Hive
create table syntax, or just use beeline with Hive connected.
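To make that concrete, here is a sketch of what end users would see under
this proposal (the config name is taken from the migration guide quoted
below; the rejection behavior itself is hypothetical):

```sql
-- Default (spark.sql.legacy.createHiveTableByDefault.enabled = false):
-- under the proposal, a Hive-style DDL would simply be rejected.
CREATE EXTERNAL TABLE logs (id INT, msg STRING)
STORED AS PARQUET
LOCATION '/data/logs';

-- Enabling the legacy config would restore the Spark 2.4 behavior:
SET spark.sql.legacy.createHiveTableByDefault.enabled = true;
```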

But since we are even considering the native syntax as first class and
dropping the Hive one implicitly (hiding it in the doc) or explicitly, does
it really matter if we require a marker (like "HIVE") in rule 2 and isolate
it? It would cause even less confusion than Spark 2.x, since we would
require end users to add the marker when creating a Hive table, which is
easier to classify than "USING provider".
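For illustration, the marker approach could look like this (the "HIVE"
keyword here is a hypothetical syntax sketch, not something Spark accepts
today):

```sql
-- Native rule: provider resolved from spark.sql.sources.default
CREATE TABLE t1 (id INT, name STRING);

-- Hypothetical marker: the query can only bind to the Hive rule
CREATE HIVE TABLE t2 (id INT, name STRING)
STORED AS ORC
LOCATION '/warehouse/t2';
```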

If we think the native syntax would cover most cases where end users have
been creating Hive tables in Spark (say, USING hive would simply work for
them), I'm OK with dropping rule 2 and leading end users to enable the
legacy config if really needed. If not, let's continue "fixing" the issue.

(Another valid approach would be consolidating the two rules into one, and
defining supported parameters per provider, e.g. EXTERNAL, STORED AS, ROW
FORMAT, etc. are only supported by the Hive provider.)
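A rough sketch of what such a consolidated rule might accept, rejecting
Hive-only clauses for non-Hive providers (hypothetical behavior, not
current Spark):

```sql
-- Single CREATE TABLE rule; Hive-only clauses are legal only with
-- USING hive
CREATE TABLE t3 (id INT) USING hive
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- The same clauses with a non-Hive provider would be an analysis error:
-- CREATE TABLE t4 (id INT) USING parquet STORED AS TEXTFILE;  -- rejected
```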


On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cl...@gmail.com> wrote:

> The fact that we have 2 CREATE TABLE syntaxes is already confusing many
> users. Shall we only document the native syntax? Then users don't need to
> worry about which rule their query fits into, and they don't need to spend
> a lot of time understanding the subtle difference between these 2 syntaxes.
>
> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> A small correction: the example I provided for vice versa is not really
>> a correct case of vice versa. It's actually the same case (intended to
>> use rule 2, which is not the default) but with a different result.
>>
>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> My concern is that although we think of the change simply as making
>>> "USING provider" optional in rule 1, in reality the change most likely
>>> swaps the default rule for CREATE TABLE, which was "rule 2" and is now
>>> "rule 1" (it would be the happy case of the migration doc if the swap
>>> happens as intended), and there are still a couple of things which make
>>> a query fall into rule 2, which is non-trivial to reason about and also
>>> not easy to explain.
>>>
>>> I only mentioned ROW FORMAT and STORED AS as cases that fall into rule
>>> 2 to simplify the problem statement, but they're not the only ones -
>>> using col_name1:col_type1 makes a query fall into rule 2 regardless of
>>> any other properties.
>>>
>>> Why does this matter? In Spark 2.x, if end users want to use rule 1
>>> (which is not the default) and specify parameters which are only
>>> available in rule 1, it clearly requires "USING provider" - the parser
>>> will throw an error if there are any mistakes in specifying parameters.
>>> End users can clearly state their intention as to which rule the query
>>> should be bound to. If the query fails to bind to the rule as intended,
>>> it's simply denied.
>>>
>>> In Spark 3.x, the parser may not help figure out the case where end
>>> users intend to use rule 2 (which is not the default) but make some
>>> mistake in specifying parameters - the query could just be "silently"
>>> bound to rule 1 and may even be executed without any error. The reverse
>>> also happens, but in an odd way - e.g. CREATE EXTERNAL TABLE ...
>>> LOCATION fails with the weird message "CREATE EXTERNAL TABLE is not
>>> supported".
>>>
>>> It's deterministic for end users only if they fully understand the
>>> difference between the two and also understand how Spark applies the
>>> rules to make the query fall into one of them. I'm not sure how
>>> improving documentation alone can make things clear, but if the
>>> approach is to explain the difference between the rules and give tips
>>> to make a query bind to a specific rule, the same could be applied to
>>> the parser rules to address the root cause.
>>>
>>>
>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
>>>> native and Hive syntax. I'm fine with some changes to make it less
>>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>>> is false.
>>>>
>>>> I still don't get your point about what the real problem is for end
>>>> users. There is no ambiguity, as the behavior is deterministic,
>>>> although relying on optional fields and rule order is bad. It's hard
>>>> to document, but I don't think that's a big problem to end users.
>>>>
>>>> For the legacy config, it does make the implementation more
>>>> complicated, but it's invisible to most end users (we don't document it)
>>>> and can be super useful to some users who want the queries to keep working
>>>> in 3.0 without rewriting.
>>>>
>>>> If your only concern is documentation, I totally agree that we should
>>>> improve it.
>>>>
>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> Thanks for sharing your view.
>>>>>
>>>>> I agree with you that it's good for Spark to promote Spark's own
>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>>> TABLE syntax unchanged - it's described as a "convenience", but I'm
>>>>> not sure I can agree.
>>>>>
>>>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>>>> statement clearer:
>>>>>
>>>>> I think the parser implementation around CREATE TABLE brings
>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>>> before because we forced users to specify "USING provider" if it's
>>>>> not a Hive table. Now it's either the default provider or Hive
>>>>> depending on which options are provided, which seems non-trivial to
>>>>> reason about. (End users would never know, as it comes entirely from
>>>>> the parser rules.)
>>>>>
>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>> rule gets quite complicated due to supporting the legacy config. Not
>>>>> breaking anything would make us stuck eventually.
>>>>>
>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>
>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you can
>>>>> set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>
>>>>>
>>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided, and we
>>>>> didn't describe anything for this.
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>
>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1
>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>>> AS select_statement ]
>>>>>
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>
>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>>
>>>>>
>>>>> At least we should describe that the parser will try to match the
>>>>> first case (create table ~ using data source) and fall back to the
>>>>> second case; even if we describe this, it's not intuitive to reason
>>>>> about which rule a DDL query will fall into. As I commented earlier,
>>>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query
>>>>> fall into the second case, but they're described as "optional", so
>>>>> it's hard to catch the gotcha.
>>>>>
>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>> allow "EXTERNAL" in the first rule (and throw an error), which ends
>>>>> up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>>> LOCATION". Even if we add "USING hive", the parser will throw an
>>>>> error. It now requires "ROW FORMAT" or "STORED AS".
>>>>>
>>>>>
>>>>> Simply put, do we really think end users should stop and try to
>>>>> match their query against the parser rules (or their order, when we
>>>>> explain it in the doc) by themselves to understand which provider the
>>>>> table will use? I'm sorry, but I think we are making a bad assumption
>>>>> about end users, which is a serious problem.
>>>>>
>>>>> If we really want to promote Spark's own syntax for CREATE TABLE,
>>>>> would it really matter to treat the Hive CREATE TABLE as the
>>>>> "exceptional" one and isolate the two from each other? What's the
>>>>> point of providing a legacy config to go back to the old behavior if
>>>>> we fear breaking something even to make it better or clearer? If we
>>>>> do think the table provider is important (hence the change was made),
>>>>> is it still a trivial problem that the provider is affected by
>>>>> specifying "optional" fields?
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>>>> syntax instead of the Hive one. Previously these two rules were mutually
>>>>>> exclusive because the native syntax required the USING clause while the
>>>>>> Hive syntax made the ROW FORMAT or STORED AS clause optional.
>>>>>>
>>>>>> It's a good move to make the USING clause optional, which makes it
>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>>> order. But the user-facing behavior looks fine.
>>>>>>
>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in 3.0.
>>>>>> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then
>>>>>> CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>>> SPARK-30098 [1].
>>>>>>>
>>>>>>> Previously, "create table" parser rules were clearly distinguished
>>>>>>> via "USING provider", which was very intuitive and deterministic:
>>>>>>> a DDL query creates a "Hive" table unless "USING provider" is
>>>>>>> specified. (Please refer to the parser rule in branch-2.4 [2].)
>>>>>>>
>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>>> (please refer to the parser rule in branch-3.0 [3]) - the only
>>>>>>> factors differentiating the two rules are "ROW FORMAT" and "STORED
>>>>>>> AS", both defined as "optional". Now it relies on the "order" of
>>>>>>> the parser rules, which end users have no way to reason about, and
>>>>>>> is very unintuitive.
>>>>>>>
>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>>>> first rule to provide a better message) brought more confusion
>>>>>>> (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>
>>>>>>> Personally I'd like to see the two rules made mutually exclusive,
>>>>>>> instead of trying to document the difference and telling end users
>>>>>>> to be careful about their queries. I see two ways to make the rules
>>>>>>> mutually exclusive:
>>>>>>>
>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>>> "HIVE" TABLE ...`.
>>>>>>>
>>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>>> cons. This would lead end users to change their query if they intend
>>>>>>> to create Hive table. (Given we will also provide legacy option I'm feeling
>>>>>>> this is acceptable.)
>>>>>>>
>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>
>>>>>>> pros. Less invasive for existing queries.
>>>>>>> cons. Less intuitive, because they have been optional and now become
>>>>>>> mandatory to fall into the second rule.
>>>>>>>
>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>> syntax
>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>> 2.
>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>> 3.
>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>
>>>>>>>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
The fact that we have 2 CREATE TABLE syntaxes is already confusing many
users. Shall we only document the native syntax? Then users don't need to
worry about which rule their query fits into, and they don't need to spend
a lot of time understanding the subtle difference between these 2 syntaxes.
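If only the native syntax were documented, the common cases would look
like this (following the data source CREATE TABLE reference quoted earlier
in the thread; table and column names are made up for illustration):

```sql
-- Provider defaults to spark.sql.sources.default when USING is omitted
CREATE TABLE users (id INT, name STRING);

-- Explicit provider and location
CREATE TABLE events (ts TIMESTAMP, payload STRING)
USING parquet
LOCATION '/data/events';
```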

On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> A small correction: the example I provided for vice versa is not really
> a correct case of vice versa. It's actually the same case (intended to
> use rule 2, which is not the default) but with a different result.
>
> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>> My concern is that although we think of the change simply as making
>> "USING provider" optional in rule 1, in reality the change most likely
>> swaps the default rule for CREATE TABLE, which was "rule 2" and is now
>> "rule 1" (it would be the happy case of the migration doc if the swap
>> happens as intended), and there are still a couple of things which make
>> a query fall into rule 2, which is non-trivial to reason about and also
>> not easy to explain.
>>
>> I only mentioned ROW FORMAT and STORED AS as cases that fall into rule
>> 2 to simplify the problem statement, but they're not the only ones -
>> using col_name1:col_type1 makes a query fall into rule 2 regardless of
>> any other properties.
>>
>> Why does this matter? In Spark 2.x, if end users want to use rule 1
>> (which is not the default) and specify parameters which are only
>> available in rule 1, it clearly requires "USING provider" - the parser
>> will throw an error if there are any mistakes in specifying parameters.
>> End users can clearly state their intention as to which rule the query
>> should be bound to. If the query fails to bind to the rule as intended,
>> it's simply denied.
>>
>> In Spark 3.x, the parser may not help figure out the case where end
>> users intend to use rule 2 (which is not the default) but make some
>> mistake in specifying parameters - the query could just be "silently"
>> bound to rule 1 and may even be executed without any error. The reverse
>> also happens, but in an odd way - e.g. CREATE EXTERNAL TABLE ...
>> LOCATION fails with the weird message "CREATE EXTERNAL TABLE is not
>> supported".
>>
>> It's deterministic for end users only if they fully understand the
>> difference between the two and also understand how Spark applies the
>> rules to make the query fall into one of them. I'm not sure how
>> improving documentation alone can make things clear, but if the
>> approach is to explain the difference between the rules and give tips
>> to make a query bind to a specific rule, the same could be applied to
>> the parser rules to address the root cause.
>>
>>
>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
>>> native and Hive syntax. I'm fine with some changes to make it less
>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>> is false.
>>>
>>> I still don't get your point about what the real problem is for end
>>> users. There is no ambiguity, as the behavior is deterministic,
>>> although relying on optional fields and rule order is bad. It's hard
>>> to document, but I don't think that's a big problem to end users.
>>>
>>> For the legacy config, it does make the implementation more complicated,
>>> but it's invisible to most end users (we don't document it) and can be
>>> super useful to some users who want the queries to keep working in 3.0
>>> without rewriting.
>>>
>>> If your only concern is documentation, I totally agree that we should
>>> improve it.
>>>
>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> Thanks for sharing your view.
>>>>
>>>> I agree with you that it's good for Spark to promote Spark's own
>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>> TABLE syntax unchanged - it's described as a "convenience", but I'm
>>>> not sure I can agree.
>>>>
>>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>>> statement clearer:
>>>>
>>>> I think the parser implementation around CREATE TABLE brings
>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>> before because we forced users to specify "USING provider" if it's
>>>> not a Hive table. Now it's either the default provider or Hive
>>>> depending on which options are provided, which seems non-trivial to
>>>> reason about. (End users would never know, as it comes entirely from
>>>> the parser rules.)
>>>>
>>>> I see this as an issue of "not breaking old behavior". The parser
>>>> rule gets quite complicated due to supporting the legacy config. Not
>>>> breaking anything would make us stuck eventually.
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>
>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you can
>>>> set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>
>>>>
>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided, and we didn't
>>>> describe anything for this.
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>
>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1 col_type1
>>>> [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>> AS select_statement ]
>>>>
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>
>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>
>>>>
>>>> At least we should describe that the parser will try to match the
>>>> first case (create table ~ using data source) and fall back to the
>>>> second case; even if we describe this, it's not intuitive to reason
>>>> about which rule a DDL query will fall into. As I commented earlier,
>>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query
>>>> fall into the second case, but they're described as "optional", so
>>>> it's hard to catch the gotcha.
>>>>
>>>> Furthermore, while we document the syntax as above, in reality we
>>>> allow "EXTERNAL" in the first rule (and throw an error), which ends
>>>> up breaking the existing DDL query "CREATE EXTERNAL TABLE ~
>>>> LOCATION". Even if we add "USING hive", the parser will throw an
>>>> error. It now requires "ROW FORMAT" or "STORED AS".
>>>>
>>>>
>>>> Simply put, do we really think end users should stop and try to
>>>> match their query against the parser rules (or their order, when we
>>>> explain it in the doc) by themselves to understand which provider the
>>>> table will use? I'm sorry, but I think we are making a bad assumption
>>>> about end users, which is a serious problem.
>>>>
>>>> If we really want to promote Spark's own syntax for CREATE TABLE,
>>>> would it really matter to treat the Hive CREATE TABLE as the
>>>> "exceptional" one and isolate the two from each other? What's the
>>>> point of providing a legacy config to go back to the old behavior if
>>>> we fear breaking something even to make it better or clearer? If we
>>>> do think the table provider is important (hence the change was made),
>>>> is it still a trivial problem that the provider is affected by
>>>> specifying "optional" fields?
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>>> syntax instead of the Hive one. Previously these two rules were mutually
>>>>> exclusive because the native syntax required the USING clause while the
>>>>> Hive syntax made the ROW FORMAT or STORED AS clause optional.
>>>>>
>>>>> It's a good move to make the USING clause optional, which makes it
>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>> order. But the user-facing behavior looks fine.
>>>>>
>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in 3.0.
>>>>> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then
>>>>> CREATE EXTERNAL TABLE creates Hive table like 2.4.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
A small correction: the example I gave for the "vice versa" scenario is not
actually a vice-versa case. It is the same case (the user intends rule 2,
which is not the default) but with a different result.

>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Thanks for sharing your view.
>>>
>>> I agree it's good for Spark to promote its own CREATE TABLE syntax. The
>>> thing is, we still leave the Hive CREATE TABLE syntax in place - it's
>>> described as a "convenience", but I'm not sure I can agree.
>>>
>>> I'll quote my comments from SPARK-31136 here again to make the problem
>>> statement clearer:
>>>
>>> I think the parser implementation around CREATE TABLE introduces ambiguity
>>> which is not documented anywhere. It wasn't ambiguous before, because we
>>> forced users to specify "USING provider" for non-Hive tables. Now a query
>>> creates either a default-provider table or a Hive table depending on which
>>> options are provided, which is non-trivial to reason about. (End users
>>> would never know why, as the choice is made entirely by the parser rules.)
>>>
>>> I see this as an instance of the "never break old behavior" problem. The
>>> parser rules get quite complicated in order to support the legacy config.
>>> Never breaking anything will eventually leave us stuck.
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>
>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you can
>>> set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>
>>>
>>> This is not accurate when "ROW FORMAT" / "STORED AS" is provided, and we
>>> don't document that exception anywhere.
>>>
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>
>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>     [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>     [ USING data_source ]
>>>     [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>     [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>     [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>         [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>         INTO num_buckets BUCKETS ]
>>>     [ LOCATION path ]
>>>     [ COMMENT table_comment ]
>>>     [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>     [ AS select_statement ]
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>
>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>     [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>     [ COMMENT table_comment ]
>>>     [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>         | ( col_name1, col_name2, ... ) ]
>>>     [ ROW FORMAT row_format ]
>>>     [ STORED AS file_format ]
>>>     [ LOCATION path ]
>>>     [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>     [ AS select_statement ]
>>>
>>>
>>> At a minimum we should document that the parser tries to match the first
>>> rule (CREATE TABLE ~ USING data_source) and falls back to the second one;
>>> yet even with that documented, it is not intuitive to reason about which
>>> rule a DDL query will fall into. As I commented earlier, "ROW FORMAT" and
>>> "STORED AS" are the clauses that push a DDL query into the second rule,
>>> but they are described as "optional", so the gotcha is easy to miss.
>>>
>>> Furthermore, while we document the syntax as above, in reality the first
>>> rule also accepts "EXTERNAL" (only to throw an error), which breaks the
>>> existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even adding "USING
>>> hive" makes the parser throw an error; it now requires "ROW FORMAT" or
>>> "STORED AS".
>>>
>>>
>>> Simply put: do we really think end users should stop and match their
>>> query against the parser rules (or against the rule order, if we explain
>>> it in the docs) by themselves to understand which provider their table
>>> will use? I'm sorry, but I think we are making a bad assumption about end
>>> users, which is a serious problem.
>>>
>>> If we really want to promote Spark's own CREATE TABLE, would it really
>>> hurt to treat the Hive CREATE TABLE as the "exceptional" one and isolate
>>> the two from each other? What's the point of providing a legacy config to
>>> restore the old behavior if we're too afraid of breaking anything to make
>>> the syntax better or clearer? If we think the table provider is important
>>> (hence the change was made), is it really a trivial problem that the
>>> provider is decided by "optional" fields?

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
My concern is that, although we think of the change as simply marking
"USING provider" optional in rule 1, in reality the change most likely
swaps the default rule for CREATE TABLE - it was rule 2 and is now rule 1
(that's the happy case the migration doc describes, when the swap happens
as intended) - and there are still a couple of things that make a query
fall into rule 2, which is non-trivial to reason about and not easy to
explain.

I mentioned only ROW FORMAT and STORED AS as triggers for rule 2 to
simplify the problem statement, but they are not the only ones - writing
col_name1:col_type1 (a colon between column name and type) also makes the
query fall into rule 2, regardless of any other properties.

Why does this matter? In Spark 2.x, if end users wanted to use rule 1
(which was not the default) and specified parameters only available in
rule 1, the query clearly required "USING provider" - the parser would
throw an error on any mistake in the parameters. End users could state
clearly which rule a query should bind to, and if the query failed to
bind to the intended rule, it was simply rejected.

In Spark 3.x, the parser may not help in the case where end users intend
to use rule 2 (which is not the default) but make a mistake in the
parameters - the query can just "silently" bind to rule 1 and may even
execute without any error. The reverse also happens, in an odd way - e.g.
CREATE EXTERNAL TABLE ... LOCATION fails with the confusing message that
CREATE EXTERNAL TABLE is not supported.

The behavior is deterministic for end users only if they fully understand
the difference between the two rules and how Spark applies them to bind a
query to one of them. I'm not sure documentation improvements alone can
make this clear; and if the approach is to explain the difference between
the rules and give tips for binding a query to a specific rule, the same
knowledge could instead be applied to the parser rules themselves to
address the root cause.
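The "silent binding" hazard above can be made concrete with a toy model. This is not the real parser (a real parser rejects many malformed strings outright, while this sketch only looks for keywords), and the function name is made up:

```python
def bound_rule(sql: str) -> str:
    # Toy stand-in for the 3.0 rule order (native rule first, Hive rule
    # second); plain keyword matching, not the real grammar.
    s = sql.upper()
    if ("ROW FORMAT" in s) or ("STORED AS" in s):
        return "rule 2 (Hive)"
    return "rule 1 (native, spark.sql.sources.default)"

# The same statement that created a Hive table by default in Spark 2.4
# silently binds to the native rule in 3.0:
print(bound_rule("CREATE TABLE logs (a INT, b STRING)"))
# Adding one "optional" clause flips the provider:
print(bound_rule("CREATE TABLE logs (a INT, b STRING) STORED AS orc"))
```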



Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
native and Hive syntax. I'm fine with some changes to make it less
confusing, as long as the user-facing behavior doesn't change. For example,
define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
is false.

I still don't get what the real problem is for end users. There is no
ambiguity, since the behavior is deterministic, although relying on
optional fields and rule order is bad. It's hard to document, but I don't
think that's a big problem for end users.

For the legacy config, it does make the implementation more complicated,
but it's invisible to most end users (we don't document it) and can be
super useful to some users who want the queries to keep working in 3.0
without rewriting.

If your only concern is documentation, I totally agree that we should
improve it.
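The behavior gated by the legacy config can be sketched as a small resolution function. The config names (spark.sql.sources.default, spark.sql.legacy.createHiveTableByDefault.enabled) are quoted from the migration guide earlier in the thread; the function itself is a hypothetical model of the intended behavior, not Spark's implementation:

```python
def resolve_provider(has_using: bool, has_hive_clause: bool,
                     legacy_create_hive_by_default: bool) -> str:
    # Hypothetical model of CREATE TABLE provider resolution in 3.0.
    if has_using:
        return "explicit provider from USING"
    if has_hive_clause:  # "ROW FORMAT" / "STORED AS" present
        return "hive"
    # No provider hint at all: the legacy flag decides the default.
    if legacy_create_hive_by_default:
        return "hive"  # spark.sql.legacy.createHiveTableByDefault.enabled=true
    return "spark.sql.sources.default"

for has_using, has_hive, legacy in [(False, False, False), (False, False, True),
                                    (False, True, False), (True, False, False)]:
    print(has_using, has_hive, legacy, "->",
          resolve_provider(has_using, has_hive, legacy))
```

The debated ambiguity is exactly the third argument's row in this table: with no USING and no Hive-only clause, the provider depends on a config rather than on anything visible in the statement.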


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Jungtaek Lim <ka...@gmail.com>.
Thanks for sharing your view.

I agree it's good for Spark to promote its own CREATE TABLE syntax. The
thing is, we still leave the Hive CREATE TABLE syntax unchanged - it's
described as a "convenience", but I'm not sure I agree with that.

I'll quote my comments in SPARK-31136 here again to make the problem
statement clearer:

I think the parser implementation around CREATE TABLE brings ambiguity
which is not documented anywhere. It wasn't ambiguous before because we
forced users to specify "USING provider" if it's not a Hive table. Now the
table is either default provider or Hive according to which options are
provided, which is non-trivial to reason about. (End users would never
know, as the behavior comes entirely from parser rule ordering.)
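The rule-ordering problem can be sketched with a toy dispatcher (plain
Python, not the actual ANTLR grammar; all names here are illustrative):
when two rules both match a query, whichever rule is listed first silently
wins, and the second becomes unreachable.

```python
# Toy model of ordered parser rules: each rule is (name, predicate), and
# the first predicate that matches wins. This mirrors how a grammar with
# two overlapping "create table" alternatives resolves purely by rule
# order -- it is NOT Spark's real parser.

RULES = [
    ("native", lambda sql: "create table" in sql),  # listed first, wins ties
    ("hive",   lambda sql: "create table" in sql),  # same predicate: unreachable
]

def resolve(sql: str) -> str:
    """Return the name of the first rule that matches the query."""
    sql = sql.lower()
    for name, matches in RULES:
        if matches(sql):
            return name
    raise ValueError("no rule matched")

# Both rules match, but rule order silently picks "native":
print(resolve("CREATE TABLE t (a INT)"))  # -> native
```

Nothing in the query itself tells the user which of the two matching rules
was chosen; that is exactly the reasoning burden being discussed here.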

I see this as an issue of "not breaking old behavior". The parser rule
gets pretty complicated due to supporting the legacy config. Refusing to
break anything will leave us stuck eventually.

https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md

Since Spark 3.0, CREATE TABLE without a specific provider will use the
value of spark.sql.sources.default as its provider. In Spark version 2.4
and earlier, it was hive. To restore the behavior before Spark 3.0, you can
set spark.sql.legacy.createHiveTableByDefault.enabled to true.


This is not true when "ROW FORMAT" / "STORED AS" is provided, and we don't
describe that anywhere.

https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md

CREATE TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
    [ USING data_source ]
    [ OPTIONS ( key1=val1, key2=val2, ... ) ]
    [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
    [ CLUSTERED BY ( col_name3, col_name4, ... )
        [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
        INTO num_buckets BUCKETS ]
    [ LOCATION path ]
    [ COMMENT table_comment ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]


https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md

CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
    [ COMMENT table_comment ]
    [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
        | ( col_name1, col_name2, ... ) ]
    [ ROW FORMAT row_format ]
    [ STORED AS file_format ]
    [ LOCATION path ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]


At least we should describe that the parser will try to match the first
rule (CREATE TABLE ~ USING data source) and fall back to the second; but
even if we document this, it's not intuitive to reason about which rule a
DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED
AS" are the options which make a DDL query fall into the second rule, but
they're documented as "optional", so the gotcha is hard to catch.

Furthermore, while we document the syntax as above, in reality the first
rule also accepts "EXTERNAL" (and throws an error), which ends up breaking
the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even if we add
"USING hive", the parser will still throw an error; it now requires "ROW
FORMAT" or "STORED AS".


Simply put, do we really think end users should stop and match their query
against the parser rules (or rule order, if we explain it in the doc) by
themselves to understand which provider the table will use? I'm sorry, but
I think we are making a bad assumption about end users, which is a serious
problem.

If we really want to promote Spark's own CREATE TABLE, then would it
really matter to treat the Hive CREATE TABLE as the "exceptional" one and
keep the two rules isolated from each other? What's the point of providing
a legacy config to go back to the old behavior if we're afraid of breaking
something to make it better or clearer? If we think the table provider is
important (hence the change was made), is it really a trivial problem that
the provider is changed by specifying "optional" fields?


On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cl...@gmail.com> wrote:

> I think the general guideline is to promote Spark's own CREATE TABLE
> syntax instead of the Hive one. Previously these two rules were mutually
> exclusive because the native syntax required the USING clause, while the
> Hive syntax made the ROW FORMAT and STORED AS clauses optional.
>
> It's a good move to make the USING clause optional, which makes it easier
> to write the native CREATE TABLE syntax. Unfortunately, it leads to some
> conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
> problem here. If a user just writes CREATE TABLE without USING or ROW
> FORMAT or STORED AS, does it matter what table we create? Internally the
> parser rules conflict and we pick the native syntax depending on the rule
> order. But the user-facing behavior looks fine.
>
> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0.
> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then
> CREATE EXTERNAL TABLE creates a Hive table, as in 2.4.
>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Posted by Wenchen Fan <cl...@gmail.com>.
I think the general guideline is to promote Spark's own CREATE TABLE
syntax instead of the Hive one. Previously these two rules were mutually
exclusive because the native syntax required the USING clause, while the
Hive syntax made the ROW FORMAT and STORED AS clauses optional.

It's a good move to make the USING clause optional, which makes it easier
to write the native CREATE TABLE syntax. Unfortunately, it leads to some
conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
problem here. If a user just writes CREATE TABLE without USING or ROW
FORMAT or STORED AS, does it matter what table we create? Internally the
parser rules conflict and we pick the native syntax depending on the rule
order. But the user-facing behavior looks fine.

CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0.
Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then
CREATE EXTERNAL TABLE creates a Hive table, as in 2.4.
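The suggestion above can be sketched as a small standalone Python check
(purely illustrative, not the actual grammar change; the function name is
made up): with EXTERNAL removed from the native rule, any CREATE EXTERNAL
TABLE statement routes to the Hive rule, restoring the 2.4 behavior.

```python
import re

def classify_create_table_proposed(sql: str) -> str:
    """Sketch of the proposal: EXTERNAL always selects the Hive rule."""
    s = sql.upper()
    if "ROW FORMAT" in s or "STORED AS" in s or re.search(r"\bEXTERNAL\b", s):
        return "hive"    # Hive rule; CREATE EXTERNAL TABLE works again
    return "native"      # native rule for everything else

print(classify_create_table_proposed(
    "CREATE EXTERNAL TABLE t (a INT) LOCATION '/tmp/t'"))  # -> hive
print(classify_create_table_proposed(
    "CREATE TABLE t (a INT)"))                             # -> native
```

Under this sketch no valid 2.4 query throws, at the cost of EXTERNAL being
unavailable in the native syntax.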

On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <ka...@gmail.com>
wrote:

> Hi devs,
>
> I'd like to initiate discussion and hear the voices for resolving
> ambiguous parser rule between two "create table"s being brought by
> SPARK-30098 [1].
>
> Previously, "create table" parser rules were clearly distinguished via
> "USING provider", which was very intuitive and deterministic: a DDL
> query creates a "Hive" table unless "USING provider" is specified.
> (Please refer to the parser rule in branch-2.4 [2].)
>
> After SPARK-30098, "create table" parser rules became ambiguous (please
> refer to the parser rule in branch-3.0 [3]) - the only factors
> differentiating the two rules are "ROW FORMAT" and "STORED AS", which are
> both defined as "optional". Now the outcome relies on the "order" of the
> parser rules, which end users have no way to reason about, and is very
> unintuitive.
>
> Furthermore, the undocumented handling of EXTERNAL (added to the first
> rule to provide a better error message) brought more confusion (I've
> described the broken existing query in SPARK-30436 [4]).
>
> Personally I'd like to see the two rules made mutually exclusive, instead
> of trying to document the difference and telling end users to be careful
> about their queries. I see two ways to make the rules mutually exclusive:
>
> 1. Add some identifier in create Hive table rule, like `CREATE ... "HIVE"
> TABLE ...`.
>
> pros. This is the simplest way to distinguish between the two rules.
> cons. This would lead end users to change their query if they intend to
> create a Hive table. (Given we will also provide a legacy option, I feel
> this is acceptable.)
>
> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>
> pros. Less invasive for existing queries.
> cons. Less intuitive, because these clauses have been optional and would
> now become mandatory to fall into the second rule.
>
> Would like to hear everyone's voices; better ideas are welcome!
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> https://issues.apache.org/jira/browse/SPARK-30098
> 2.
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
> 3.
> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
> 4. https://issues.apache.org/jira/browse/SPARK-30436
>
>