Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/10/01 18:20:43 UTC

Re: [DISCUSS] Syntax for table DDL

What do you mean by consistent with the syntax in SqlBase.g4? These aren’t
currently defined, so we need to decide what syntax to support. There are
more details below, but the syntax I’m proposing is more standard across
databases than Hive, which uses confusing and non-standard syntax.

I doubt that we want to support Hive syntax for a few reasons. Hive uses
the same column CHANGE statement for multiple purposes, so it ends up with
strange patterns for simple tasks, like updating the column’s type:

ALTER TABLE t CHANGE a1 a1 INT;

The column name is doubled because old name, new name, and type are always
required. So you have to know the type of a column to change its name and
you have to double up the name to change its type. Hive also allows a
couple other oddities:

   - Column reordering with FIRST and AFTER keywords. Column reordering is
   tricky to get right so I’m not sure we want to add it.
   - RESTRICT and CASCADE to signal whether to change all partitions or
   not. Spark doesn’t support partition-level schemas except through Hive, and
   even then I’m not sure how reliable it is.
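To make the oddity concrete, here are a couple of hypothetical Hive statements (the column names a1, a2, and b are made up for illustration): a rename still restates the type, and a reorder restates both name and type:

```sql
-- Hive: rename a1 to a2; the existing INT type must be restated
ALTER TABLE t CHANGE a1 a2 INT;

-- Hive: move a1 after b; name and type are repeated even though neither changes
ALTER TABLE t CHANGE a1 a1 INT AFTER b;
```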

I know that we wouldn’t necessarily have to support these features from
Hive, but I’m pointing them out to ask the question: why copy Hive’s syntax
if it is unlikely that Spark will implement all of the “features”? I’d
rather go with SQL syntax from databases like PostgreSQL or others that are
more standard and common.

The more “standard” versions of these statements are like what I’ve
proposed:

   - ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is
   used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and
   Oracle. COLUMN is optional in Oracle and TYPE is omitted by databases
   other than PostgreSQL. I think we could easily add MODIFY as an
   alternative to the second ALTER (and maybe alternatives like UPDATE and
   CHANGE) and make both TYPE and COLUMN optional.
   - ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This
   syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same
   syntax as Hive and it appears that SQL Server doesn’t have this statement.
   This also matches the table rename syntax, which uses TO.
   - ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
   matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
   optional. Most don’t allow deleting multiple columns, but it’s a reasonable
   extension.
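As a sketch, with a hypothetical table named events and made-up column names, the three proposed statements would read:

```sql
-- Update a column's type
ALTER TABLE events ALTER COLUMN ts TYPE timestamp;

-- Rename a column; TO mirrors the table rename syntax
ALTER TABLE events RENAME COLUMN ts TO event_ts;

-- Drop a single column, or a list under the proposed extension
ALTER TABLE events DROP COLUMN ip;
ALTER TABLE events DROP COLUMNS ip, host;
```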

While we’re on the subject of ALTER TABLE DDL, I should note that all of
the databases use ADD COLUMN syntax that differs from Hive (and currently,
Spark):

   - ALTER TABLE ident ADD COLUMN qualifiedName dataType (',' qualifiedName
   dataType)*: All other databases I looked at use ADD COLUMN, but not all
   of them support adding multiple columns at the same time. Hive requires (
   and ) enclosing the columns and uses the COLUMNS keyword instead of
   COLUMN. I think that Spark should be updated to make the parens optional
   and to support both keywords, COLUMN and COLUMNS.
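For comparison, again using a hypothetical events table, Hive's current form and the proposed relaxed form would look like:

```sql
-- Hive today: parens required, only the COLUMNS keyword accepted
ALTER TABLE events ADD COLUMNS (ip string, host string);

-- Proposed: parens optional, COLUMN and COLUMNS both accepted
ALTER TABLE events ADD COLUMN ip string;
ALTER TABLE events ADD COLUMNS ip string, host string;

-- With qualifiedName, a field could also be added to a nested column
ALTER TABLE events ADD COLUMN point.z double;
```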

What does everyone think? Is it reasonable to use the more standard syntax
instead of using Hive as a base?

rb

On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <ga...@gmail.com> wrote:

> Are they consistent with the current syntax defined in SqlBase.g4? I think
> we are following the Hive DDL syntax:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>
> Ryan Blue <rb...@netflix.com.invalid> 于2018年9月28日周五 下午3:47写道:
>
>> Hi everyone,
>>
>> I’m currently working on new table DDL statements for v2 tables. For
>> context, the new logical plans for DataSourceV2 require a catalog interface
>> so that Spark can create tables for operations like CTAS. The proposed
>> TableCatalog API also includes an API for altering those tables so we can
>> make ALTER TABLE statements work. I’m implementing those DDL statements,
>> which will make it into upstream Spark when the TableCatalog PR is merged.
>>
>> Since I’m adding new SQL statements that don’t yet exist in Spark, I want
>> to make sure that the syntax I’m using in our branch will match the syntax
>> we add to Spark later. I’m basing this proposed syntax on PostgreSQL
>> <https://www.postgresql.org/docs/current/static/ddl-alter.html>.
>>
>>    - *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>    qualifiedName TYPE dataType.
>>    - *Rename column*: ALTER TABLE tableIdentifier RENAME COLUMN
>>    qualifiedName TO qualifiedName
>>    - *Drop column*: ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS)
>>    qualifiedNameList
>>
>> A few notes:
>>
>>    - Using qualifiedName in these rules allows updating nested types,
>>    like point.x.
>>    - Updates and renames can only alter one column, but drop can drop a
>>    list.
>>    - Rename can’t move types and will validate that if the TO name is
>>    qualified, that the prefix matches the original field.
>>    - I’m also changing ADD COLUMN to support adding fields to nested
>>    columns by using qualifiedName instead of identifier.
>>
>> Please reply to this thread if you have suggestions based on a different
>> SQL engine or want this syntax to be different for another reason. Thanks!
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Syntax for table DDL

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sounds good. I'll plan on adding a PR with Hive's CHANGE syntax in addition
to what I've proposed here.

I have all of these working in our Spark distribution, so I'm just waiting
on finalizing the TableCatalog API to submit these upstream.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Syntax for table DDL

Posted by Wenchen Fan <cl...@gmail.com>.
Thank you Ryan for proposing the DDL syntax! I think it's good to follow
mainstream databases, and the proposed syntax looks very reasonable.

About Hive compatibility, I think it's not that important now, but it's
still good if we keep it. Shall we support the Hive syntax as an
alternative? It seems not very hard, just a few more ANTLR rules. It would
be better if we could make the ANTLR grammar extensible and allow other
data sources to define custom SQL syntax.

Anyway, I think they are orthogonal. We can go ahead with the proposed
syntax here, and add Hive-compatible syntax later.


Re: [DISCUSS] Syntax for table DDL

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'd say that it was important to be compatible with Hive in the past, but
that's becoming less important over time. Spark is well established with
Hadoop users and I think the focus moving forward should be to make Spark
more predictable as a SQL engine for people coming from more traditional
databases.

That said, I think there is no problem supporting the alter syntax for both
Hive/MySQL and the more standard versions.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Syntax for table DDL

Posted by Felix Cheung <fe...@hotmail.com>.
I think it has been an important “selling point” that Spark is “mostly compatible” with Hive DDL.

I have seen a lot of teams suffer from switching between Presto and Hive dialects.

So one question I have is: are we at a point of switching from Hive-compatible syntax to ANSI SQL?

Perhaps a more critical question: what does it take to get the platform to support both, by making the ANTLR grammar extensible?



________________________________
From: Alessandro Solimando <al...@gmail.com>
Sent: Tuesday, October 2, 2018 12:35 AM
To: rblue@netflix.com
Cc: Xiao Li; dev
Subject: Re: [DISCUSS] Syntax for table DDL

I agree with Ryan, a "standard" and more widely adopted syntax is usually a good idea, with possibly some slight improvements like "bulk deletion" of columns (especially because both the syntax and the semantics are clear), rather than stay with Hive syntax at any cost.

I am personally following this PR with a lot of interest, thanks for all the work along this direction.

Best regards,
Alessandro

On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rb...@netflix.com.invalid> wrote:

What do you mean by consistent with the syntax in SqlBase.g4? These aren’t currently defined, so we need to decide what syntax to support. There are more details below, but the syntax I’m proposing is more standard across databases than Hive, which uses confusing and non-standard syntax.

I doubt that we want to support Hive syntax for a few reasons. Hive uses the same column CHANGE statement for multiple purposes, so it ends up with strange patterns for simple tasks, like updating the column’s type:

ALTER TABLE t CHANGE a1 a1 INT;


The column name is doubled because old name, new name, and type are always required. So you have to know the type of a column to change its name and you have to double up the name to change its type. Hive also allows a couple other oddities:

  *   Column reordering with FIRST and AFTER keywords. Column reordering is tricky to get right so I’m not sure we want to add it.
  *   RESTRICT and CASCADE to signal whether to change all partitions or not. Spark doesn’t support partition-level schemas except through Hive, and even then I’m not sure how reliable it is.

I know that we wouldn’t necessarily have to support these features from Hive, but I’m pointing them out to ask the question: why copy Hive’s syntax if it is unlikely that Spark will implement all of the “features”? I’d rather go with SQL syntax from databases like PostgreSQL or others that are more standard and common.

The more “standard” versions of these statements are like what I’ve proposed:

  *   ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and Oracle. COLUMN is optional in Oracle and TYPE is omitted by databases other than PostgreSQL. I think we could easily add MODIFY as an alternative to the second ALTER (and maybe alternatives like UPDATE and CHANGE) and make both TYPE and COLUMN optional.
  *   ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same syntax as Hive and it appears that SQL Server doesn’t have this statement. This also matches the table rename syntax, which uses TO.
  *   ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN optional. Most don’t allow deleting multiple columns, but it’s a reasonable extension.
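In SqlBase.g4 terms, the three statements above might be sketched as alternatives like the following (hypothetical rule labels, not the actual grammar):

```antlr
statement
    : ALTER TABLE tableIdentifier ALTER COLUMN qualifiedName TYPE dataType       #alterColumnType
    | ALTER TABLE tableIdentifier RENAME COLUMN qualifiedName TO qualifiedName   #renameColumn
    | ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS) qualifiedNameList      #dropColumns
    ;
```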

While we’re on the subject of ALTER TABLE DDL, I should note that all of the databases use ADD COLUMN syntax that differs from Hive (and currently, Spark):

  *   ALTER TABLE ident ADD COLUMN qualifiedName dataType (',' qualifiedName dataType)*: All other databases I looked at use ADD COLUMN, but not all of them support adding multiple columns at the same time. Hive requires ( and ) enclosing the columns and uses the COLUMNS keyword instead of COLUMN. I think that Spark should be updated to make the parens optional and to support both keywords, COLUMN and COLUMNS.
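For comparison, the two forms side by side (illustrative):

```sql
-- Hive (and current Spark): parens required, COLUMNS keyword only
ALTER TABLE t ADD COLUMNS (c1 INT, c2 STRING);

-- Proposed: parens optional, COLUMN or COLUMNS both accepted
ALTER TABLE t ADD COLUMN c1 INT, c2 STRING;
```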

What does everyone think? Is it reasonable to use the more standard syntax instead of using Hive as a base?

rb

On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <ga...@gmail.com> wrote:
Are they consistent with the current syntax defined in SqlBase.g4? I think we are following the Hive DDL syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column

Ryan Blue <rb...@netflix.com.invalid> 于2018年9月28日周五 下午3:47写道:

Hi everyone,

I’m currently working on new table DDL statements for v2 tables. For context, the new logical plans for DataSourceV2 require a catalog interface so that Spark can create tables for operations like CTAS. The proposed TableCatalog API also includes an API for altering those tables so we can make ALTER TABLE statements work. I’m implementing those DDL statements, which will make it into upstream Spark when the TableCatalog PR is merged.

Since I’m adding new SQL statements that don’t yet exist in Spark, I want to make sure that the syntax I’m using in our branch will match the syntax we add to Spark later. I’m basing this proposed syntax on PostgreSQL <https://www.postgresql.org/docs/current/static/ddl-alter.html>.

  *   Update data type: ALTER TABLE tableIdentifier ALTER COLUMN qualifiedName TYPE dataType
  *   Rename column: ALTER TABLE tableIdentifier RENAME COLUMN qualifiedName TO qualifiedName
  *   Drop column: ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS) qualifiedNameList

A few notes:

  *   Using qualifiedName in these rules allows updating nested types, like point.x.
  *   Updates and renames can only alter one column, but drop can drop a list.
  *   Rename can’t move fields between structs and will validate that, if the TO name is qualified, the prefix matches the original field.
  *   I’m also changing ADD COLUMN to support adding fields to nested columns by using qualifiedName instead of identifier.
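For example, with a struct column named point, the rules above would allow statements like these (illustrative, assuming the proposed syntax):

```sql
-- Update the type of a nested field
ALTER TABLE t ALTER COLUMN point.x TYPE double;

-- Rename a nested field; the TO prefix must match the original struct
ALTER TABLE t RENAME COLUMN point.x TO point.x2;

-- Add a new field to an existing struct
ALTER TABLE t ADD COLUMN point.z double;
```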

Please reply to this thread if you have suggestions based on a different SQL engine or want this syntax to be different for another reason. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
