You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Rui Wang <ru...@google.com> on 2018/10/09 18:02:08 UTC

Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

The change of documentation on Beam website is done (see:
create-external-table
<https://beam.apache.org/documentation/dsls/sql/create-external-table/>).
This proposal is finally transformed into master branch. Thanks all.

-Rui

On Fri, Sep 14, 2018 at 12:29 PM Anton Kedin <ke...@google.com> wrote:

> Raising this topic once more. The PR[1] has been open for a while, if
> there is no further input, I'm going to merge it by end of day.
>
> [1]: https://github.com/apache/beam/pull/6252
>
> Thank you,
> Anton
>
>
> On Wed, Aug 15, 2018 at 10:48 PM Tim <ti...@gmail.com> wrote:
>
>> +1 for CREATE EXTERNAL TABLE with similar reasoning given by others on
>> this thread.
>>
>> Tim
>>
>> On 15 Aug 2018, at 23:01, Charles Chen <cc...@google.com> wrote:
>>
>> +1 for CREATE EXTERNAL TABLE.  It is a good balance between the general
>> SQL expectation of having tables as an abstraction and reinforcing that
>> Beam does not store your data.
>>
>> On Wed, Aug 15, 2018 at 1:58 PM Rui Wang <ru...@google.com> wrote:
>>
>>> >  I think users will be more confused to find that 'CREATE TABLE'
>>> doesn't exist then to learn that it might not always create a table.
>>>
>>> >> I think that having CREATE TABLE do something unexpected or not do
>>> something expected (or do the opposite things depending on the table type
>>> or some flag) is worse than having users look up the correct way of
>>> creating a data source in Beam SQL without expecting something we don't
>>> promise.
>>>
>>> I agree on this. Enforcing users to look up documentation for the
>>> correct way is better than letting them use an ambiguous way that could
>>> fail their expectation.
>>>
>>>
>>> -Rui
>>>
>>> On Wed, Aug 15, 2018 at 1:46 PM Anton Kedin <ke...@google.com> wrote:
>>>
>>>> I think that something unique along the lines of `REGISTER EXTERNAL
>>>> DATA SOURCE` is probably fine, as it doesn't conflict with existing
>>>> behaviors of other dialects.
>>>>
>>>> > There is a lot of value in making sure our common operations closely
>>>> map to the equivalent common operations in other SQL dialects.
>>>>
>>>> We're trying to make opposite points using the same arguments :) A lot
>>>> of popular dialects make difference between CREATE TABLE and CREATE
>>>> EXTERNAL TABLE (or similar):
>>>>  - T-SQL:
>>>>       create:
>>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
>>>>       create external:
>>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
>>>>       external datasource:
>>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
>>>>  - PL/SQL:
>>>>       create:
>>>> https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
>>>>       create external:
>>>> https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
>>>>  - postgres:
>>>>       import foreign schema:
>>>> https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
>>>>       create table:
>>>> https://www.postgresql.org/docs/9.1/static/sql-createtable.html
>>>>  - redshift:
>>>>       create external schema:
>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
>>>>       create table:
>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
>>>>  - hive internal and external:
>>>> https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
>>>>
>>>> My understanding is that the behavior of create table is somewhat
>>>> similar in all of the above dialects, from the high-level perspective it
>>>> usually creates a persistent table in the current storage context
>>>> (database). That's not what Beam SQL's create table does right now, and my
>>>> opinion is that it should not be called create table for this reason.
>>>>
>>>> >  I think users will be more confused to find that 'CREATE TABLE'
>>>> doesn't exist then to learn that it might not always create a table.
>>>>
>>>> I think that having CREATE TABLE do something unexpected or not do
>>>> something expected (or do the opposite things depending on the table type
>>>> or some flag) is worse than having users look up the correct way of
>>>> creating a data source in Beam SQL without expecting something we don't
>>>> promise.
>>>>
>>>> >  (For example, a user guessing at the syntax of CREATE TABLE would
>>>> have a better experience with the error being "field LOCATION not
>>>> specified" rather than "operation CREATE TABLE not found".)
>>>>
>>>> They have to look it up anyway (what format is location for a Pubsub
>>>> topic? or is it a subscription?), and when doing so I think it would be
>>>> less confusing to read that to get data from Pubsub/Kafka/... in Beam SQL
>>>> you have to do something like `REGISTER EXTERNAL DATA SOURCE` than `CREATE
>>>> TABLE`.
>>>>
>>>> External tables and schemas don't have a standard approach and I don't
>>>> have a strong preference between any one from the above.
>>>>
>>>> On Wed, Aug 15, 2018 at 1:08 PM Rui Wang <ru...@google.com> wrote:
>>>>
>>>>> Adding dev@ back now.
>>>>>
>>>>> -Rui
>>>>>
>>>>> On Wed, Aug 15, 2018 at 1:01 PM Andrew Pilloud <ap...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Did we drop the dev list from this on purpose? (I haven't added it
>>>>>> back, but we probably should.)
>>>>>>
>>>>>> I'm in favor of sticking with the simple 'CREATE TABLE' and 'CREATE
>>>>>> SCHEMA' if there is only to be one option. Sticking with those names
>>>>>> minimizes both our deviation from other implementations and user surprise.
>>>>>> There is a lot of value in making sure our common operations closely map to
>>>>>> the equivalent common operations in other SQL dialects. I think users will
>>>>>> be more confused to find that 'CREATE TABLE' doesn't exist then to learn
>>>>>> that it might not always create a table. This minimizes the overhead of
>>>>>> learning our dialect of SQL and maximizes the odds that a user will be able
>>>>>> to guess at the syntax of something and have it work. (For example, a user
>>>>>> guessing at the syntax of CREATE TABLE would have a better experience with
>>>>>> the error being "field LOCATION not specified" rather than "operation
>>>>>> CREATE TABLE not found".)
>>>>>>
>>>>>> If the goal is clarity of the operation, how about 'REGISTER EXTERNAL DATA
>>>>>> SOURCE' and 'REGISTER EXTERNAL DATA SOURCE PROVIDER'? Those names
>>>>>> remove the ambiguity around the operation creating and the data source
>>>>>> being a table.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Wed, Aug 15, 2018 at 10:54 AM Anton Kedin <ke...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> My preference is to make `EXTERNAL` mandatory and only support
>>>>>>> `CREATE EXTERNAL TABLE` for existing semantics. My main reasons are:
>>>>>>>  - user friendliness, matching expectations, readability. Current
>>>>>>> `CREATE TABLE` is basically a `CREATE EXTERNAL TABLE`. It is confusing to
>>>>>>> users familiar with SQL who expect that `CREATE TABLE` will actually create
>>>>>>> a table;
>>>>>>>  - forward-compatibility. We could potentially support non-external
>>>>>>> `CREATE TABLE` at some point in the future, whatever semantics it might
>>>>>>> have. It will be wrong to use the same syntax for external and non-external
>>>>>>> CREATEs;
>>>>>>>
>>>>>>> I agree that typing extra word each time is not ideal, but my
>>>>>>> opinion is on the side that readability of code (including SQL) is
>>>>>>> important (how much time you spend reading / understanding code vs writing
>>>>>>> it) and we should try to improve it if we can. In case of DDL every
>>>>>>> non-trivial statement will already have a ton of unavoidable words (field
>>>>>>> names, types, location, options) so I would argue that adding extra one
>>>>>>> word would not noticeably reduce your happiness of writing it :) But it
>>>>>>> would improve readability and reduce ambiguity, which I think is worth it.
>>>>>>>
>>>>>>> I think that making it optional only introduces more confusion (e.g.
>>>>>>> what's the difference between the two DDL statements without reading the
>>>>>>> doc?) and would make situation worse.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Anton
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 15, 2018 at 10:24 AM Mingmin Xu <mi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I prefer to `CREATE EXTERNAL TABLE`. My question is, do you plan to
>>>>>>>> support both `CREATE TABLE` and `CREATE EXTERNAL TABLE`, by making
>>>>>>>> `EXTERNAL` as optional?
>>>>>>>>
>>>>>>>> On Wed, Aug 15, 2018 at 10:01 AM, Andrew Pilloud <
>>>>>>>> apilloud@google.com> wrote:
>>>>>>>>
>>>>>>>>> I think 'CREATE EXTERNAL TABLE' might make things a bit clearer
>>>>>>>>> from a documentation prospective, but I'd be really unhappy if I had to
>>>>>>>>> type out 'EXTERNAL' every time. (I have the same concern with 'CREATE
>>>>>>>>> EXTERNAL SCHEMA'.)
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> On Tue, Aug 14, 2018 at 12:38 PM Rui Wang <ru...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> I know you are probably using CREATE TABLE, Can I know your
>>>>>>>>>> thoughts on this?
>>>>>>>>>>
>>>>>>>>>> -Rui
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 14, 2018 at 10:22 AM Rui Wang <ru...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Mikhail! "Import" is an alternative option. It might be
>>>>>>>>>>> better.
>>>>>>>>>>>
>>>>>>>>>>> "create external" is being widely used by different systems with
>>>>>>>>>>> similar meaning so "create" usually is ok to external data sources.
>>>>>>>>>>>
>>>>>>>>>>> -Rui
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 14, 2018 at 9:38 AM Mikhail Gryzykhin <
>>>>>>>>>>> migryz@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The idea of clarification sounds good to me. I'd appreciate
>>>>>>>>>>>> that present, when I was triaging post-commit tests.
>>>>>>>>>>>>
>>>>>>>>>>>> Do we have any terms that specify connection to external table?
>>>>>>>>>>>> "CREATE" word triggers this reaction in my brain that there will be a new
>>>>>>>>>>>> table created. Adding "EXTERNAL" would already add distinction, but adding
>>>>>>>>>>>> something more explicit for the task might be even better.
>>>>>>>>>>>>
>>>>>>>>>>>> --Mikhail
>>>>>>>>>>>>
>>>>>>>>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 13, 2018 at 2:40 PM Rafael Fernandez <
>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Strictly speaking, they are not necessarily tables either. We
>>>>>>>>>>>>> could also introduce something like CREATE EXTERNAL DATA SOURCE (a-la
>>>>>>>>>>>>> T-SQL
>>>>>>>>>>>>> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017>),
>>>>>>>>>>>>> if it's somehow advantageous for us to leverage access patterns or restrict
>>>>>>>>>>>>> DML statements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think your idea of CREATE EXTERNAL TABLE is practical :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 13, 2018 at 2:12 PM Rui Wang <ru...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Community,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BeamSQL allows CREATE TABLE
>>>>>>>>>>>>>> <https://beam.apache.org/documentation/dsls/sql/create-table/>
>>>>>>>>>>>>>> statements to register virtual tables from external storage systems (e.g.
>>>>>>>>>>>>>> BigQuery).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BeamSQL is not a storage system, so any table registered by
>>>>>>>>>>>>>> "CREATE TABLE" statement is essentially equivalent to be registered by
>>>>>>>>>>>>>> "CREATE EXTERNAL TABLE", which requires the user to provide a LOCATION and
>>>>>>>>>>>>>> BeamSQL will register the table outside of current execution environment
>>>>>>>>>>>>>> based on LOCATION.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I propose to add EXTERNAL keyword to "CREATE TABLE" in
>>>>>>>>>>>>>> BeamSQL to help users understand they are registering tables, and BeamSQL
>>>>>>>>>>>>>> does not create non existing tables by running CREATE TABLE (at least on
>>>>>>>>>>>>>> some storage systems, if not all).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can make the EXTERNAL keyword either required or optional.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we make the EXTERNAL keyword required:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pros:
>>>>>>>>>>>>>> a. We can get rid of the registering table semantic on CREATE
>>>>>>>>>>>>>> TABLE.
>>>>>>>>>>>>>> b, We keep the room that we could add CREATE TABLE back in
>>>>>>>>>>>>>> the future if we want CREATE TABLE to create, rather than not only register
>>>>>>>>>>>>>> tables in BeamSQL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cons:
>>>>>>>>>>>>>> 1. CREATE TABLE syntax will not be supported so existing
>>>>>>>>>>>>>> BeamSQL pipelines which has CREATE TABLE require changes.
>>>>>>>>>>>>>> 2. It's required to type tedious EXTERNAL keyword every time,
>>>>>>>>>>>>>> especially in SQL Shell.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we make the EXTERNAL keyword optional, we will have
>>>>>>>>>>>>>> reversed pros and cons above.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any thoughts on adding EXTERNAL keyword, and make it required
>>>>>>>>>>>>>> or optional?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rui
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ----
>>>>>>>> Mingmin
>>>>>>>>
>>>>>>>