Posted to dev@spark.apache.org by Wenchen Fan <cl...@gmail.com> on 2020/10/06 14:06:28 UTC

Official support of CREATE EXTERNAL TABLE

Hi all,

I'd like to start a discussion thread about this topic, as it blocks an
important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
syntax.

A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
feature in Spark for Hive compatibility.

When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
... USING parquet`, the parser fails and tells you that EXTERNAL can't be
specified.

When you write Hive CREATE TABLE syntax, EXTERNAL can only be specified if a
LOCATION clause or path option is present. For example, `CREATE EXTERNAL
TABLE ... STORED AS parquet` is not allowed as there is no LOCATION clause
or path option. This is not 100% Hive compatible.
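
To make this concrete, here is a minimal sketch of both behaviors from Scala
(the table names and paths are made up, and the exact error messages may
differ between Spark versions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // Native syntax: the parser rejects EXTERNAL outright, with or without LOCATION.
    // spark.sql("CREATE EXTERNAL TABLE t1 (id INT) USING parquet LOCATION '/tmp/t1'")   // fails to parse

    // Hive syntax: EXTERNAL is accepted only when a LOCATION (or path option) is given.
    spark.sql("CREATE EXTERNAL TABLE t2 (id INT) STORED AS parquet LOCATION '/tmp/t2'")  // works
    // spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS parquet")                  // rejected: no LOCATION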

As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal
with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was,
or we can officially support it.

Please let us know your thoughts:
1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have you
used it in production before? For what use cases?
2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
It seems to me that it only makes sense for file sources, as the table
directory can be managed. I'm not sure how to interpret EXTERNAL in
catalogs like JDBC, Cassandra, etc.

For more details, please refer to the long discussion in
https://github.com/apache/spark/pull/28026

Thanks,
Wenchen

Re: Official support of CREATE EXTERNAL TABLE

Posted by 大啊 <be...@163.com>.
Personally, I think EXTERNAL is a special feature supported by Hive.
If Spark SQL wants to support it, we should only consider it for Hive.
We should only unify `CREATE EXTERNAL TABLE` in the parser and check for unsupported data sources.

At 2020-10-06 22:06:28, "Wenchen Fan" <cl...@gmail.com> wrote:

Hi all,



I'd like to start a discussion thread about this topic, as it blocks an important feature that we target for Spark 3.1: unify the CREATE TABLE SQL syntax.


A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility.


When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL can't be specified.


When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if LOCATION clause or path option is present. For example, `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no LOCATION clause or path option. This is not 100% Hive compatible.


As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was, or we can officially support it.


Please let us know your thoughts:
1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have you used it in production before? For what use cases?
2. As a catalog developer, how are you going to implement EXTERNAL TABLE? It seems to me that it only makes sense for file source, as the table directory can be managed. I'm not sure how to interpret EXTERNAL in catalogs like jdbc, cassandra, etc.


For more details, please refer to the long discussion in https://github.com/apache/spark/pull/28026


Thanks,
Wenchen

Re: Official support of CREATE EXTERNAL TABLE

Posted by Russell Spitzer <ru...@gmail.com>.
I don't feel differently than I did on the thread linked above: I think
treating "External" as a table option is still the safest way to go about
things. For the Cassandra catalog this option wouldn't appear on our
whitelist of allowed options, the same as "path" and other options that
don't apply to Cassandra.
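
Concretely, a catalog could enforce that with a check along these lines (a
minimal sketch in Scala; the allowed option names are illustrative and not
taken from the actual Cassandra connector):

    import java.util
    import scala.collection.JavaConverters._

    object TableOptionWhitelist {
      // Illustrative whitelist; "external", "path", "location" are simply absent,
      // so a CREATE TABLE carrying them would be rejected (or ignored) here.
      private val allowed = Set("provider", "comment")

      def validate(properties: util.Map[String, String]): Unit = {
        val rejected = properties.asScala.keySet.filterNot(k => allowed.contains(k.toLowerCase))
        if (rejected.nonEmpty) {
          throw new IllegalArgumentException(
            s"Options not supported by this catalog: ${rejected.mkString(", ")}")
        }
      }
    }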

On Tue, Oct 6, 2020 at 3:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>> syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file source, as the table
>> directory can be managed. I'm not sure how to interpret EXTERNAL in
>> catalogs like jdbc, cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Official support of CREATE EXTERNAL TABLE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don’t think Spark ever claims to be 100% Hive compatible.

I just found some relevant documentation
<https://docs.databricks.com/spark/latest/spark-sql/compatibility/hive.html#apache-hive-compatibility>
on this, where Databricks claims that “Apache Spark SQL in Databricks is
designed to be compatible with the Apache Hive, including metastore
connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the
Spark community has made the claim. That isn’t a claim of 100%
compatibility, but no one suggested that there was 100% compatibility.

On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:

> I don’t think Spark ever claims to be 100% Hive compatible.
>
> By accepting the EXTERNAL keyword in some circumstances, Spark is
> providing compatibility with Hive DDL. Yes, there are places where it
> breaks. The question is whether we should deliberately break what a Hive
> catalog could implement, when we know what Hive’s behavior is.
>
> CREATE EXTERNAL TABLE is not a Hive-specific feature
>
> Great. So there are other catalogs that could use it. Why should Spark
> choose to limit Hive’s interpretation of this keyword?
>
> While it is great that we seem to agree that Spark shouldn’t do this — now
> that Nessie was pointed out — I’m concerned that you still seem to think
> this is a choice that Spark could reasonably make. *Spark cannot
> arbitrarily choose how to interpret DDL for an external catalog*.
>
> You may not consider this arbitrary because there are other examples where
> location is required. But the Hive community made the choice that these
> clauses are orthogonal, so it is clearly a choice of the external system,
> and it is not Spark’s role to dictate how an external database should
> behave.
>
> I think the Nessie use case is good enough to justify the decoupling of
> EXTERNAL and LOCATION.
>
> It appears that we have consensus. This will be passed to catalogs, which
> can implement the behavior that they choose.
>
> On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> > I have some hive queries that I want to run on Spark.
>>
>> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
>> LOCATION can't help you too much here. If you do have this use case, we
>> need a much wider discussion about how to achieve it.
>>
>> For this particular topic, we need concrete use cases like Nessie
>> <https://projectnessie.org/tools/hive/>. It will be great to see more
>> concrete use cases, but I think the Nessie use case is good enough to
>> justify the decoupling of EXTERNAL and LOCATION.
>>
>> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
>> have it. That's why I think Hive-compatibility alone is not a reasonable
>> use case. For your reference:
>> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
>> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
>> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or
>> FILE_NAME option: doc
>> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
>> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
>> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>>
>> > with which Spark claims to be compatible
>>
>> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
>> diverged from Hive intentionally in several places, where we think the Hive
>> behavior was not reasonable and we shouldn't follow it.
>>
>> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>>> external table.
>>>
>>> I think there is some confusion about what Spark has to handle.
>>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>>> MetaStore that Spark connects to, and the general expectation is that Spark
>>> doesn’t change the meaning of table configuration. There are notable bugs
>>> where Spark has different behavior, but that is the expectation.
>>>
>>> In this particular case, we’re talking about what can be expressed in
>>> DDL that is sent to an external catalog. Spark could (unwisely) choose to
>>> disallow some DDL combinations, but the table is implemented through a
>>> plugin so the interpretation is up to the plugin. Spark has no role in
>>> choosing how to treat this table, unless it is loaded through Spark’s
>>> built-in catalog; in which case, see above.
>>>
>>> I don’t think Hive compatibility itself is a “use case”.
>>>
>>> Why?
>>>
>>> Hive is an external database that defines its own behavior and with
>>> which Spark claims to be compatible. If Hive isn’t a valid use case, then
>>> why is EXTERNAL supported at all?
>>>
>>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>>
>>>>> I don't think Hive compatibility itself is a "use case".
>>>>>
>>>> Ok let's add on top of this: I have some hive queries that I want to
>>>> run on Spark. I believe that makes it a use case.
>>>>
>>>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>>>> to create external tables without user-specified location, so that they can
>>>>> manage the table directory themselves and implement fancy features.
>>>>>
>>>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION.
>>>>> We should clearly document that, EXTERNAL and LOCATION are only applicable
>>>>> for file-based data sources, and catalog implementation should fail if the
>>>>> table has EXTERNAL or LOCATION property, but the table provider is not
>>>>> file-based.
>>>>>
>>>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>>>> an external table. Hive gives warning when you create managed tables with
>>>>> custom location, which means this behavior is not recommended. Shall we
>>>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>>>
>>>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>>>
>>>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>>>>> LOCATION. Why is this use case not a justification?
>>>>>>
>>>>>> Also, the option to keep behavior the same as before is not mutually
>>>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>>>> have the same behavior in its catalog. But Spark cannot just choose to
>>>>>> break compatibility with external systems by deciding when to fail certain
>>>>>> combinations of DDL options. Choosing not to allow external without
>>>>>> location when it is valid for Hive prevents building a compatible catalog.
>>>>>>
>>>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>>>> which enables branching and tagging table states across several table
>>>>>> formats and aims to be compatible with Hive.
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > As someone who's had the job of porting different SQL dialects to
>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>>>
>>>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2
>>>>>>> options we are discussing are:
>>>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must
>>>>>>> co-exists with LOCATION (or path option).
>>>>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>>>>
>>>>>>> I'm fine with option 2 if there are reasonable use cases. I think
>>>>>>> it's always safer to keep the behavior the same as before. If we want to
>>>>>>> change the behavior and follow option 2, we need use cases to justify it.
>>>>>>>
>>>>>>> For now, the only use case I see is for Hive compatibility and allow
>>>>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>>>>>> cases we are targeting?
>>>>>>>
>>>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> As someone who's had the job of porting different SQL dialects to
>>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>>>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>>>>>> sense.
>>>>>>>>
>>>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would summarize both the problem and the current state
>>>>>>>>> differently.
>>>>>>>>>
>>>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility
>>>>>>>>> with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table
>>>>>>>>> with EXTERNAL unless LOCATION is also present. *This “hidden
>>>>>>>>> feature” breaks compatibility with Hive SQL* because all
>>>>>>>>> combinations of EXTERNAL and LOCATION are valid in Hive, but
>>>>>>>>> creating an external table with a default location is not allowed by Spark.
>>>>>>>>> Note that Spark must still handle these tables because it shares a
>>>>>>>>> metastore with Hive, which can still create them.
>>>>>>>>>
>>>>>>>>> Now catalogs can be plugged in, the question is whether to pass
>>>>>>>>> the fact that EXTERNAL was in the CREATE TABLE statement to the
>>>>>>>>> v2 catalog handling a create command, or to suppress it and apply Spark’s
>>>>>>>>> rule that LOCATION must be present.
>>>>>>>>>
>>>>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>>>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL
>>>>>>>>> however Spark chooses to, but I think it is a poor choice to force
>>>>>>>>> different behavior on other catalogs.
>>>>>>>>>
>>>>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>>>>> behavior across catalogs. But hiding EXTERNAL would not
>>>>>>>>> accomplish that goal. Whether to physically delete data is a choice that is
>>>>>>>>> up to the catalog. Some catalogs have no “external” concept and will always
>>>>>>>>> drop data when a table is dropped. The ability to keep underlying data
>>>>>>>>> files is specific to a few catalogs, and whether that is controlled by
>>>>>>>>> EXTERNAL, the LOCATION clause, or something else is still up to
>>>>>>>>> the catalog implementation.
>>>>>>>>>
>>>>>>>>> I don’t think that there is a good reason to force catalogs to
>>>>>>>>> break compatibility with Hive SQL, while making it appear as though DDL is
>>>>>>>>> compatible. Because removing EXTERNAL would be a breaking change
>>>>>>>>> to the SQL parser, I think the best option is to pass it to v2 catalogs so
>>>>>>>>> the catalog can decide how to handle it.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>>>>>> blocks an important feature that we target for Spark 3.1: unify the CREATE
>>>>>>>>>> TABLE SQL syntax.
>>>>>>>>>>
>>>>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>>>
>>>>>>>>>> When you write native CREATE TABLE syntax such as `CREATE
>>>>>>>>>> EXTERNAL TABLE ... USING parquet`, the parser fails and tells
>>>>>>>>>> you that EXTERNAL can't be specified.
>>>>>>>>>>
>>>>>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be
>>>>>>>>>> specified if LOCATION clause or path option is present. For example, `CREATE
>>>>>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is
>>>>>>>>>> no LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>>>>>>
>>>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is
>>>>>>>>>> how to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature
>>>>>>>>>> as it was, or we can officially support it.
>>>>>>>>>>
>>>>>>>>>> Please let us know your thoughts:
>>>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to
>>>>>>>>>> do? Have you used it in production before? For what use cases?
>>>>>>>>>> 2. As a catalog developer, how are you going to implement
>>>>>>>>>> EXTERNAL TABLE? It seems to me that it only makes sense for file source, as
>>>>>>>>>> the table directory can be managed. I'm not sure how to interpret EXTERNAL
>>>>>>>>>> in catalogs like jdbc, cassandra, etc.
>>>>>>>>>>
>>>>>>>>>> For more details, please refer to the long discussion in
>>>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Wenchen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing
compatibility with Hive DDL. Yes, there are places where it breaks. The
question is whether we should deliberately break what a Hive catalog could
implement, when we know what Hive’s behavior is.

CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark
choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now
that Nessie was pointed out — I’m concerned that you still seem to think
this is a choice that Spark could reasonably make. *Spark cannot
arbitrarily choose how to interpret DDL for an external catalog*.

You may not consider this arbitrary because there are other examples where
location is required. But the Hive community made the choice that these
clauses are orthogonal, so it is clearly a choice of the external system,
and it is not Spark’s role to dictate how an external database should
behave.

I think the Nessie use case is good enough to justify the decoupling of
EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which
can implement the behavior that they choose.
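
From the plugin side, a minimal sketch of what that could look like, assuming
the keyword is surfaced to TableCatalog.createTable as an entry in the
properties map (the "external" key name and the drop-time behavior below are
assumptions for illustration, not something the API dictates):

    import java.util

    object ExternalFlag {
      def isExternal(properties: util.Map[String, String]): Boolean =
        "true".equalsIgnoreCase(properties.getOrDefault("external", "false"))

      // One Hive-style interpretation: keep the files for external tables and purge
      // them for managed ones. Another catalog is free to ignore the flag entirely.
      def purgeDataOnDrop(properties: util.Map[String, String]): Boolean =
        !isExternal(properties)
    }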

On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <cl...@gmail.com> wrote:

> > I have some hive queries that I want to run on Spark.
>
> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
> LOCATION can't help you too much here. If you do have this use case, we
> need a much wider discussion about how to achieve it.
>
> For this particular topic, we need concrete use cases like Nessie
> <https://projectnessie.org/tools/hive/>. It will be great to see more
> concrete use cases, but I think the Nessie use case is good enough to
> justify the decoupling of EXTERNAL and LOCATION.
>
> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
> have it. That's why I think Hive-compatibility alone is not a reasonable
> use case. For your reference:
> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
> option: doc
> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>
> > with which Spark claims to be compatible
>
> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
> diverged from Hive intentionally in several places, where we think the Hive
> behavior was not reasonable and we shouldn't follow it.
>
> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table.
>>
>> I think there is some confusion about what Spark has to handle.
>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>> MetaStore that Spark connects to, and the general expectation is that Spark
>> doesn’t change the meaning of table configuration. There are notable bugs
>> where Spark has different behavior, but that is the expectation.
>>
>> In this particular case, we’re talking about what can be expressed in DDL
>> that is sent to an external catalog. Spark could (unwisely) choose to
>> disallow some DDL combinations, but the table is implemented through a
>> plugin so the interpretation is up to the plugin. Spark has no role in
>> choosing how to treat this table, unless it is loaded through Spark’s
>> built-in catalog; in which case, see above.
>>
>> I don’t think Hive compatibility itself is a “use case”.
>>
>> Why?
>>
>> Hive is an external database that defines its own behavior and with which
>> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
>> EXTERNAL supported at all?
>>
>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> I don't think Hive compatibility itself is a "use case".
>>>>
>>> Ok let's add on top of this: I have some hive queries that I want to run
>>> on Spark. I believe that makes it a use case.
>>>
>>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>>> to create external tables without user-specified location, so that they can
>>>> manage the table directory themselves and implement fancy features.
>>>>
>>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION.
>>>> We should clearly document that, EXTERNAL and LOCATION are only applicable
>>>> for file-based data sources, and catalog implementation should fail if the
>>>> table has EXTERNAL or LOCATION property, but the table provider is not
>>>> file-based.
>>>>
>>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>>> an external table. Hive gives warning when you create managed tables with
>>>> custom location, which means this behavior is not recommended. Shall we
>>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>>
>>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>>
>>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>>>> LOCATION. Why is this use case not a justification?
>>>>>
>>>>> Also, the option to keep behavior the same as before is not mutually
>>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>>> have the same behavior in its catalog. But Spark cannot just choose to
>>>>> break compatibility with external systems by deciding when to fail certain
>>>>> combinations of DDL options. Choosing not to allow external without
>>>>> location when it is valid for Hive prevents building a compatible catalog.
>>>>>
>>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>>> which enables branching and tagging table states across several table
>>>>> formats and aims to be compatible with Hive.
>>>>>
>>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > As someone who's had the job of porting different SQL dialects to
>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>>
>>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2
>>>>>> options we are discussing are:
>>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>>>>>> with LOCATION (or path option).
>>>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>>>
>>>>>> I'm fine with option 2 if there are reasonable use cases. I think
>>>>>> it's always safer to keep the behavior the same as before. If we want to
>>>>>> change the behavior and follow option 2, we need use cases to justify it.
>>>>>>
>>>>>> For now, the only use case I see is for Hive compatibility and allow
>>>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>>>>> cases we are targeting?
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> As someone who's had the job of porting different SQL dialects to
>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>>>>> sense.
>>>>>>>
>>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would summarize both the problem and the current state
>>>>>>>> differently.
>>>>>>>>
>>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility
>>>>>>>> with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table
>>>>>>>> with EXTERNAL unless LOCATION is also present. *This “hidden
>>>>>>>> feature” breaks compatibility with Hive SQL* because all
>>>>>>>> combinations of EXTERNAL and LOCATION are valid in Hive, but
>>>>>>>> creating an external table with a default location is not allowed by Spark.
>>>>>>>> Note that Spark must still handle these tables because it shares a
>>>>>>>> metastore with Hive, which can still create them.
>>>>>>>>
>>>>>>>> Now catalogs can be plugged in, the question is whether to pass the
>>>>>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>>>>> catalog handling a create command, or to suppress it and apply Spark’s rule
>>>>>>>> that LOCATION must be present.
>>>>>>>>
>>>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL
>>>>>>>> however Spark chooses to, but I think it is a poor choice to force
>>>>>>>> different behavior on other catalogs.
>>>>>>>>
>>>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>>>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>>>>>> when a table is dropped. The ability to keep underlying data files is
>>>>>>>> specific to a few catalogs, and whether that is controlled by
>>>>>>>> EXTERNAL, the LOCATION clause, or something else is still up to
>>>>>>>> the catalog implementation.
>>>>>>>>
>>>>>>>> I don’t think that there is a good reason to force catalogs to
>>>>>>>> break compatibility with Hive SQL, while making it appear as though DDL is
>>>>>>>> compatible. Because removing EXTERNAL would be a breaking change
>>>>>>>> to the SQL parser, I think the best option is to pass it to v2 catalogs so
>>>>>>>> the catalog can decide how to handle it.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>>>>> blocks an important feature that we target for Spark 3.1: unify the CREATE
>>>>>>>>> TABLE SQL syntax.
>>>>>>>>>
>>>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>>
>>>>>>>>> When you write native CREATE TABLE syntax such as `CREATE
>>>>>>>>> EXTERNAL TABLE ... USING parquet`, the parser fails and tells you
>>>>>>>>> that EXTERNAL can't be specified.
>>>>>>>>>
>>>>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be
>>>>>>>>> specified if LOCATION clause or path option is present. For example, `CREATE
>>>>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is
>>>>>>>>> no LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>>>>>
>>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as
>>>>>>>>> it was, or we can officially support it.
>>>>>>>>>
>>>>>>>>> Please let us know your thoughts:
>>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>>>>> Have you used it in production before? For what use cases?
>>>>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>>>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>>>>>> catalogs like jdbc, cassandra, etc.
>>>>>>>>>
>>>>>>>>> For more details, please refer to the long discussion in
>>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Posted by Wenchen Fan <cl...@gmail.com>.
> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
LOCATION can't help you too much here. If you do have this use case, we
need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie
<https://projectnessie.org/tools/hive/>. It will be great to see more
concrete use cases, but I think the Nessie use case is good enough to
justify the decoupling of EXTERNAL and LOCATION.

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
have it. That's why I think Hive-compatibility alone is not a reasonable
use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
clause as Spark does: doc
<https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause
as Spark does: doc
<https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
option: doc
<https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
4. SQL Server also supports CREATE EXTERNAL TABLE: doc
<https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>

> with which Spark claims to be compatible

I don't think Spark ever claims to be 100% Hive compatible. In fact, we
diverged from Hive intentionally in several places, where we think the Hive
behavior was not reasonable and we shouldn't follow it.

On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rb...@netflix.com> wrote:

> how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table.
>
> I think there is some confusion about what Spark has to handle. Regardless
> of what Spark allows as DDL, these tables can exist in a Hive MetaStore
> that Spark connects to, and the general expectation is that Spark doesn’t
> change the meaning of table configuration. There are notable bugs where
> Spark has different behavior, but that is the expectation.
>
> In this particular case, we’re talking about what can be expressed in DDL
> that is sent to an external catalog. Spark could (unwisely) choose to
> disallow some DDL combinations, but the table is implemented through a
> plugin so the interpretation is up to the plugin. Spark has no role in
> choosing how to treat this table, unless it is loaded through Spark’s
> built-in catalog; in which case, see above.
>
> I don’t think Hive compatibility itself is a “use case”.
>
> Why?
>
> Hive is an external database that defines its own behavior and with which
> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
> EXTERNAL supported at all?
>
> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <ho...@pigscanfly.ca> wrote:
>
>>
>>
>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> I don't think Hive compatibility itself is a "use case".
>>>
>> Ok let's add on top of this: I have some hive queries that I want to run
>> on Spark. I believe that makes it a use case.
>>
>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>> to create external tables without user-specified location, so that they can
>>> manage the table directory themselves and implement fancy features.
>>>
>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>>> should clearly document that, EXTERNAL and LOCATION are only applicable for
>>> file-based data sources, and catalog implementation should fail if the
>>> table has EXTERNAL or LOCATION property, but the table provider is not
>>> file-based.
>>>
>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>> an external table. Hive gives warning when you create managed tables with
>>> custom location, which means this behavior is not recommended. Shall we
>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>
>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>
>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>>> LOCATION. Why is this use case not a justification?
>>>>
>>>> Also, the option to keep behavior the same as before is not mutually
>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>> have the same behavior in its catalog. But Spark cannot just choose to
>>>> break compatibility with external systems by deciding when to fail certain
>>>> combinations of DDL options. Choosing not to allow external without
>>>> location when it is valid for Hive prevents building a compatible catalog.
>>>>
>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>> which enables branching and tagging table states across several table
>>>> formats and aims to be compatible with Hive.
>>>>
>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>>
>>>>> > As someone who's had the job of porting different SQL dialects to
>>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>
>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2
>>>>> options we are discussing are:
>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>>>>> with LOCATION (or path option).
>>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>>
>>>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>>>> always safer to keep the behavior the same as before. If we want to change
>>>>> the behavior and follow option 2, we need use cases to justify it.
>>>>>
>>>>> For now, the only use case I see is for Hive compatibility and allow
>>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>>>> cases we are targeting?
>>>>>
>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> As someone who's had the job of porting different SQL dialects to
>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>>>> sense.
>>>>>>
>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> I would summarize both the problem and the current state differently.
>>>>>>>
>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>>>>> breaks compatibility with Hive SQL* because all combinations of
>>>>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>>>>> table with a default location is not allowed by Spark. Note that Spark must
>>>>>>> still handle these tables because it shares a metastore with Hive, which
>>>>>>> can still create them.
>>>>>>>
>>>>>>> Now catalogs can be plugged in, the question is whether to pass the
>>>>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>>>> catalog handling a create command, or to suppress it and apply Spark’s rule
>>>>>>> that LOCATION must be present.
>>>>>>>
>>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL
>>>>>>> however Spark chooses to, but I think it is a poor choice to force
>>>>>>> different behavior on other catalogs.
>>>>>>>
>>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>>>>> when a table is dropped. The ability to keep underlying data files is
>>>>>>> specific to a few catalogs, and whether that is controlled by
>>>>>>> EXTERNAL, the LOCATION clause, or something else is still up to the
>>>>>>> catalog implementation.
>>>>>>>
>>>>>>> I don’t think that there is a good reason to force catalogs to break
>>>>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so the
>>>>>>> catalog can decide how to handle it.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>>>> blocks an important feature that we target for Spark 3.1: unify the CREATE
>>>>>>>> TABLE SQL syntax.
>>>>>>>>
>>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>
>>>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>>>> EXTERNAL can't be specified.
>>>>>>>>
>>>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be
>>>>>>>> specified if LOCATION clause or path option is present. For example, `CREATE
>>>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is
>>>>>>>> no LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>>>>
>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as
>>>>>>>> it was, or we can officially support it.
>>>>>>>>
>>>>>>>> Please let us know your thoughts:
>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>>>> Have you used it in production before? For what use cases?
>>>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>>>>> catalogs like jdbc, cassandra, etc.
>>>>>>>>
>>>>>>>> For more details, please refer to the long discussion in
>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wenchen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Official support of CREATE EXTERNAL TABLE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table.

I think there is some confusion about what Spark has to handle. Regardless
of what Spark allows as DDL, these tables can exist in a Hive MetaStore
that Spark connects to, and the general expectation is that Spark doesn’t
change the meaning of table configuration. There are notable bugs where
Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL
that is sent to an external catalog. Spark could (unwisely) choose to
disallow some DDL combinations, but the table is implemented through a
plugin so the interpretation is up to the plugin. Spark has no role in
choosing how to treat this table, unless it is loaded through Spark’s
built-in catalog; in which case, see above.
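
For reference, the built-in catalog's current treatment of LOCATION without
EXTERNAL is easy to reproduce, e.g. in spark-shell where `spark` is predefined
(a quick sketch; the table name and path are made up):

    spark.sql("CREATE TABLE t_loc (id INT) USING parquet LOCATION '/tmp/t_loc'")
    spark.sql("DESCRIBE TABLE EXTENDED t_loc").show(truncate = false)
    // The Type row is expected to report EXTERNAL even though the keyword was never
    // written, which is exactly the inference being discussed.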

I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which
Spark claims to be compatible. If Hive isn’t a valid use case, then why is
EXTERNAL supported at all?

On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <ho...@pigscanfly.ca> wrote:

>
>
> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> I don't think Hive compatibility itself is a "use case".
>>
> Ok let's add on top of this: I have some hive queries that I want to run
> on Spark. I believe that makes it a use case.
>
>> The Nessie <https://projectnessie.org/tools/hive/> example you mentioned
>> is a reasonable use case to me: some frameworks/applications want to create
>> external tables without user-specified location, so that they can manage
>> the table directory themselves and implement fancy features.
>>
>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>> should clearly document that, EXTERNAL and LOCATION are only applicable for
>> file-based data sources, and catalog implementation should fail if the
>> table has EXTERNAL or LOCATION property, but the table provider is not
>> file-based.
>>
>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table. Hive gives warning when you create managed tables with
>> custom location, which means this behavior is not recommended. Shall we
>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>
>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>
>>> The keyword came from Hive and we all agree that a Hive catalog with
>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>> LOCATION. Why is this use case not a justification?
>>>
>>> Also, the option to keep behavior the same as before is not mutually
>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>>> the same behavior in its catalog. But Spark cannot just choose to break
>>> compatibility with external systems by deciding when to fail certain
>>> combinations of DDL options. Choosing not to allow external without
>>> location when it is valid for Hive prevents building a compatible catalog.
>>>
>>> There are many reasons to build a Hive-compatible catalog. A great
>>> recent example is Nessie <https://projectnessie.org/tools/hive/>, which
>>> enables branching and tagging table states across several table formats and
>>> aims to be compatible with Hive.
>>>
>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> > As someone who's had the job of porting different SQL dialects to
>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>
>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>>>> we are discussing are:
>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>>>> with LOCATION (or path option).
>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>
>>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>>> always safer to keep the behavior the same as before. If we want to change
>>>> the behavior and follow option 2, we need use cases to justify it.
>>>>
>>>> For now, the only use case I see is for Hive compatibility and allow
>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>>> cases we are targeting?
>>>>
>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> As someone who's had the job of porting different SQL dialects to
>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>>> sense.
>>>>>
>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I would summarize both the problem and the current state differently.
>>>>>>
>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>>>> breaks compatibility with Hive SQL* because all combinations of
>>>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>>>> table with a default location is not allowed by Spark. Note that Spark must
>>>>>> still handle these tables because it shares a metastore with Hive, which
>>>>>> can still create them.
>>>>>>
>>>>>> Now catalogs can be plugged in, the question is whether to pass the
>>>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>>> catalog handling a create command, or to suppress it and apply Spark’s rule
>>>>>> that LOCATION must be present.
>>>>>>
>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL however
>>>>>> Spark chooses to, but I think it is a poor choice to force different
>>>>>> behavior on other catalogs.
>>>>>>
>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>>>> when a table is dropped. The ability to keep underlying data files is
>>>>>> specific to a few catalogs, and whether that is controlled by
>>>>>> EXTERNAL, the LOCATION clause, or something else is still up to the
>>>>>> catalog implementation.
>>>>>>
>>>>>> I don’t think that there is a good reason to force catalogs to break
>>>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so the
>>>>>> catalog can decide how to handle it.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to start a discussion thread about this topic, as it blocks
>>>>>>> an important feature that we target for Spark 3.1: unify the CREATE TABLE
>>>>>>> SQL syntax.
>>>>>>>
>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>
>>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>>> EXTERNAL can't be specified.
>>>>>>>
>>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be
>>>>>>> specified if LOCATION clause or path option is present. For example, `CREATE
>>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>>>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>>>
>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as
>>>>>>> it was, or we can officially support it.
>>>>>>>
>>>>>>> Please let us know your thoughts:
>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>>> Have you used it in production before? For what use cases?
>>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>>>> catalogs like jdbc, cassandra, etc.
>>>>>>>
>>>>>>> For more details, please refer to the long discussion in
>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Posted by Holden Karau <ho...@pigscanfly.ca>.
On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cl...@gmail.com> wrote:

> I don't think Hive compatibility itself is a "use case".
>
OK, let's add on top of this: I have some Hive queries that I want to run on
Spark. I believe that makes it a use case.

> The Nessie <https://projectnessie.org/tools/hive/> example you mentioned
> is a reasonable use case to me: some frameworks/applications want to create
> external tables without user-specified location, so that they can manage
> the table directory themselves and implement fancy features.
>
> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
> should clearly document that, EXTERNAL and LOCATION are only applicable for
> file-based data sources, and catalog implementation should fail if the
> table has EXTERNAL or LOCATION property, but the table provider is not
> file-based.
>
> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table. Hive gives warning when you create managed tables with
> custom location, which means this behavior is not recommended. Shall we
> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>
> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>
>> The keyword came from Hive and we all agree that a Hive catalog with Hive
>> behavior can’t be implemented if Spark chooses to couple this with
>> LOCATION. Why is this use case not a justification?
>>
>> Also, the option to keep behavior the same as before is not mutually
>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>> the same behavior in its catalog. But Spark cannot just choose to break
>> compatibility with external systems by deciding when to fail certain
>> combinations of DDL options. Choosing not to allow external without
>> location when it is valid for Hive prevents building a compatible catalog.
>>
>> There are many reasons to build a Hive-compatible catalog. A great recent
>> example is Nessie <https://projectnessie.org/tools/hive/>, which enables
>> branching and tagging table states across several table formats and aims to
>> be compatible with Hive.
>>
>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> > As someone who's had the job of porting different SQL dialects to
>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>
>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>>> we are discussing are:
>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>>> with LOCATION (or path option).
>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>
>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>> always safer to keep the behavior the same as before. If we want to change
>>> the behavior and follow option 2, we need use cases to justify it.
>>>
>>> For now, the only use case I see is for Hive compatibility and allow
>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>> cases we are targeting?
>>>
>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> As someone who's had the job of porting different SQL dialects to
>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>> sense.
>>>>
>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I would summarize both the problem and the current state differently.
>>>>>
>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>>> breaks compatibility with Hive SQL* because all combinations of
>>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>>> table with a default location is not allowed by Spark. Note that Spark must
>>>>> still handle these tables because it shares a metastore with Hive, which
>>>>> can still create them.
>>>>>
>>>>> Now catalogs can be plugged in, the question is whether to pass the
>>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>> catalog handling a create command, or to suppress it and apply Spark’s rule
>>>>> that LOCATION must be present.
>>>>>
>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL however
>>>>> Spark chooses to, but I think it is a poor choice to force different
>>>>> behavior on other catalogs.
>>>>>
>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>>> when a table is dropped. The ability to keep underlying data files is
>>>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>>>> the LOCATION clause, or something else is still up to the catalog
>>>>> implementation.
>>>>>
>>>>> I don’t think that there is a good reason to force catalogs to break
>>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so the
>>>>> catalog can decide how to handle it.
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to start a discussion thread about this topic, as it blocks
>>>>>> an important feature that we target for Spark 3.1: unify the CREATE TABLE
>>>>>> SQL syntax.
>>>>>>
>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>
>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>> EXTERNAL can't be specified.
>>>>>>
>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified
>>>>>> if LOCATION clause or path option is present. For example, `CREATE
>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>>
>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>>>> was, or we can officially support it.
>>>>>>
>>>>>> Please let us know your thoughts:
>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>> Have you used it in production before? For what use cases?
>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>>> catalogs like jdbc, cassandra, etc.
>>>>>>
>>>>>> For more details, please refer to the long discussion in
>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Official support of CREATE EXTERNAL TABLE

Posted by Wenchen Fan <cl...@gmail.com>.
I don't think Hive compatibility itself is a "use case". The Nessie
<https://projectnessie.org/tools/hive/> example you mentioned is a
reasonable use case to me: some frameworks/applications want to create
external tables without a user-specified location, so that they can manage
the table directory themselves and implement fancy features.

That said, I now agree it's better to decouple EXTERNAL and LOCATION. We
should clearly document that EXTERNAL and LOCATION are only applicable to
file-based data sources, and that a catalog implementation should fail if
the table has the EXTERNAL or LOCATION property but the table provider is
not file-based.
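
As a sketch of what that check could look like inside a v2 catalog (the
property keys "external", "location", "provider" and the whitelist of
file-based providers below are illustrative assumptions, not existing Spark
constants):

    // Illustrative helper a catalog's createTable could call before creating
    // the table. Property keys and the provider whitelist are assumptions.
    def validateExternalAndLocation(properties: java.util.Map[String, String]): Unit = {
      val fileBasedProviders = Set("parquet", "orc", "json", "csv", "text", "avro")
      val provider = Option(properties.get("provider")).map(_.toLowerCase)
      val hasExternal = properties.containsKey("external")
      val hasLocation = properties.containsKey("location")
      if ((hasExternal || hasLocation) && !provider.exists(fileBasedProviders.contains)) {
        throw new IllegalArgumentException(
          "EXTERNAL/LOCATION is only supported for file-based sources, " +
            s"but the provider is: ${provider.getOrElse("<none>")}")
      }
    }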

BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table. Hive gives a warning when you create a managed table with a
custom location, which means this behavior is not recommended. Shall we
"infer" EXTERNAL from LOCATION even though it's not Hive compatible?

On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>
> The keyword came from Hive and we all agree that a Hive catalog with Hive
> behavior can’t be implemented if Spark chooses to couple this with
> LOCATION. Why is this use case not a justification?
>
> Also, the option to keep behavior the same as before is not mutually
> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
> the same behavior in its catalog. But Spark cannot just choose to break
> compatibility with external systems by deciding when to fail certain
> combinations of DDL options. Choosing not to allow external without
> location when it is valid for Hive prevents building a compatible catalog.
>
> There are many reasons to build a Hive-compatible catalog. A great recent
> example is Nessie <https://projectnessie.org/tools/hive/>, which enables
> branching and tagging table states across several table formats and aims to
> be compatible with Hive.
>
> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> > As someone who's had the job of porting different SQL dialects to
>> Spark, I'm also very much in favor of keeping EXTERNAL
>>
>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>> we are discussing are:
>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>> with LOCATION (or path option).
>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>
>> I'm fine with option 2 if there are reasonable use cases. I think it's
>> always safer to keep the behavior the same as before. If we want to change
>> the behavior and follow option 2, we need use cases to justify it.
>>
>> For now, the only use case I see is for Hive compatibility and allow
>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>> cases we are targeting?
>>
>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca> wrote:
>>
>>> As someone who's had the job of porting different SQL dialects to Spark,
>>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>> sense.
>>>
>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I would summarize both the problem and the current state differently.
>>>>
>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>> breaks compatibility with Hive SQL* because all combinations of
>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>> table with a default location is not allowed by Spark. Note that Spark must
>>>> still handle these tables because it shares a metastore with Hive, which
>>>> can still create them.
>>>>
>>>> Now catalogs can be plugged in, the question is whether to pass the
>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>>> handling a create command, or to suppress it and apply Spark’s rule that
>>>> LOCATION must be present.
>>>>
>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL however
>>>> Spark chooses to, but I think it is a poor choice to force different
>>>> behavior on other catalogs.
>>>>
>>>> Wenchen has also argued that the purpose of this is to standardize
>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>> when a table is dropped. The ability to keep underlying data files is
>>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>>> the LOCATION clause, or something else is still up to the catalog
>>>> implementation.
>>>>
>>>> I don’t think that there is a good reason to force catalogs to break
>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so the
>>>> catalog can decide how to handle it.
>>>>
>>>> rb
>>>>
>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to start a discussion thread about this topic, as it blocks
>>>>> an important feature that we target for Spark 3.1: unify the CREATE TABLE
>>>>> SQL syntax.
>>>>>
>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>>>> feature in Spark for Hive compatibility.
>>>>>
>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>> EXTERNAL can't be specified.
>>>>>
>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified
>>>>> if LOCATION clause or path option is present. For example, `CREATE
>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>>>
>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>>> was, or we can officially support it.
>>>>>
>>>>> Please let us know your thoughts:
>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>> Have you used it in production before? For what use cases?
>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>> catalogs like jdbc, cassandra, etc.
>>>>>
>>>>> For more details, please refer to the long discussion in
>>>>> https://github.com/apache/spark/pull/28026
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Official support of CREATE EXTERNAL TABLE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive
behavior can’t be implemented if Spark chooses to couple this with LOCATION.
Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually
exclusive with passing EXTERNAL to catalogs. Spark can continue to have the
same behavior in its catalog. But Spark cannot just choose to break
compatibility with external systems by deciding when to fail certain
combinations of DDL options. Choosing not to allow EXTERNAL without
LOCATION when that combination is valid for Hive prevents building a
compatible catalog.

There are many reasons to build a Hive-compatible catalog. A great recent
example is Nessie <https://projectnessie.org/tools/hive/>, which enables
branching and tagging table states across several table formats and aims to
be compatible with Hive.

On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cl...@gmail.com> wrote:

> > As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL
>
> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
> are discussing are:
> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists with
> LOCATION (or path option).
> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>
> I'm fine with option 2 if there are reasonable use cases. I think it's
> always safer to keep the behavior the same as before. If we want to change
> the behavior and follow option 2, we need use cases to justify it.
>
> For now, the only use case I see is for Hive compatibility and allow
> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
> cases we are targeting?
>
> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca> wrote:
>
>> As someone who's had the job of porting different SQL dialects to Spark,
>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>> suggestion of leaving it up to the catalogs on how to handle this makes
>> sense.
>>
>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I would summarize both the problem and the current state differently.
>>>
>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>>> LOCATION are valid in Hive, but creating an external table with a
>>> default location is not allowed by Spark. Note that Spark must still handle
>>> these tables because it shares a metastore with Hive, which can still
>>> create them.
>>>
>>> Now catalogs can be plugged in, the question is whether to pass the fact
>>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>> handling a create command, or to suppress it and apply Spark’s rule that
>>> LOCATION must be present.
>>>
>>> If it is not passed to the catalog, then a Hive catalog cannot implement
>>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>>> chooses to, but I think it is a poor choice to force different behavior on
>>> other catalogs.
>>>
>>> Wenchen has also argued that the purpose of this is to standardize
>>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>>> goal. Whether to physically delete data is a choice that is up to the
>>> catalog. Some catalogs have no “external” concept and will always drop data
>>> when a table is dropped. The ability to keep underlying data files is
>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>> the LOCATION clause, or something else is still up to the catalog
>>> implementation.
>>>
>>> I don’t think that there is a good reason to force catalogs to break
>>> compatibility with Hive SQL, while making it appear as though DDL is
>>> compatible. Because removing EXTERNAL would be a breaking change to the
>>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>>> catalog can decide how to handle it.
>>>
>>> rb
>>>
>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to start a discussion thread about this topic, as it blocks an
>>>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>>>> syntax.
>>>>
>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>>> feature in Spark for Hive compatibility.
>>>>
>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>> TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
>>>> can't be specified.
>>>>
>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified
>>>> if LOCATION clause or path option is present. For example, `CREATE
>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>>
>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>> was, or we can officially support it.
>>>>
>>>> Please let us know your thoughts:
>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>>>> you used it in production before? For what use cases?
>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>> TABLE? It seems to me that it only makes sense for file source, as the
>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>> catalogs like jdbc, cassandra, etc.
>>>>
>>>> For more details, please refer to the long discussion in
>>>> https://github.com/apache/spark/pull/28026
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Posted by Wenchen Fan <cl...@gmail.com>.
> As someone who's had the job of porting different SQL dialects to Spark,
I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The two options
we are discussing are:
1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with
LOCATION (or a path option).
2. Always allow EXTERNAL, and decouple it from LOCATION.

I'm fine with option 2 if there are reasonable use cases. I think it's
always safer to keep the behavior the same as before. If we want to change
the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is Hive compatibility, i.e. allowing
EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
cases we are targeting?
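
To make the difference concrete, consider a statement with EXTERNAL but no
LOCATION (the table name and column are made up):

    // Under option 1 this keeps failing, since EXTERNAL appears without a
    // LOCATION clause or path option. Under option 2 the statement is accepted
    // and the EXTERNAL flag is handed to the target catalog to interpret.
    spark.sql("CREATE EXTERNAL TABLE t (i INT) STORED AS parquet")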

On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <ho...@pigscanfly.ca> wrote:

> As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
> suggestion of leaving it up to the catalogs on how to handle this makes
> sense.
>
> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I would summarize both the problem and the current state differently.
>>
>> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
>> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>> LOCATION are valid in Hive, but creating an external table with a
>> default location is not allowed by Spark. Note that Spark must still handle
>> these tables because it shares a metastore with Hive, which can still
>> create them.
>>
>> Now catalogs can be plugged in, the question is whether to pass the fact
>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>> handling a create command, or to suppress it and apply Spark’s rule that
>> LOCATION must be present.
>>
>> If it is not passed to the catalog, then a Hive catalog cannot implement
>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>> chooses to, but I think it is a poor choice to force different behavior on
>> other catalogs.
>>
>> Wenchen has also argued that the purpose of this is to standardize
>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>> goal. Whether to physically delete data is a choice that is up to the
>> catalog. Some catalogs have no “external” concept and will always drop data
>> when a table is dropped. The ability to keep underlying data files is
>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>> the LOCATION clause, or something else is still up to the catalog
>> implementation.
>>
>> I don’t think that there is a good reason to force catalogs to break
>> compatibility with Hive SQL, while making it appear as though DDL is
>> compatible. Because removing EXTERNAL would be a breaking change to the
>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>> catalog can decide how to handle it.
>>
>> rb
>>
>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start a discussion thread about this topic, as it blocks an
>>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>>> syntax.
>>>
>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>> feature in Spark for Hive compatibility.
>>>
>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>> TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
>>> can't be specified.
>>>
>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>>> LOCATION clause or path option is present. For example, `CREATE
>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>
>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>> was, or we can officially support it.
>>>
>>> Please let us know your thoughts:
>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>>> you used it in production before? For what use cases?
>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>> TABLE? It seems to me that it only makes sense for file source, as the
>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>> catalogs like jdbc, cassandra, etc.
>>>
>>> For more details, please refer to the long discussion in
>>> https://github.com/apache/spark/pull/28026
>>>
>>> Thanks,
>>> Wenchen
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Re: Official support of CREATE EXTERNAL TABLE

Posted by Holden Karau <ho...@pigscanfly.ca>.
As someone who's had the job of porting different SQL dialects to Spark,
I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
suggestion of leaving it up to the catalogs to decide how to handle this
makes sense.

On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>> syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file source, as the table
>> directory can be managed. I'm not sure how to interpret EXTERNAL in
>> catalogs like jdbc, cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Official support of CREATE EXTERNAL TABLE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
SQL, but Spark’s built-in catalog doesn’t allow creating a table with
EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
compatibility with Hive SQL* because all combinations of EXTERNAL and
LOCATION are valid in Hive, but creating an external table with a default
location is not allowed by Spark. Note that Spark must still handle these
tables because it shares a metastore with Hive, which can still create them.

Now that catalogs can be plugged in, the question is whether to pass the fact
that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling
a create command, or to suppress it and apply Spark’s rule that LOCATION
must be present.
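
Concretely, "passing it" could be as small as surfacing the keyword in the
properties map that the v2 createTable call already receives. A sketch (the
property key "external" and this helper are assumptions, not existing Spark
code):

    // Instead of enforcing a rule itself, Spark would simply forward the flag.
    def withExternalFlag(
        baseProperties: Map[String, String],
        statementHadExternal: Boolean): Map[String, String] = {
      if (statementHadExternal) baseProperties + ("external" -> "true")
      else baseProperties
    }
    // A Hive-compatible catalog can honor the flag, the built-in catalog can
    // keep its current restriction, and catalogs with no "external" concept
    // can simply ignore it.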

If it is not passed to the catalog, then a Hive catalog cannot implement
the behavior of Hive SQL, even though Spark added the keyword for Hive
compatibility. The Spark catalog can interpret EXTERNAL however Spark
chooses to, but I think it is a poor choice to force different behavior on
other catalogs.

Wenchen has also argued that the purpose of this is to standardize behavior
across catalogs. But hiding EXTERNAL would not accomplish that goal.
Whether to physically delete data is a choice that is up to the catalog.
Some catalogs have no “external” concept and will always drop data when a
table is dropped. The ability to keep underlying data files is specific to
a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION
clause, or something else is still up to the catalog implementation.

I don’t think that there is a good reason to force catalogs to break
compatibility with Hive SQL, while making it appear as though DDL is
compatible. Because removing EXTERNAL would be a breaking change to the SQL
parser, I think the best option is to pass it to v2 catalogs so the catalog
can decide how to handle it.

rb

On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cl...@gmail.com> wrote:

> Hi all,
>
> I'd like to start a discussion thread about this topic, as it blocks an
> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
> syntax.
>
> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
> feature in Spark for Hive compatibility.
>
> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
> ... USING parquet`, the parser fails and tells you that EXTERNAL can't be
> specified.
>
> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
> clause or path option. This is not 100% Hive compatible.
>
> As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal
> with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was,
> or we can officially support it.
>
> Please let us know your thoughts:
> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
> you used it in production before? For what use cases?
> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
> It seems to me that it only makes sense for file source, as the table
> directory can be managed. I'm not sure how to interpret EXTERNAL in
> catalogs like jdbc, cassandra, etc.
>
> For more details, please refer to the long discussion in
> https://github.com/apache/spark/pull/28026
>
> Thanks,
> Wenchen
>


-- 
Ryan Blue
Software Engineer
Netflix