Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/30 23:11:00 UTC

[GitHub] [spark] rdblue commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax (WIP)

URL: https://github.com/apache/spark/pull/28026#issuecomment-606299173
 
 
   > I don't know who marked comments as resolved
   
   That was me, because I thought the answers were fairly straightforward and definitive. We can still discuss them further since that evidently wasn't the case.
   
   > If I skimmed the code correctly, this "requires" end users to add USING hive even when they add a Hive-specific create table clause.
   
   That's not quite right.
   
   The parser now creates a `CreateTableStatement` that contains everything that was parsed from the SQL string, with minimal interpretation. That is what the user requested, assuming that request was well-formed -- meaning that it doesn't have duplicate clauses or mixed partition fields.
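   
   To make "well-formed" concrete, here is a small sketch with illustrative SQL strings (the table and column names are made up, and the last case is my reading of "mixed partition fields"):
   
   ```scala
   // Well-formed: each clause appears at most once.
   val ok = """
     CREATE TABLE db.events (id BIGINT, day STRING)
     USING parquet
     PARTITIONED BY (day)
     COMMENT 'event log'"""
   
   // Not well-formed: a duplicated clause (two COMMENTs), which the parser
   // should reject rather than silently picking one.
   val duplicateClause = """
     CREATE TABLE db.events (id BIGINT)
     USING parquet
     COMMENT 'one'
     COMMENT 'two'"""
   
   // Not well-formed: PARTITIONED BY both references an existing column
   // and defines a new one, mixing the two partition styles.
   val mixedPartitions = """
     CREATE TABLE db.events (id BIGINT, day STRING)
     PARTITIONED BY (day, region STRING)"""
   ```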
   
   The next step is to convert that statement to a plan. At the moment, this only converts to v2 plans because we no longer use v1 and this was ported from our internal version. The interpretation of that statement for v1 -- converting to v1 `CreateTable` -- has not yet been added. I plan to implement it like this (a rough sketch follows the list):
   * If `provider` is present from USING, set the `CatalogTable` provider and validate that no `SerdeInfo` is set.
   * If `SerdeInfo` is present, validate that `provider` was not set and use a Hive `CatalogTable`.
   * If neither `SerdeInfo` nor `provider` is set, use `SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED` or a similar setting to set a default provider or default serde properties.
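   
   A self-contained sketch of that branching (the case classes, helper shapes, and string results here are stand-ins for illustration, not Spark's real internals):
   
   ```scala
   // Hypothetical stand-ins for the parsed statement pieces.
   case class SerdeInfo(serde: Option[String], storedAs: Option[String])
   case class CreateTableStatement(
       name: String,
       provider: Option[String],     // from USING
       serdeInfo: Option[SerdeInfo]) // from STORED AS / ROW FORMAT SERDE
   
   // hiveByDefault plays the role of
   // SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.
   def convertToV1(stmt: CreateTableStatement, hiveByDefault: Boolean): String =
     (stmt.provider, stmt.serdeInfo) match {
       case (Some(p), None) => s"datasource CatalogTable with provider $p"
       case (None, Some(_)) => "Hive CatalogTable with the parsed serde info"
       case (None, None) =>
         if (hiveByDefault) "Hive CatalogTable with default serde properties"
         else "datasource CatalogTable with the default provider"
       case (Some(_), Some(_)) =>
         throw new IllegalArgumentException(
           "Cannot mix USING with STORED AS or SERDE clauses")
     }
   ```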
   
   Speaking of well-formed, I think it would be reasonable to require that either a provider or serde info is present, but not both. That would raise an exception if both `USING` and `STORED AS` or `SERDE` clauses were used (the last case in the sketch above), and it would let us skip the checks listed above. What do you think?
   
   Coming back to the current implementation for the v2 interface: Spark is passing through everything that was parsed because Spark doesn't know what kind of table the catalog is going to create. The v2 catalog that we use can create Hive tables, Spark datasource tables, and Iceberg tables, so all of the information needs to be passed through -- again, with the requirement that the SQL was well-formed.
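   
   As a sketch of what that pass-through enables on the catalog side, here is a self-contained dispatch function (the property keys and table kinds are made up for illustration; the real keys are an implementation detail):
   
   ```scala
   // Sketch of the decision a multi-format v2 catalog might make once Spark
   // hands it everything from the parsed statement. Keys are illustrative.
   def chooseTableKind(properties: Map[String, String]): String =
     if (properties.contains("hive.serde") || properties.contains("hive.stored-as"))
       "hive"                                    // STORED AS / SERDE came through
     else properties.get("provider") match {
       case Some("iceberg") => "iceberg"         // USING iceberg
       case Some(p)         => s"datasource: $p" // USING <other provider>
       case None            => "catalog default" // neither clause present
     }
   ```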
