You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/10/04 15:56:17 UTC

Spark SQL parser and DDL

Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the
proposed additions to drop, rename, and alter columns. The most recent
update I’ve added is to allow transformation functions in the PARTITION BY
clause to pass to v2 data sources. This allows sources like Iceberg to do
partition pruning internally.

One of the difficulties has been that the SQL parser is coupled to the
current logical plans and includes details that are specific to them. For
example, data source table creation makes determinations like the EXTERNAL
keyword is not allowed and instead the mode (external or managed) is set
depending on whether a path is set. It also translates IF NOT EXISTS into a
SaveMode and introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the
parser to v2 plans requires interpreting these alterations and not the
original SQL. Another consequence is that there are two parsers: AstBuilder
in spark-catalyst and SparkSqlParser in spark-sql (core) because not all of
the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans
that carry the SQL options as they were parsed, and then convert those
plans to specific implementations depending on the tables that are used.
That makes support for v2 plans much cleaner by converting from a generic
SQL plan instead of creating a v1 plan that assumes a data source table and
then converting that to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY
transformations. Instead of needing to add transformations to the
CatalogTable metadata that’s used everywhere, this only required a change
to the rule that converts from the parsed SQL plan to CatalogTable-based v1
plans. It is also cleaner to have the logic for converting to CatalogTable
in DataSourceAnalysis instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?
-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark SQL parser and DDL

Posted by Felix Cheung <fe...@hotmail.com>.

Sounds like a good idea?

Would this be a step in the direction of supporting variation of the SQL dialect, too?

________________________________
From: Ryan Blue <rb...@netflix.com.invalid>
Sent: Thursday, October 4, 2018 8:56 AM
To: Spark Dev List
Subject: Spark SQL parser and DDL

Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the proposed additions to drop, rename, and alter columns. The most recent update I’ve added is to allow transformation functions in the PARTITION BY clause to pass to v2 data sources. This allows sources like Iceberg to do partition pruning internally.

One of the difficulties has been that the SQL parser is coupled to the current logical plans and includes details that are specific to them. For example, data source table creation makes determinations like the EXTERNAL keyword is not allowed and instead the mode (external or managed) is set depending on whether a path is set. It also translates IF NOT EXISTS into a SaveMode and introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the parser to v2 plans requires interpreting these alterations and not the original SQL. Another consequence is that there are two parsers: AstBuilder in spark-catalyst and SparkSqlParser in spark-sql (core) because not all of the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans that carry the SQL options as they were parsed, and then convert those plans to specific implementations depending on the tables that are used. That makes support for v2 plans much cleaner by converting from a generic SQL plan instead of creating a v1 plan that assumes a data source table and then converting that to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY transformations. Instead of needing to add transformations to the CatalogTable metadata that’s used everywhere, this only required a change to the rule that converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?

--
Ryan Blue
Software Engineer
Netflix