Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/02/02 18:39:12 UTC

DataSourceV2: support for named tables

There are two main ways to load tables in Spark: by name (db.table) and by
a path. Unfortunately, the integration for DataSourceV2 has no support for
identifying tables by name.

I propose supporting the use of TableIdentifier, which is the standard way
to pass around table names.

The reason I think we should do this is that it makes it easy to support
more ways of working with DataSourceV2 tables. SQL statements and parts of the
DataFrameReader and DataFrameWriter APIs that use table names create
UnresolvedRelation instances that wrap an unresolved TableIdentifier.
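
For example, both of these end up as an UnresolvedRelation wrapping an
unresolved TableIdentifier("t", Some("db")):

    spark.table("db.t")
    spark.sql("SELECT * FROM db.t")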

Once we add support for passing TableIdentifier to DataSourceV2Relation,
about all we need to enable these code paths is a resolution rule. For
that rule, we could easily designate a default data source that handles
named tables.
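
As a rough sketch, the rule could look something like this. It assumes
DataSourceV2Relation can be built from a source plus an identifier; the
create() helper and the way the default source is chosen are placeholders,
not existing API:

    import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.sources.v2.DataSourceV2

    case class ResolveNamedV2Tables(defaultSource: DataSourceV2)
        extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
        case u: UnresolvedRelation =>
          // A real rule would first check that the identifier actually
          // refers to a table the default source can handle; create() is
          // a placeholder for however the v2 relation gets built.
          DataSourceV2Relation.create(defaultSource, u.tableIdentifier)
      }
    }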

This is what we’re doing in our Spark build, and we have DataSourceV2
tables working great through SQL. (Part of this depends on the logical plan
changes from my previous email to ensure inserts are properly resolved.)

In the long term, I think we should update how we parse tables so that
TableIdentifier can contain a source in addition to a database/context and
a table name. That would allow us to integrate new sources fairly
seamlessly, without needing a rather redundant SQL create statement like
this:

CREATE TABLE database.name USING source OPTIONS (table 'database.name')
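
Concretely, I'm imagining something along these lines (the source field
is invented here, purely for illustration):

    // Today: TableIdentifier(table: String, database: Option[String])
    // Possible extension, so `source.db.table` can be parsed directly:
    case class TableIdentifier(
        table: String,
        database: Option[String] = None,
        source: Option[String] = None)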

Also, I think we should pass TableIdentifier to DataSourceV2Relation,
rather than going with Wenchen’s suggestion that we pass the table name as
a string property, “table”. My rationale is that the new API shouldn’t leak
its internal details to other parts of the planner.

If we were to convert TableIdentifier to a “table” property wherever
DataSourceV2Relation is created, we would create several places that must
stay in sync with the same convention. On the other hand, passing TableIdentifier
to DataSourceV2Relation and relying on the relation to correctly set the
options passed to readers and writers minimizes the number of places that
conversion needs to happen.
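
That is, the identifier-to-options conversion would live in exactly one
place, something like this sketch (the option keys here are just an
example, not a settled convention):

    import org.apache.spark.sql.catalyst.TableIdentifier

    // The relation would call this in one spot when handing options to
    // a reader or writer factory.
    def tableOptions(
        ident: TableIdentifier,
        options: Map[String, String]): Map[String, String] =
      options +
        ("table" -> ident.table) ++
        ident.database.map("database" -> _)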

rb
-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2: support for named tables

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don’t have a good answer for that yet. My initial motivation here is
mainly to get consensus around this:

   - DSv2 should support table names through SQL and the API, and
   - It should use the existing classes in the logical plan (i.e.,
   TableIdentifier)

To contrast, I think Wenchen is arguing that the new API should pass string
options to data sources and let the implementations decide what to do. (I
hope I’ve characterized that correctly.)

We’re storing references to our Iceberg tables
<https://github.com/Netflix/iceberg#about-iceberg> in the metastore, so I
added a simple rule to resolve TableIdentifier to
DataSourceV2Relation(IcebergSource). Everything worked fine for us after
that, but it isn’t a solution that will work in Spark upstream.

I think the right long-term solution is to mirror what Presto does. It adds
an optional catalog context to table identifiers: catalog.database.table.
Catalogs are defined in configuration and can share implementations, so you
can have two catalogs using the JDBC source or two Hive metastores.
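
For example, two Hive catalogs could be defined with configuration like
this (the property names and classes are invented, just to illustrate the
model):

    spark.sql.catalog.prod      com.example.spark.HiveCatalog
    spark.sql.catalog.prod.uri  thrift://prod-metastore:9083
    spark.sql.catalog.test      com.example.spark.HiveCatalog
    spark.sql.catalog.test.uri  thrift://test-metastore:9083

Then SELECT * FROM test.db.table would go through the test metastore
without any per-table configuration.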

The Presto model would require a bit of work, but I think it is going to be
necessary anyway before DSv2 is finished and really usable. We need to
separate table creation from writes to ensure consistent behavior, while
making it easier to implement the v2 write API. Otherwise, we will end up
with whatever semantics each data source implementation chooses: overwrite
might mean recreate the entire table, or might mean overwrite partitions.

rb

On Fri, Feb 2, 2018 at 4:11 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> I am definitely in favor of first-class / consistent support for tables
> and data sources.
>
> One thing that is not clear to me from this proposal is exactly what the
> interfaces are between:
>  - Spark
>  - A (The?) metastore
>  - A data source
>
> If we pass in the table identifier, is the data source then responsible
> for talking directly to the metastore? Is that what we want? (I'm not sure)


-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2: support for named tables

Posted by Michael Armbrust <mi...@databricks.com>.
I am definitely in favor of first-class / consistent support for tables and
data sources.

One thing that is not clear to me from this proposal is exactly what the
interfaces are between:
 - Spark
 - A (The?) metastore
 - A data source

If we pass in the table identifier, is the data source then responsible
for talking directly to the metastore? Is that what we want? (I'm not sure)
