You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2018/11/26 22:54:16 UTC

DataSourceV2 community sync #3

Hi everyone,

I just sent out an invite for the next DSv2 community sync for Wednesday,
28 Nov at 5PM PST.

We have a few topics left over from last time to cover. A few people wanted
to cover catalog APIs, so I put two items on the agenda:

   - The TableCatalog proposal (and other catalog APIs)
   - Using CatalogTableIdentifier to separate v1 and v2 code paths and
   avoid unintended behavior changes

As I noted in the summary last time, please send topics ahead of time so we
can get started more quickly.

If you would like to be added to the google hangout invite, please let me
know and I’ll add you. Thanks!

rb
-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.

Thank you Ryan and Xiao – sharing all this info really gives a very good insight!

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, December 3, 2018 at 12:05 PM
To: "Thakrar, Jayesh" <jt...@conversantmedia.com>
Cc: Xiao Li <ga...@gmail.com>, Spark Dev List <de...@spark.apache.org>
Subject: Re: DataSourceV2 community sync #3

Jayesh,

I don’t think this need is very narrow.

To have reliable behavior for CTAS, you need to:
1.       Check whether a table exists and fail. Right now, it is up to the source whether to continue with the write if the table already exists or to throw an exception, which is unreliable across sources.
2.       Create a table if it doesn’t exist.
3.       Drop the table if writing failed. In the current implementation, this can’t be done reliably because #1 is unreliable. So a failed CTAS has a side-effect that the table is created in some cases and a subsequent retry can fail because the table exists.

Leaving these operations up to the read/write API is why behavior isn’t consistent today. It also increases the amount of work that a source needs to do and mixes concerns (what to do in a write when the table doesn’t exist). Spark is going to be a lot more predictable if we decompose the behavior of these operations into create, drop, write, etc.

And in addition to CTAS, we want these operations to be exposed for sources. If Spark can create a table, why wouldn’t you be able to run DROP TABLE to remove it?

Last, Spark must be able to interact with the source of truth for tables. If Spark can’t create a table in Cassandra, it should reject a CTAS operation.

On Mon, Dec 3, 2018 at 9:52 AM Thakrar, Jayesh <jt...@conversantmedia.com>> wrote:
Thank you Xiao – I was wondering what was the motivation for the catalog.
If CTAS is the only candidate, would it suffice to have that as part of the data source interface only?

If we look at BI, ETL and reporting tools which interface with many tables from different data sources at the same time, it makes sense to have a metadata catalog as the catalog is used to “design” the work for that tool (e.g. ETL processing unit, etc). Furthermore, the catalog serves as a data mapping to map external data types to the tool’s data types.

Is the vision to move in that direction for Spark with the catalog support/feature?
Also, is the vision to also incorporate the “options” specified for the data source into the catalog too?
That may be helpful in some situations (e.g. the JDBC connect string being available from the catalog).
From: Xiao Li <ga...@gmail.com>>
Date: Monday, December 3, 2018 at 10:44 AM
To: "Thakrar, Jayesh" <jt...@conversantmedia.com>>
Cc: Ryan Blue <rb...@netflix.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <de...@spark.apache.org>>
Subject: Re: DataSourceV2 community sync #3

Hi, Jayesh,

This is a good question. Spark is a unified analytics engine for various data sources. We are able to get the table schema from the underlying data sources via our data source APIs. Thus, it resolves most of the user requirements. Spark does not need the other info (like database, function, and views) that are stored in the local catalog. Note, Spark is not a query engine for a specific data source. Thus, we did not accept any public API that does not have an implementation in the past. I believe this still holds.

The catalog is part of the Spark SQL in the initial design and implementation. For the data sources that do not have catalog, they can use our catalog as a single source of truth. If they already have their own catalog, normally, they use the underlying data sources as the single source of truth. The table metadata in the Spark catalog is kind of a view of their physical schema that are stored in their local catalog. To support an atomic CREATE TABLE AS SELECT that requires modifying the catalog and data, we can add an interface for data sources but that is not part of catalog interface. The CTAS will not bypass our catalog. We will still register it in our catalog and the schema may or may not be stored in our catalog.

Will we define a super-feature catalog that can support all the data sources?

Based on my understanding, it is very hard. The priority is low based on our current scope of Spark SQL. If you want to do it, your design needs to consider how it works between global and local catalogs. This also requires a SPIP and voting. If you want to develop it incrementally without a design, I would suggest you to do it in your own fork. In the past, Spark on K8S was developed in a separate fork and then merged to the upstream of Apache Spark.

Welcome your contributions and let us make Spark great!

Cheers,

Xiao

Thakrar, Jayesh <jt...@conversantmedia.com>> 于2018年12月1日周六 下午9:10写道：
Just curious on the need for a catalog within Spark.

So Spark interface with other systems – many of which have a catalog of their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS, filesyststem, etc).
So what is the purpose of having this catalog within Spark for tables defined in Spark (which could be a front for other “catalogs”)?
Is it trying to fulfill some void/need…..
Also, would the Spark catalog be the common denominator of the other catalogs (least featured) or a super-feature catalog?

From: Xiao Li <ga...@gmail.com>>
Date: Saturday, December 1, 2018 at 10:49 PM
To: Ryan Blue <rb...@netflix.com>>
Cc: "user@spark.apache.org<ma...@spark.apache.org>" <de...@spark.apache.org>>
Subject: Re: DataSourceV2 community sync #3

Hi, Ryan,

Let us first focus on answering the most fundamental problem before discussing various related topics. What is a catalog in Spark SQL?

My definition of catalog is based on the database catalog. Basically, the catalog provides a service that manage the metadata/definitions of database objects (e.g., database, views, tables, functions, user roles, and so on).

In Spark SQL, all the external objects accessed through our data source APIs are called "tables". I do not think we will expand the support in the near future. That means, the metadata we need from the external data sources are for table only.

These data sources should not use the Catalog identifier to identify. That means, in "catalog.database.table", catalog is only used to identify the actual catalog instead of data sources.

For a Spark cluster, we could mount multiple catalogs (e.g., hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could get the metadata of the tables, database, functions by accessing different catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2", "glue.db3.tab2". In the future, if Spark has its own catalog implementation, we might have something like, "spark_catalog1.db3.tab2". The catalog will be used for registering all the external data sources, various Spark UDFs and so on.

At the same time, we should NOT mix the table-level data sources with catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1", "Hbase3.db1.tab2" will not appear.

Do you agree on my definition of catalog in Spark SQL?

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午7:25写道：

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

I’ve tried to open discussions on this for the last 6+ months because we need it. I understand that you’d like a comprehensive plan for supporting more than one catalog before moving forward, but I think most of us are okay with the incremental approach. It’s better to make progress toward the goal.

In general, data source API V2 and catalog API should be orthogonal
I agree with you, and they are. The API that Wenchen is working on for reading and writing data and the TableCatalog API are orthogonal efforts. As I said, they only overlap with the Table interface, and clearly tables loaded from a catalog need to be able to plug into the read/write API.

The reason these two efforts are related is that the community voted to standardize logical plans<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. Those standard plans have well-defined behavior for operations like CTAS, instead of relying on the data source plugin to do … something undefined. To implement this, we need a way for Spark to create tables, drop tables, etc. That’s why we need a way for sources to plug in Table-related catalog operations. (Sorry if this was already clear. I know I talked about it at length in the first v2 sync up.)

While the two APIs are orthogonal and serve different purposes, implementing common operations requires that we have both.

I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog.

It sounds like your definition of a “catalog” is different. I think you’re referring to a global catalog? Could you explain what you’re talking about here?

I’m talking about an API to interface with an external data source, which I think we need for the reasons I outlined above. I don’t care what we call it, but your comment seems to hint that there would be an API to look up tables in external sources. That’s the thing I’m talking about.

CatalogTableIdentifier: The PR is doing nothing but adding an interface.

Yes. I opened this PR to discuss how Spark should track tables from different catalogs and avoid passing those references to code paths that don’t support them. The use of table identifiers with a catalog part was discussed in the “Multiple catalog support” thread. I’ve also brought it up and pointed out how I think it should be used in syncs a couple of times.

Sorry if this discussion isn’t how you would have done it, but it’s a fairly simple idea that I don’t think requires its own doc.

On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

-  CatalogTableIdentifier: The PR https://github.com/apache/spark/pull/21978 is doing nothing but adding an interface. In the PR, we did not discuss how to resolve it, any restriction on the naming and what is a catalog.This requires more doc for explaining it. For example, https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017 Normally, we do not merge a PR without showing how to use it.

- TableCatalog: First, I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog. However, the catalog should do what the catalog is expected to do. If we follow what our data source API V2 is doing, basically, the data source is just a table. It is not related to database, view, or function. Mixing catalog with data source API V2 just makes the whole things more complex.

In general, data source API V2 and catalog API should be orthogonal. I believe the data source API V2 and catalog APIs are two separate projects. Hopefully, you understand my concern. If we really want to mix them together, I want to read the design of your multi-catalog support and understand more details.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午3:22写道：
Xiao,

I do have opinions about how multi-catalog support should work, but I don't think we are at a point where there is consensus. That's why I've started discussion threads and added the CatalogTableIdentifier PR instead of a comprehensive design doc. You have opinions about how users should interact with catalogs as well (your "federated catalog") and we should discuss our options here.

But the crucial point is that the user interaction doesn't need to be completely decided in order to move forward. A design for multi-catalog support isn't what we need right now; we need an API that plugins can implement to expose table operations.

I've proposed that API, TableCatalog, and a way to manage catalog plugins. I've made an argument for why I think that API is flexible enough for the task and still fairly simple.

I think that we can add TableCatalog now and work on multi-catalog support incrementally, and I have yet to hear your argument for why that is not the case.

rb

On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

Catalog is a really important component for Spark SQL or any analytics platform, I have to emphasize. Thus, a careful design is needed to ensure it works as expected. Based on my previous discussion with many community members, Spark SQL needs a catalog interface so that we can mount multiple external physical catalogs and they can be presented as a single logical catalog [which is a so-called global federated catalog]. In the future, we can use this interface to develop our own catalog (instead of Hive metastore) for more efficient metadata management. We can also plug in ACL management if needed.

Based on your previous answers, it sounds like you have many ideas in your mind about building a Catalog interface for Spark SQL, but it is not shown in the design doc. Could you write them down in a single doc? We can try to leave comments in the design doc, instead of discussing various issues in PRs, emails and meetings. It can also help the whole community understand your proposal and post their comments.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午7:06写道：

Xiao,

For the questions in this last email about how catalogs interact and how functions and other future features work: we discussed those last night. As I said then, I think that the right approach is incremental. We don’t want to design all of that in one gigantic proposal up front. To do that is to put ourselves into analysis paralysis.

We don’t have a design for how catalogs interact with one another, but I think we made a strong case for two points: first, that the proposed structure doesn’t preclude any of those future decisions (hence we should proceed incrementally). Second, that those situations aren’t that hard to think through if you’re concerned about them: functions that can run in Spark can be run on any data, functions that run in external sources cannot be run on any data.

You’re right that I haven’t completely covered your new questions. But to the questions in your first email:
•         You asked how, for example, Glue may be plugged in. That is well covered in the PR that adds catalogs as a plugin<https://github.com/apache/spark/pull/21306#issue-187572913>, the response I sent to Wenchen’s questions, and the earlier discussion thread I posted to this list with the subject “[DISCUSS] Multiple catalog support”. The short answer is that implementations are configured with Spark config properties and loaded with reflection.
•         You asked how users implement an external catalog without adding new data sources. That’s also covered in the “Multiple catalog support” proposal, the table catalog PR, and ongoing discussions on the v2 redesign. The answer is that a catalog returns a table instance that implements the various interfaces from Wenchen’s work. A table may implement them directly or return other existing implementations. Here’s how it worked in the old API<https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>.

I hope that you don’t think I expect you to go “without seeing the design”!

rb

On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

All the proposal I read is only related to Table metadata. Catalog contains the metadata of database, functions, columns, views, and so on. When we have multiple catalogs, how these catalogs interact with each other? How the global catalog works? How a view, table, function, database and column is resolved? Do we have nickname, mapping, wrapper?

Or I might miss the design docs you send? Could you post the doc?

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午3:06写道：
Xiao,

Please have a look at the pull requests and documents I've posted over the last few months.

If you still have questions about how you might plug in Glue, let me know and I can clarify.

rb

On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

Thanks for leading the discussion and sending out the memo!

Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.

Anything is possible. It depends on how we design the two interfaces. Now, most parts are unknown to me without seeing the design.

I think we need to see the user stories, and high-level design before working on a small portion of Catalog federation. We do not need an exhaustive design in the current stage, but we need to know how the new proposal works. For example, how to plug in a new Hive metastore? How to plug in a Glue? How do users implement a new external catalog without adding any new data sources? Without knowing more details, it is hard to say whether this TableCatalog can satisfy all the requirements.

Cheers,

Xiao

Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：

Hi everyone,

Here are my notes from last night’s sync. Some attendees that joined during discussion may be missing, since I made the list while we were waiting for people to join.

If you have topic suggestions for the next sync, please start sending them to me. Thank you!

Attendees:

Ryan Blue
John Zhuge
Jamison Bennett
Yuanjian Li
Xiao Li
stczwd
Matt Cheah
Wenchen Fan
Genglian Wang
Kevin Yu
Maryann Xue
Cody Koeninger
Bruce Robbins
Rohit Karlupia

Agenda:
•         Follow-up issues or discussion on Wenchen’s PR #23086
•         TableCatalog proposal
•         CatalogTableIdentifier

Notes:
•         Discussion about PR #23086
o    Where should the catalog API live since it needs to be accessible to catalyst rules, but the catalyst module is private?
o    Wenchen suggested creating a sql-api module for v2 API interfaces, making catalyst depend on it
o    Consensus was to use Wenchen’s suggestion
•         In discussion about #23086, Xiao asked how adding catalog to a table identifier will work
o    Background from Ryan: existing code paths use TableIdentifier and don’t expect a catalog portion. If an identifier with a catalog were passed to existing code, that code may use the default catalog not knowing that a different one was requested, which would be incorrect behavior.
o    Ryan: The proposal for CatalogTableIdentifier addresses this problem. TableIdentifier is used for identifiers that have no catalog set. By enforcing that requirement, passing a TableIdentifier to old code ensures that no catalogs leak into that code. This is also used when the catalog is set from context. For example, the TableCatalog API accepts only TableIdentifier because the catalog is already determined.
•         Xiao asked whether FunctionIdentifier needs to be updated in the same way as CatalogTableIdentifier.
o    Ryan: Yes, when a FunctionCatalog API is added
•         The remaining time was spent discussing whether the plan to incrementally replace the current catalog API will work. [Not great notes here, feel free to add your take in a reply]
o    Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.
o    Wenchen and Ryan think that functions should be orthogonal to data sources
o    Matt and Ryan think that catalog design can be done incrementally as new interfaces (i.e. FunctionCatalog) are added and that the proposed TableCatalog does not preclude designing for Xiao’s concerns later
o    [I forget who] pointed out that there are restrictions in some databases for views from different sources
o    There was some discussion about when functions or views cannot be orthogonal. For example, where the code runs is important. Functions pushed to sources cannot necessarily be run on other sources and Spark functions cannot necessarily be pushed down to sources.
o    Xiao would like a full catalog replacement design, including views, databases, and functions and how they interact, before moving forward with the proposed TableCatalog API
o    Ryan [and Matt, I think] think that TableCatalog is compatible with future decisions and the best path forward is to build incrementally. An exhaustive design process blocks progress on v2.

On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>> wrote:

Hi everyone,

I just sent out an invite for the next DSv2 community sync for Wednesday, 28 Nov at 5PM PST.

We have a few topics left over from last time to cover. A few people wanted to cover catalog APIs, so I put two items on the agenda:
•         The TableCatalog proposal (and other catalog APIs)
•         Using CatalogTableIdentifier to separate v1 and v2 code paths and avoid unintended behavior changes

As I noted in the summary last time, please send topics ahead of time so we can get started more quickly.

If you would like to be added to the google hangout invite, please let me know and I’ll add you. Thanks!

rb
--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Jayesh,

I don’t think this need is very narrow.

To have reliable behavior for CTAS, you need to:

   1. Check whether a table exists and fail. Right now, it is up to the
   source whether to continue with the write if the table already exists or to
   throw an exception, which is unreliable across sources.
   2. Create a table if it doesn’t exist.
   3. Drop the table if writing failed. In the current implementation, this
   can’t be done reliably because #1 is unreliable. So a failed CTAS has a
   side-effect that the table is created in some cases and a subsequent retry
   can fail because the table exists.

Leaving these operations up to the read/write API is why behavior isn’t
consistent today. It also increases the amount of work that a source needs
to do and mixes concerns (what to do in a write when the table doesn’t
exist). Spark is going to be a lot more predictable if we decompose the
behavior of these operations into create, drop, write, etc.

And in addition to CTAS, we want these operations to be exposed for
sources. If Spark can create a table, why wouldn’t you be able to run DROP
TABLE to remove it?

Last, Spark must be able to interact with the source of truth for tables.
If Spark can’t create a table in Cassandra, it should reject a CTAS
operation.

On Mon, Dec 3, 2018 at 9:52 AM Thakrar, Jayesh <jt...@conversantmedia.com>
wrote:

> Thank you Xiao – I was wondering what was the motivation for the catalog.
>
> If CTAS is the only candidate, would it suffice to have that as part of
> the data source interface only?
>
>
>
> If we look at BI, ETL and reporting tools which interface with many tables
> from different data sources at the same time, it makes sense to have a
> metadata catalog as the catalog is used to “design” the work for that tool
> (e.g. ETL processing unit, etc). Furthermore, the catalog serves as a data
> mapping to map external data types to the tool’s data types.
>
>
>
> Is the vision to move in that direction for Spark with the catalog
> support/feature?
>
> Also, is the vision to also incorporate the “options” specified for the
> data source into the catalog too?
>
> That may be helpful in some situations (e.g. the JDBC connect string being
> available from the catalog).
>
> *From: *Xiao Li <ga...@gmail.com>
> *Date: *Monday, December 3, 2018 at 10:44 AM
> *To: *"Thakrar, Jayesh" <jt...@conversantmedia.com>
> *Cc: *Ryan Blue <rb...@netflix.com>, "user@spark.apache.org" <
> dev@spark.apache.org>
> *Subject: *Re: DataSourceV2 community sync #3
>
>
>
> Hi, Jayesh,
>
>
>
> This is a good question. Spark is a unified analytics engine for various
> data sources. We are able to get the table schema from the underlying data
> sources via our data source APIs. Thus, it resolves most of the user
> requirements. Spark does not need the other info (like database, function,
> and views) that are stored in the local catalog. Note, Spark is not a query
> engine for a specific data source. Thus, we did not accept any public API
> that does not have an implementation in the past. I believe this still
> holds.
>
>
>
> The catalog is part of the Spark SQL in the initial design and
> implementation. For the data sources that do not have catalog, they can use
> our catalog as a single source of truth. If they already have their own
> catalog, normally, they use the underlying data sources as the single
> source of truth. The table metadata in the Spark catalog is kind of a view
> of their physical schema that are stored in their local catalog. To support
> an atomic CREATE TABLE AS SELECT that requires modifying the catalog and
> data, we can add an interface for data sources but that is not part of
> catalog interface. The CTAS will not bypass our catalog. We will still
> register it in our catalog and the schema may or may not be stored in our
> catalog.
>
>
>
> Will we define a super-feature catalog that can support all the data
> sources?
>
>
>
> Based on my understanding, it is very hard. The priority is low based on
> our current scope of Spark SQL. If you want to do it, your design needs to
> consider how it works between global and local catalogs. This also requires
> a SPIP and voting. If you want to develop it incrementally without a
> design, I would suggest you to do it in your own fork. In the past, Spark
> on K8S was developed in a separate fork and then merged to the upstream of
> Apache Spark.
>
>
>
> Welcome your contributions and let us make Spark great!
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> Thakrar, Jayesh <jt...@conversantmedia.com> 于2018年12月1日周六 下午9:10写道：
>
> Just curious on the need for a catalog within Spark.
>
>
>
> So Spark interface with other systems – many of which have a catalog of
> their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS,
> filesyststem, etc).
>
> So what is the purpose of having this catalog within Spark for tables
> defined in Spark (which could be a front for other “catalogs”)?
>
> Is it trying to fulfill some void/need…..
>
> Also, would the Spark catalog be the common denominator of the other
> catalogs (least featured) or a super-feature catalog?
>
>
>
> *From: *Xiao Li <ga...@gmail.com>
> *Date: *Saturday, December 1, 2018 at 10:49 PM
> *To: *Ryan Blue <rb...@netflix.com>
> *Cc: *"user@spark.apache.org" <de...@spark.apache.org>
> *Subject: *Re: DataSourceV2 community sync #3
>
>
>
> Hi, Ryan,
>
>
>
> Let us first focus on answering the most fundamental problem before
> discussing various related topics. What is a catalog in Spark SQL?
>
>
>
> My definition of catalog is based on the database catalog. Basically, the
> catalog provides a service that manage the metadata/definitions of database
> objects (e.g., database, views, tables, functions, user roles, and so on).
>
>
>
> In Spark SQL, all the external objects accessed through our data source
> APIs are called "tables". I do not think we will expand the support in the
> near future. That means, the metadata we need from the external data
> sources are for table only.
>
>
>
> These data sources should not use the Catalog identifier to identify. That
> means, in "catalog.database.table", catalog is only used to identify the
> actual catalog instead of data sources.
>
>
>
> For a Spark cluster, we could mount multiple catalogs (e.g.,
> hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
> get the metadata of the tables, database, functions by accessing different
> catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2",
> "glue.db3.tab2". In the future, if Spark has its own catalog
> implementation, we might have something like, "spark_catalog1.db3.tab2".
> The catalog will be used for registering all the external data sources,
> various Spark UDFs and so on.
>
>
>
> At the same time, we should NOT mix the table-level data sources with
> catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1",
> "Hbase3.db1.tab2" will not appear.
>
>
>
> Do you agree on my definition of catalog in Spark SQL?
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午7:25写道：
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
> I’ve tried to open discussions on this for the last 6+ months because we
> need it. I understand that you’d like a comprehensive plan for supporting
> more than one catalog before moving forward, but I think most of us are
> okay with the incremental approach. It’s better to make progress toward the
> goal.
>
> In general, data source API V2 and catalog API should be orthogonal
> I agree with you, and they are. The API that Wenchen is working on for
> reading and writing data and the TableCatalog API are orthogonal efforts.
> As I said, they only overlap with the Table interface, and clearly tables
> loaded from a catalog need to be able to plug into the read/write API.
>
> The reason these two efforts are related is that the community voted to standardize
> logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> Those standard plans have well-defined behavior for operations like CTAS,
> instead of relying on the data source plugin to do … something undefined.
> To implement this, we need a way for Spark to create tables, drop tables,
> etc. That’s why we need a way for sources to plug in Table-related catalog
> operations. (Sorry if this was already clear. I know I talked about it at
> length in the first v2 sync up.)
>
> While the two APIs are orthogonal and serve different purposes,
> implementing common operations requires that we have both.
>
> I would not call it a table catalog. I do not expect the data source
> should/need to implement a catalog. Since you might want an atomic CTAS, we
> can improve the table metadata resolution logic to support it with
> different resolution priorities. For example, try to get the metadata from
> the external data source, if the table metadata is not available in the
> catalog.
>
> It sounds like your definition of a “catalog” is different. I think you’re
> referring to a global catalog? Could you explain what you’re talking about
> here?
>
> I’m talking about an API to interface with an external data source, which
> I think we need for the reasons I outlined above. I don’t care what we call
> it, but your comment seems to hint that there would be an API to look up
> tables in external sources. That’s the thing I’m talking about.
>
> CatalogTableIdentifier: The PR is doing nothing but adding an interface.
>
> Yes. I opened this PR to discuss how Spark should track tables from
> different catalogs and avoid passing those references to code paths that
> don’t support them. The use of table identifiers with a catalog part was
> discussed in the “Multiple catalog support” thread. I’ve also brought it up
> and pointed out how I think it should be used in syncs a couple of times.
>
> Sorry if this discussion isn’t how you would have done it, but it’s a
> fairly simple idea that I don’t think requires its own doc.
>
>
>
> On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
>
>
> -  *CatalogTableIdentifier*: The PR
> https://github.com/apache/spark/pull/21978 is doing nothing but adding an
> interface. In the PR, we did not discuss how to resolve it, any restriction
> on the naming and what is a catalog.This requires more doc for explaining
> it. For example,
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
> Normally, we do not merge a PR without showing how to use it.
>
>
>
> - *TableCatalog*: First, I would not call it a table catalog. I do not
> expect the data source should/need to implement a catalog. Since you might
> want an atomic CTAS, we can improve the table metadata resolution logic to
> support it with different resolution priorities. For example, try to get
> the metadata from the external data source, if the table metadata is not
> available in the catalog. However, the catalog should do what the catalog
> is expected to do. If we follow what our data source API V2 is doing,
> basically, the data source is just a table. It is not related to database,
> view, or function. Mixing catalog with data source API V2 just makes the
> whole things more complex.
>
>
>
> In general, data source API V2 and catalog API should be orthogonal. I
> believe the data source API V2 and catalog APIs are two separate projects.
> Hopefully, you understand my concern. If we really want to mix them
> together, I want to read the design of your multi-catalog support and
> understand more details.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>
> Xiao,
>
>
>
> I do have opinions about how multi-catalog support should work, but I
> don't think we are at a point where there is consensus. That's why I've
> started discussion threads and added the CatalogTableIdentifier PR instead
> of a comprehensive design doc. You have opinions about how users should
> interact with catalogs as well (your "federated catalog") and we should
> discuss our options here.
>
>
>
> But the crucial point is that the user interaction doesn't need to be
> completely decided in order to move forward. A design for multi-catalog
> support isn't what we need right now; we need an API that plugins can
> implement to expose table operations.
>
>
>
> I've proposed that API, TableCatalog, and a way to manage catalog plugins.
> I've made an argument for why I think that API is flexible enough for the
> task and still fairly simple.
>
>
>
> I think that we can add TableCatalog now and work on multi-catalog support
> incrementally, and I have yet to hear your argument for why that is not the
> case.
>
>
>
> rb
>
>
>
> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> Catalog is a really important component for Spark SQL or any analytics
> platform, I have to emphasize. Thus, a careful design is needed to ensure
> it works as expected. Based on my previous discussion with many community
> members, Spark SQL needs a catalog interface so that we can mount multiple
> external physical catalogs and they can be presented as a single logical
> catalog [which is a so-called global federated catalog]. In the future, we
> can use this interface to develop our own catalog (instead of Hive
> metastore) for more efficient metadata management. We can also plug in ACL
> management if needed.
>
>
>
> Based on your previous answers, it sounds like you have many ideas in your
> mind about building a Catalog interface for Spark SQL, but it is not shown
> in the design doc. Could you write them down in a single doc? We can try to
> leave comments in the design doc, instead of discussing various issues in
> PRs, emails and meetings. It can also help the whole community understand
> your proposal and post their comments.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>
> Xiao,
>
> For the questions in this last email about how catalogs interact and how
> functions and other future features work: we discussed those last night. As
> I said then, I think that the right approach is incremental. We don’t want
> to design all of that in one gigantic proposal up front. To do that is to
> put ourselves into analysis paralysis.
>
> We don’t have a design for how catalogs interact with one another, but I
> think we made a strong case for two points: first, that the proposed
> structure doesn’t preclude any of those future decisions (hence we should
> proceed incrementally). Second, that those situations aren’t that hard to
> think through if you’re concerned about them: functions that can run in
> Spark can be run on any data, functions that run in external sources cannot
> be run on any data.
>
> You’re right that I haven’t completely covered your *new* questions. But
> to the questions in your first email:
>
> ·         You asked how, for example, Glue may be plugged in. That is
> well covered in the PR that adds catalogs as a plugin
> <https://github.com/apache/spark/pull/21306#issue-187572913>, the
> response I sent to Wenchen’s questions, and the earlier discussion thread I
> posted to this list with the subject “[DISCUSS] Multiple catalog support”.
> The short answer is that implementations are configured with Spark config
> properties and loaded with reflection.
>
> ·         You asked how users implement an external catalog without
> adding new data sources. That’s also covered in the “Multiple catalog
> support” proposal, the table catalog PR, and ongoing discussions on the v2
> redesign. The answer is that a catalog returns a table instance that
> implements the various interfaces from Wenchen’s work. A table may
> implement them directly or return other existing implementations. Here’s
> how it worked in the old API
> <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
> .
>
> I hope that you don’t think I expect you to go “without seeing the design”!
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> All the proposal I read is only related to Table metadata. Catalog
> contains the metadata of database, functions, columns, views, and so on.
> When we have multiple catalogs, how these catalogs interact with each
> other? How the global catalog works? How a view, table, function, database
> and column is resolved? Do we have nickname, mapping, wrapper?
>
>
>
> Or I might miss the design docs you send? Could you post the doc?
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>
> Xiao,
>
>
>
> Please have a look at the pull requests and documents I've posted over the
> last few months.
>
>
>
> If you still have questions about how you might plug in Glue, let me know
> and I can clarify.
>
>
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> Thanks for leading the discussion and sending out the memo!
>
>
>
> Xiao suggested that there are restrictions for how tables and functions
> interact. Because of this, he doesn’t think that separate TableCatalog and
> FunctionCatalog APIs are feasible.
>
>
>
> Anything is possible. It depends on how we design the two interfaces. Now,
> most parts are unknown to me without seeing the design.
>
>
>
> I think we need to see the user stories, and high-level design before
> working on a small portion of Catalog federation. We do not need an
> exhaustive design in the current stage, but we need to know how the new
> proposal works. For example, how to plug in a new Hive metastore? How to
> plug in a Glue? How do users implement a new external catalog without
> adding any new data sources? Without knowing more details, it is hard to
> say whether this TableCatalog can satisfy all the requirements.
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>
> Hi everyone,
>
> Here are my notes from last night’s sync. Some attendees that joined
> during discussion may be missing, since I made the list while we were
> waiting for people to join.
>
> If you have topic suggestions for the next sync, please start sending them
> to me. Thank you!
>
> *Attendees:*
>
> Ryan Blue
> John Zhuge
> Jamison Bennett
> Yuanjian Li
> Xiao Li
> stczwd
> Matt Cheah
> Wenchen Fan
> Genglian Wang
> Kevin Yu
> Maryann Xue
> Cody Koeninger
> Bruce Robbins
> Rohit Karlupia
>
> *Agenda:*
>
> ·         Follow-up issues or discussion on Wenchen’s PR #23086
>
> ·         TableCatalog proposal
>
> ·         CatalogTableIdentifier
>
> *Notes:*
>
> ·         Discussion about PR #23086
>
> o    Where should the catalog API live since it needs to be accessible to
> catalyst rules, but the catalyst module is private?
>
> o    Wenchen suggested creating a sql-api module for v2 API interfaces,
> making catalyst depend on it
>
> o    Consensus was to use Wenchen’s suggestion
>
> ·         In discussion about #23086, Xiao asked how adding catalog to a
> table identifier will work
>
> o    Background from Ryan: existing code paths use TableIdentifier and
> don’t expect a catalog portion. If an identifier with a catalog were passed
> to existing code, that code may use the default catalog not knowing that a
> different one was requested, which would be incorrect behavior.
>
> o    Ryan: The proposal for CatalogTableIdentifier addresses this
> problem. TableIdentifier is used for identifiers that have no catalog set.
> By enforcing that requirement, passing a TableIdentifier to old code
> ensures that no catalogs leak into that code. This is also used when the
> catalog is set from context. For example, the TableCatalog API accepts only
> TableIdentifier because the catalog is already determined.
>
> ·         Xiao asked whether FunctionIdentifier needs to be updated in
> the same way as CatalogTableIdentifier.
>
> o    Ryan: Yes, when a FunctionCatalog API is added
>
> ·         The remaining time was spent discussing whether the plan to
> incrementally replace the current catalog API will work. [Not great notes
> here, feel free to add your take in a reply]
>
> o    Xiao suggested that there are restrictions for how tables and
> functions interact. Because of this, he doesn’t think that separate
> TableCatalog and FunctionCatalog APIs are feasible.
>
> o    Wenchen and Ryan think that functions should be orthogonal to data
> sources
>
> o    Matt and Ryan think that catalog design can be done incrementally as
> new interfaces (i.e. FunctionCatalog) are added and that the proposed
> TableCatalog does not preclude designing for Xiao’s concerns later
>
> o    [I forget who] pointed out that there are restrictions in some
> databases for views from different sources
>
> o    There was some discussion about when functions or views cannot be
> orthogonal. For example, where the code runs is important. Functions pushed
> to sources cannot necessarily be run on other sources and Spark functions
> cannot necessarily be pushed down to sources.
>
> o    Xiao would like a full catalog replacement design, including views,
> databases, and functions and how they interact, before moving forward with
> the proposed TableCatalog API
>
> o    Ryan [and Matt, I think] think that TableCatalog is compatible with
> future decisions and the best path forward is to build incrementally. An
> exhaustive design process blocks progress on v2.
>
>
>
> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>
> Hi everyone,
>
> I just sent out an invite for the next DSv2 community sync for Wednesday,
> 28 Nov at 5PM PST.
>
> We have a few topics left over from last time to cover. A few people
> wanted to cover catalog APIs, so I put two items on the agenda:
>
> ·         The TableCatalog proposal (and other catalog APIs)
>
> ·         Using CatalogTableIdentifier to separate v1 and v2 code paths
> and avoid unintended behavior changes
>
> As I noted in the summary last time, please send topics ahead of time so
> we can get started more quickly.
>
> If you would like to be added to the google hangout invite, please let me
> know and I’ll add you. Thanks!
>
> rb
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.

Thank you Xiao – I was wondering what was the motivation for the catalog.
If CTAS is the only candidate, would it suffice to have that as part of the data source interface only?

If we look at BI, ETL and reporting tools which interface with many tables from different data sources at the same time, it makes sense to have a metadata catalog as the catalog is used to “design” the work for that tool (e.g. ETL processing unit, etc). Furthermore, the catalog serves as a data mapping to map external data types to the tool’s data types.

Is the vision to move in that direction for Spark with the catalog support/feature?
Also, is the vision to also incorporate the “options” specified for the data source into the catalog too?
That may be helpful in some situations (e.g. the JDBC connect string being available from the catalog).
From: Xiao Li <ga...@gmail.com>
Date: Monday, December 3, 2018 at 10:44 AM
To: "Thakrar, Jayesh" <jt...@conversantmedia.com>
Cc: Ryan Blue <rb...@netflix.com>, "user@spark.apache.org" <de...@spark.apache.org>
Subject: Re: DataSourceV2 community sync #3

Hi, Jayesh,

This is a good question. Spark is a unified analytics engine for various data sources. We are able to get the table schema from the underlying data sources via our data source APIs. Thus, it resolves most of the user requirements. Spark does not need the other info (like database, function, and views) that are stored in the local catalog. Note, Spark is not a query engine for a specific data source. Thus, we did not accept any public API that does not have an implementation in the past. I believe this still holds.

The catalog is part of the Spark SQL in the initial design and implementation. For the data sources that do not have catalog, they can use our catalog as a single source of truth. If they already have their own catalog, normally, they use the underlying data sources as the single source of truth. The table metadata in the Spark catalog is kind of a view of their physical schema that are stored in their local catalog. To support an atomic CREATE TABLE AS SELECT that requires modifying the catalog and data, we can add an interface for data sources but that is not part of catalog interface. The CTAS will not bypass our catalog. We will still register it in our catalog and the schema may or may not be stored in our catalog.

Will we define a super-feature catalog that can support all the data sources?

Based on my understanding, it is very hard. The priority is low based on our current scope of Spark SQL. If you want to do it, your design needs to consider how it works between global and local catalogs. This also requires a SPIP and voting. If you want to develop it incrementally without a design, I would suggest you to do it in your own fork. In the past, Spark on K8S was developed in a separate fork and then merged to the upstream of Apache Spark.

Welcome your contributions and let us make Spark great!

Cheers,

Xiao

Thakrar, Jayesh <jt...@conversantmedia.com>> 于2018年12月1日周六 下午9:10写道：
Just curious on the need for a catalog within Spark.

So Spark interface with other systems – many of which have a catalog of their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS, filesyststem, etc).
So what is the purpose of having this catalog within Spark for tables defined in Spark (which could be a front for other “catalogs”)?
Is it trying to fulfill some void/need…..
Also, would the Spark catalog be the common denominator of the other catalogs (least featured) or a super-feature catalog?

From: Xiao Li <ga...@gmail.com>>
Date: Saturday, December 1, 2018 at 10:49 PM
To: Ryan Blue <rb...@netflix.com>>
Cc: "user@spark.apache.org<ma...@spark.apache.org>" <de...@spark.apache.org>>
Subject: Re: DataSourceV2 community sync #3

Hi, Ryan,

Let us first focus on answering the most fundamental problem before discussing various related topics. What is a catalog in Spark SQL?

My definition of catalog is based on the database catalog. Basically, the catalog provides a service that manage the metadata/definitions of database objects (e.g., database, views, tables, functions, user roles, and so on).

In Spark SQL, all the external objects accessed through our data source APIs are called "tables". I do not think we will expand the support in the near future. That means, the metadata we need from the external data sources are for table only.

These data sources should not use the Catalog identifier to identify. That means, in "catalog.database.table", catalog is only used to identify the actual catalog instead of data sources.

For a Spark cluster, we could mount multiple catalogs (e.g., hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could get the metadata of the tables, database, functions by accessing different catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2", "glue.db3.tab2". In the future, if Spark has its own catalog implementation, we might have something like, "spark_catalog1.db3.tab2". The catalog will be used for registering all the external data sources, various Spark UDFs and so on.

At the same time, we should NOT mix the table-level data sources with catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1", "Hbase3.db1.tab2" will not appear.

Do you agree on my definition of catalog in Spark SQL?

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午7:25写道：

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

I’ve tried to open discussions on this for the last 6+ months because we need it. I understand that you’d like a comprehensive plan for supporting more than one catalog before moving forward, but I think most of us are okay with the incremental approach. It’s better to make progress toward the goal.

In general, data source API V2 and catalog API should be orthogonal
I agree with you, and they are. The API that Wenchen is working on for reading and writing data and the TableCatalog API are orthogonal efforts. As I said, they only overlap with the Table interface, and clearly tables loaded from a catalog need to be able to plug into the read/write API.

The reason these two efforts are related is that the community voted to standardize logical plans<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. Those standard plans have well-defined behavior for operations like CTAS, instead of relying on the data source plugin to do … something undefined. To implement this, we need a way for Spark to create tables, drop tables, etc. That’s why we need a way for sources to plug in Table-related catalog operations. (Sorry if this was already clear. I know I talked about it at length in the first v2 sync up.)

While the two APIs are orthogonal and serve different purposes, implementing common operations requires that we have both.

I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog.

It sounds like your definition of a “catalog” is different. I think you’re referring to a global catalog? Could you explain what you’re talking about here?

I’m talking about an API to interface with an external data source, which I think we need for the reasons I outlined above. I don’t care what we call it, but your comment seems to hint that there would be an API to look up tables in external sources. That’s the thing I’m talking about.

CatalogTableIdentifier: The PR is doing nothing but adding an interface.

Yes. I opened this PR to discuss how Spark should track tables from different catalogs and avoid passing those references to code paths that don’t support them. The use of table identifiers with a catalog part was discussed in the “Multiple catalog support” thread. I’ve also brought it up and pointed out how I think it should be used in syncs a couple of times.

Sorry if this discussion isn’t how you would have done it, but it’s a fairly simple idea that I don’t think requires its own doc.

On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

-  CatalogTableIdentifier: The PR https://github.com/apache/spark/pull/21978 is doing nothing but adding an interface. In the PR, we did not discuss how to resolve it, any restriction on the naming and what is a catalog.This requires more doc for explaining it. For example, https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017 Normally, we do not merge a PR without showing how to use it.

- TableCatalog: First, I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog. However, the catalog should do what the catalog is expected to do. If we follow what our data source API V2 is doing, basically, the data source is just a table. It is not related to database, view, or function. Mixing catalog with data source API V2 just makes the whole things more complex.

In general, data source API V2 and catalog API should be orthogonal. I believe the data source API V2 and catalog APIs are two separate projects. Hopefully, you understand my concern. If we really want to mix them together, I want to read the design of your multi-catalog support and understand more details.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午3:22写道：
Xiao,

I do have opinions about how multi-catalog support should work, but I don't think we are at a point where there is consensus. That's why I've started discussion threads and added the CatalogTableIdentifier PR instead of a comprehensive design doc. You have opinions about how users should interact with catalogs as well (your "federated catalog") and we should discuss our options here.

But the crucial point is that the user interaction doesn't need to be completely decided in order to move forward. A design for multi-catalog support isn't what we need right now; we need an API that plugins can implement to expose table operations.

I've proposed that API, TableCatalog, and a way to manage catalog plugins. I've made an argument for why I think that API is flexible enough for the task and still fairly simple.

I think that we can add TableCatalog now and work on multi-catalog support incrementally, and I have yet to hear your argument for why that is not the case.

rb

On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

Catalog is a really important component for Spark SQL or any analytics platform, I have to emphasize. Thus, a careful design is needed to ensure it works as expected. Based on my previous discussion with many community members, Spark SQL needs a catalog interface so that we can mount multiple external physical catalogs and they can be presented as a single logical catalog [which is a so-called global federated catalog]. In the future, we can use this interface to develop our own catalog (instead of Hive metastore) for more efficient metadata management. We can also plug in ACL management if needed.

Based on your previous answers, it sounds like you have many ideas in your mind about building a Catalog interface for Spark SQL, but it is not shown in the design doc. Could you write them down in a single doc? We can try to leave comments in the design doc, instead of discussing various issues in PRs, emails and meetings. It can also help the whole community understand your proposal and post their comments.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午7:06写道：

Xiao,

For the questions in this last email about how catalogs interact and how functions and other future features work: we discussed those last night. As I said then, I think that the right approach is incremental. We don’t want to design all of that in one gigantic proposal up front. To do that is to put ourselves into analysis paralysis.

We don’t have a design for how catalogs interact with one another, but I think we made a strong case for two points: first, that the proposed structure doesn’t preclude any of those future decisions (hence we should proceed incrementally). Second, that those situations aren’t that hard to think through if you’re concerned about them: functions that can run in Spark can be run on any data, functions that run in external sources cannot be run on any data.

You’re right that I haven’t completely covered your new questions. But to the questions in your first email:
•         You asked how, for example, Glue may be plugged in. That is well covered in the PR that adds catalogs as a plugin<https://github.com/apache/spark/pull/21306#issue-187572913>, the response I sent to Wenchen’s questions, and the earlier discussion thread I posted to this list with the subject “[DISCUSS] Multiple catalog support”. The short answer is that implementations are configured with Spark config properties and loaded with reflection.
•         You asked how users implement an external catalog without adding new data sources. That’s also covered in the “Multiple catalog support” proposal, the table catalog PR, and ongoing discussions on the v2 redesign. The answer is that a catalog returns a table instance that implements the various interfaces from Wenchen’s work. A table may implement them directly or return other existing implementations. Here’s how it worked in the old API<https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>.

I hope that you don’t think I expect you to go “without seeing the design”!

rb

On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

All the proposal I read is only related to Table metadata. Catalog contains the metadata of database, functions, columns, views, and so on. When we have multiple catalogs, how these catalogs interact with each other? How the global catalog works? How a view, table, function, database and column is resolved? Do we have nickname, mapping, wrapper?

Or I might miss the design docs you send? Could you post the doc?

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午3:06写道：
Xiao,

Please have a look at the pull requests and documents I've posted over the last few months.

If you still have questions about how you might plug in Glue, let me know and I can clarify.

rb

On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

Thanks for leading the discussion and sending out the memo!

Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.

Anything is possible. It depends on how we design the two interfaces. Now, most parts are unknown to me without seeing the design.

I think we need to see the user stories, and high-level design before working on a small portion of Catalog federation. We do not need an exhaustive design in the current stage, but we need to know how the new proposal works. For example, how to plug in a new Hive metastore? How to plug in a Glue? How do users implement a new external catalog without adding any new data sources? Without knowing more details, it is hard to say whether this TableCatalog can satisfy all the requirements.

Cheers,

Xiao

Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：

Hi everyone,

Here are my notes from last night’s sync. Some attendees that joined during discussion may be missing, since I made the list while we were waiting for people to join.

If you have topic suggestions for the next sync, please start sending them to me. Thank you!

Attendees:

Ryan Blue
John Zhuge
Jamison Bennett
Yuanjian Li
Xiao Li
stczwd
Matt Cheah
Wenchen Fan
Genglian Wang
Kevin Yu
Maryann Xue
Cody Koeninger
Bruce Robbins
Rohit Karlupia

Agenda:
•         Follow-up issues or discussion on Wenchen’s PR #23086
•         TableCatalog proposal
•         CatalogTableIdentifier

Notes:
•         Discussion about PR #23086
o    Where should the catalog API live since it needs to be accessible to catalyst rules, but the catalyst module is private?
o    Wenchen suggested creating a sql-api module for v2 API interfaces, making catalyst depend on it
o    Consensus was to use Wenchen’s suggestion
•         In discussion about #23086, Xiao asked how adding catalog to a table identifier will work
o    Background from Ryan: existing code paths use TableIdentifier and don’t expect a catalog portion. If an identifier with a catalog were passed to existing code, that code may use the default catalog not knowing that a different one was requested, which would be incorrect behavior.
o    Ryan: The proposal for CatalogTableIdentifier addresses this problem. TableIdentifier is used for identifiers that have no catalog set. By enforcing that requirement, passing a TableIdentifier to old code ensures that no catalogs leak into that code. This is also used when the catalog is set from context. For example, the TableCatalog API accepts only TableIdentifier because the catalog is already determined.
•         Xiao asked whether FunctionIdentifier needs to be updated in the same way as CatalogTableIdentifier.
o    Ryan: Yes, when a FunctionCatalog API is added
•         The remaining time was spent discussing whether the plan to incrementally replace the current catalog API will work. [Not great notes here, feel free to add your take in a reply]
o    Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.
o    Wenchen and Ryan think that functions should be orthogonal to data sources
o    Matt and Ryan think that catalog design can be done incrementally as new interfaces (i.e. FunctionCatalog) are added and that the proposed TableCatalog does not preclude designing for Xiao’s concerns later
o    [I forget who] pointed out that there are restrictions in some databases for views from different sources
o    There was some discussion about when functions or views cannot be orthogonal. For example, where the code runs is important. Functions pushed to sources cannot necessarily be run on other sources and Spark functions cannot necessarily be pushed down to sources.
o    Xiao would like a full catalog replacement design, including views, databases, and functions and how they interact, before moving forward with the proposed TableCatalog API
o    Ryan [and Matt, I think] think that TableCatalog is compatible with future decisions and the best path forward is to build incrementally. An exhaustive design process blocks progress on v2.

On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>> wrote:

Hi everyone,

I just sent out an invite for the next DSv2 community sync for Wednesday, 28 Nov at 5PM PST.

We have a few topics left over from last time to cover. A few people wanted to cover catalog APIs, so I put two items on the agenda:
•         The TableCatalog proposal (and other catalog APIs)
•         Using CatalogTableIdentifier to separate v1 and v2 code paths and avoid unintended behavior changes

As I noted in the summary last time, please send topics ahead of time so we can get started more quickly.

If you would like to be added to the google hangout invite, please let me know and I’ll add you. Thanks!

rb
--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Hi, Jayesh,

This is a good question. Spark is a unified analytics engine for various
data sources. We are able to get the table schema from the underlying data
sources via our data source APIs. Thus, it resolves most of the user
requirements. Spark does not need the other info (like database, function,
and views) that are stored in the local catalog. Note, Spark is not a query
engine for a specific data source. Thus, we did not accept any public API
that does not have an implementation in the past. I believe this still
holds.

The catalog is part of the Spark SQL in the initial design and
implementation. For the data sources that do not have catalog, they can use
our catalog as a single source of truth. If they already have their own
catalog, normally, they use the underlying data sources as the single
source of truth. The table metadata in the Spark catalog is kind of a view
of their physical schema that are stored in their local catalog. To support
an atomic CREATE TABLE AS SELECT that requires modifying the catalog and
data, we can add an interface for data sources but that is not part of
catalog interface. The CTAS will not bypass our catalog. We will still
register it in our catalog and the schema may or may not be stored in our
catalog.

Will we define a super-feature catalog that can support all the data
> sources?


Based on my understanding, it is very hard. The priority is low based on
our current scope of Spark SQL. If you want to do it, your design needs to
consider how it works between global and local catalogs. This also requires
a SPIP and voting. If you want to develop it incrementally without a
design, I would suggest you to do it in your own fork. In the past, Spark
on K8S was developed in a separate fork and then merged to the upstream of
Apache Spark.

Welcome your contributions and let us make Spark great!

Cheers,

Xiao

Thakrar, Jayesh <jt...@conversantmedia.com> 于2018年12月1日周六 下午9:10写道：

> Just curious on the need for a catalog within Spark.
>
>
>
> So Spark interface with other systems – many of which have a catalog of
> their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS,
> filesyststem, etc).
>
> So what is the purpose of having this catalog within Spark for tables
> defined in Spark (which could be a front for other “catalogs”)?
>
> Is it trying to fulfill some void/need…..
>
> Also, would the Spark catalog be the common denominator of the other
> catalogs (least featured) or a super-feature catalog?
>
>
>
> *From: *Xiao Li <ga...@gmail.com>
> *Date: *Saturday, December 1, 2018 at 10:49 PM
> *To: *Ryan Blue <rb...@netflix.com>
> *Cc: *"user@spark.apache.org" <de...@spark.apache.org>
> *Subject: *Re: DataSourceV2 community sync #3
>
>
>
> Hi, Ryan,
>
>
>
> Let us first focus on answering the most fundamental problem before
> discussing various related topics. What is a catalog in Spark SQL?
>
>
>
> My definition of catalog is based on the database catalog. Basically, the
> catalog provides a service that manage the metadata/definitions of database
> objects (e.g., database, views, tables, functions, user roles, and so on).
>
>
>
> In Spark SQL, all the external objects accessed through our data source
> APIs are called "tables". I do not think we will expand the support in the
> near future. That means, the metadata we need from the external data
> sources are for table only.
>
>
>
> These data sources should not use the Catalog identifier to identify. That
> means, in "catalog.database.table", catalog is only used to identify the
> actual catalog instead of data sources.
>
>
>
> For a Spark cluster, we could mount multiple catalogs (e.g.,
> hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
> get the metadata of the tables, database, functions by accessing different
> catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2",
> "glue.db3.tab2". In the future, if Spark has its own catalog
> implementation, we might have something like, "spark_catalog1.db3.tab2".
> The catalog will be used for registering all the external data sources,
> various Spark UDFs and so on.
>
>
>
> At the same time, we should NOT mix the table-level data sources with
> catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1",
> "Hbase3.db1.tab2" will not appear.
>
>
>
> Do you agree on my definition of catalog in Spark SQL?
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午7:25写道：
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
> I’ve tried to open discussions on this for the last 6+ months because we
> need it. I understand that you’d like a comprehensive plan for supporting
> more than one catalog before moving forward, but I think most of us are
> okay with the incremental approach. It’s better to make progress toward the
> goal.
>
> In general, data source API V2 and catalog API should be orthogonal
> I agree with you, and they are. The API that Wenchen is working on for
> reading and writing data and the TableCatalog API are orthogonal efforts.
> As I said, they only overlap with the Table interface, and clearly tables
> loaded from a catalog need to be able to plug into the read/write API.
>
> The reason these two efforts are related is that the community voted to standardize
> logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> Those standard plans have well-defined behavior for operations like CTAS,
> instead of relying on the data source plugin to do … something undefined.
> To implement this, we need a way for Spark to create tables, drop tables,
> etc. That’s why we need a way for sources to plug in Table-related catalog
> operations. (Sorry if this was already clear. I know I talked about it at
> length in the first v2 sync up.)
>
> While the two APIs are orthogonal and serve different purposes,
> implementing common operations requires that we have both.
>
> I would not call it a table catalog. I do not expect the data source
> should/need to implement a catalog. Since you might want an atomic CTAS, we
> can improve the table metadata resolution logic to support it with
> different resolution priorities. For example, try to get the metadata from
> the external data source, if the table metadata is not available in the
> catalog.
>
> It sounds like your definition of a “catalog” is different. I think you’re
> referring to a global catalog? Could you explain what you’re talking about
> here?
>
> I’m talking about an API to interface with an external data source, which
> I think we need for the reasons I outlined above. I don’t care what we call
> it, but your comment seems to hint that there would be an API to look up
> tables in external sources. That’s the thing I’m talking about.
>
> CatalogTableIdentifier: The PR is doing nothing but adding an interface.
>
> Yes. I opened this PR to discuss how Spark should track tables from
> different catalogs and avoid passing those references to code paths that
> don’t support them. The use of table identifiers with a catalog part was
> discussed in the “Multiple catalog support” thread. I’ve also brought it up
> and pointed out how I think it should be used in syncs a couple of times.
>
> Sorry if this discussion isn’t how you would have done it, but it’s a
> fairly simple idea that I don’t think requires its own doc.
>
>
>
> On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
>
>
> -  *CatalogTableIdentifier*: The PR
> https://github.com/apache/spark/pull/21978 is doing nothing but adding an
> interface. In the PR, we did not discuss how to resolve it, any restriction
> on the naming and what is a catalog.This requires more doc for explaining
> it. For example,
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
> Normally, we do not merge a PR without showing how to use it.
>
>
>
> - *TableCatalog*: First, I would not call it a table catalog. I do not
> expect the data source should/need to implement a catalog. Since you might
> want an atomic CTAS, we can improve the table metadata resolution logic to
> support it with different resolution priorities. For example, try to get
> the metadata from the external data source, if the table metadata is not
> available in the catalog. However, the catalog should do what the catalog
> is expected to do. If we follow what our data source API V2 is doing,
> basically, the data source is just a table. It is not related to database,
> view, or function. Mixing catalog with data source API V2 just makes the
> whole things more complex.
>
>
>
> In general, data source API V2 and catalog API should be orthogonal. I
> believe the data source API V2 and catalog APIs are two separate projects.
> Hopefully, you understand my concern. If we really want to mix them
> together, I want to read the design of your multi-catalog support and
> understand more details.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>
> Xiao,
>
>
>
> I do have opinions about how multi-catalog support should work, but I
> don't think we are at a point where there is consensus. That's why I've
> started discussion threads and added the CatalogTableIdentifier PR instead
> of a comprehensive design doc. You have opinions about how users should
> interact with catalogs as well (your "federated catalog") and we should
> discuss our options here.
>
>
>
> But the crucial point is that the user interaction doesn't need to be
> completely decided in order to move forward. A design for multi-catalog
> support isn't what we need right now; we need an API that plugins can
> implement to expose table operations.
>
>
>
> I've proposed that API, TableCatalog, and a way to manage catalog plugins.
> I've made an argument for why I think that API is flexible enough for the
> task and still fairly simple.
>
>
>
> I think that we can add TableCatalog now and work on multi-catalog support
> incrementally, and I have yet to hear your argument for why that is not the
> case.
>
>
>
> rb
>
>
>
> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> Catalog is a really important component for Spark SQL or any analytics
> platform, I have to emphasize. Thus, a careful design is needed to ensure
> it works as expected. Based on my previous discussion with many community
> members, Spark SQL needs a catalog interface so that we can mount multiple
> external physical catalogs and they can be presented as a single logical
> catalog [which is a so-called global federated catalog]. In the future, we
> can use this interface to develop our own catalog (instead of Hive
> metastore) for more efficient metadata management. We can also plug in ACL
> management if needed.
>
>
>
> Based on your previous answers, it sounds like you have many ideas in your
> mind about building a Catalog interface for Spark SQL, but it is not shown
> in the design doc. Could you write them down in a single doc? We can try to
> leave comments in the design doc, instead of discussing various issues in
> PRs, emails and meetings. It can also help the whole community understand
> your proposal and post their comments.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>
> Xiao,
>
> For the questions in this last email about how catalogs interact and how
> functions and other future features work: we discussed those last night. As
> I said then, I think that the right approach is incremental. We don’t want
> to design all of that in one gigantic proposal up front. To do that is to
> put ourselves into analysis paralysis.
>
> We don’t have a design for how catalogs interact with one another, but I
> think we made a strong case for two points: first, that the proposed
> structure doesn’t preclude any of those future decisions (hence we should
> proceed incrementally). Second, that those situations aren’t that hard to
> think through if you’re concerned about them: functions that can run in
> Spark can be run on any data, functions that run in external sources cannot
> be run on any data.
>
> You’re right that I haven’t completely covered your *new* questions. But
> to the questions in your first email:
>
> ·         You asked how, for example, Glue may be plugged in. That is
> well covered in the PR that adds catalogs as a plugin
> <https://github.com/apache/spark/pull/21306#issue-187572913>, the
> response I sent to Wenchen’s questions, and the earlier discussion thread I
> posted to this list with the subject “[DISCUSS] Multiple catalog support”.
> The short answer is that implementations are configured with Spark config
> properties and loaded with reflection.
>
> ·         You asked how users implement an external catalog without
> adding new data sources. That’s also covered in the “Multiple catalog
> support” proposal, the table catalog PR, and ongoing discussions on the v2
> redesign. The answer is that a catalog returns a table instance that
> implements the various interfaces from Wenchen’s work. A table may
> implement them directly or return other existing implementations. Here’s
> how it worked in the old API
> <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
> .
>
> I hope that you don’t think I expect you to go “without seeing the design”!
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> All the proposal I read is only related to Table metadata. Catalog
> contains the metadata of database, functions, columns, views, and so on.
> When we have multiple catalogs, how these catalogs interact with each
> other? How the global catalog works? How a view, table, function, database
> and column is resolved? Do we have nickname, mapping, wrapper?
>
>
>
> Or I might miss the design docs you send? Could you post the doc?
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>
> Xiao,
>
>
>
> Please have a look at the pull requests and documents I've posted over the
> last few months.
>
>
>
> If you still have questions about how you might plug in Glue, let me know
> and I can clarify.
>
>
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> Thanks for leading the discussion and sending out the memo!
>
>
>
> Xiao suggested that there are restrictions for how tables and functions
> interact. Because of this, he doesn’t think that separate TableCatalog and
> FunctionCatalog APIs are feasible.
>
>
>
> Anything is possible. It depends on how we design the two interfaces. Now,
> most parts are unknown to me without seeing the design.
>
>
>
> I think we need to see the user stories, and high-level design before
> working on a small portion of Catalog federation. We do not need an
> exhaustive design in the current stage, but we need to know how the new
> proposal works. For example, how to plug in a new Hive metastore? How to
> plug in a Glue? How do users implement a new external catalog without
> adding any new data sources? Without knowing more details, it is hard to
> say whether this TableCatalog can satisfy all the requirements.
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>
> Hi everyone,
>
> Here are my notes from last night’s sync. Some attendees that joined
> during discussion may be missing, since I made the list while we were
> waiting for people to join.
>
> If you have topic suggestions for the next sync, please start sending them
> to me. Thank you!
>
> *Attendees:*
>
> Ryan Blue
> John Zhuge
> Jamison Bennett
> Yuanjian Li
> Xiao Li
> stczwd
> Matt Cheah
> Wenchen Fan
> Genglian Wang
> Kevin Yu
> Maryann Xue
> Cody Koeninger
> Bruce Robbins
> Rohit Karlupia
>
> *Agenda:*
>
> ·         Follow-up issues or discussion on Wenchen’s PR #23086
>
> ·         TableCatalog proposal
>
> ·         CatalogTableIdentifier
>
> *Notes:*
>
> ·         Discussion about PR #23086
>
> o    Where should the catalog API live since it needs to be accessible to
> catalyst rules, but the catalyst module is private?
>
> o    Wenchen suggested creating a sql-api module for v2 API interfaces,
> making catalyst depend on it
>
> o    Consensus was to use Wenchen’s suggestion
>
> ·         In discussion about #23086, Xiao asked how adding catalog to a
> table identifier will work
>
> o    Background from Ryan: existing code paths use TableIdentifier and
> don’t expect a catalog portion. If an identifier with a catalog were passed
> to existing code, that code may use the default catalog not knowing that a
> different one was requested, which would be incorrect behavior.
>
> o    Ryan: The proposal for CatalogTableIdentifier addresses this
> problem. TableIdentifier is used for identifiers that have no catalog set.
> By enforcing that requirement, passing a TableIdentifier to old code
> ensures that no catalogs leak into that code. This is also used when the
> catalog is set from context. For example, the TableCatalog API accepts only
> TableIdentifier because the catalog is already determined.
>
> ·         Xiao asked whether FunctionIdentifier needs to be updated in
> the same way as CatalogTableIdentifier.
>
> o    Ryan: Yes, when a FunctionCatalog API is added
>
> ·         The remaining time was spent discussing whether the plan to
> incrementally replace the current catalog API will work. [Not great notes
> here, feel free to add your take in a reply]
>
> o    Xiao suggested that there are restrictions for how tables and
> functions interact. Because of this, he doesn’t think that separate
> TableCatalog and FunctionCatalog APIs are feasible.
>
> o    Wenchen and Ryan think that functions should be orthogonal to data
> sources
>
> o    Matt and Ryan think that catalog design can be done incrementally as
> new interfaces (i.e. FunctionCatalog) are added and that the proposed
> TableCatalog does not preclude designing for Xiao’s concerns later
>
> o    [I forget who] pointed out that there are restrictions in some
> databases for views from different sources
>
> o    There was some discussion about when functions or views cannot be
> orthogonal. For example, where the code runs is important. Functions pushed
> to sources cannot necessarily be run on other sources and Spark functions
> cannot necessarily be pushed down to sources.
>
> o    Xiao would like a full catalog replacement design, including views,
> databases, and functions and how they interact, before moving forward with
> the proposed TableCatalog API
>
> o    Ryan [and Matt, I think] think that TableCatalog is compatible with
> future decisions and the best path forward is to build incrementally. An
> exhaustive design process blocks progress on v2.
>
>
>
> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>
> Hi everyone,
>
> I just sent out an invite for the next DSv2 community sync for Wednesday,
> 28 Nov at 5PM PST.
>
> We have a few topics left over from last time to cover. A few people
> wanted to cover catalog APIs, so I put two items on the agenda:
>
> ·         The TableCatalog proposal (and other catalog APIs)
>
> ·         Using CatalogTableIdentifier to separate v1 and v2 code paths
> and avoid unintended behavior changes
>
> As I noted in the summary last time, please send topics ahead of time so
> we can get started more quickly.
>
> If you would like to be added to the google hangout invite, please let me
> know and I’ll add you. Thanks!
>
> rb
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Jayesh,

The current catalog in Spark is a little weird. It uses a Hive catalog and
adds metadata that only Spark understands to track tables, in addition to
regular Hive tables. Some of those tables are actually just pointers to
tables that exist in some other source of truth. This is what makes the
default implementation “generic”.

The reason why Spark is this way is that it only supports one global
catalog. I want Spark to be able to use multiple catalogs, so I can plug in
a catalog implementation that talks directly to the source of truth for
another system, like Cassandra or JDBC. But Cassandra tracks Cassandra
tables and wouldn’t be a generic catalog.

I also think that the easiest way to expose multiple catalogs is to simply
let users specify which one they want to interact with:

SELECT * FROM jdbc_test.db.table;

USE CATALOG cassandra_prod;
SELECT * FROM some_c_table;

The public-facing user interaction part is still open for discussion and I
think Xiao has a different opinion.

The work I’ve posed solves two different problems:

   1. How should Spark interact with catalog implementations? For tables,
   I’ve proposed the TableCatalog API and I wrote an SPIP explaining why.
   2. How should Spark internally track tables in different catalogs? For
   this, I’ve proposed a CatalogTableIdentifier and outlined how to keep these
   out of paths that don’t expect tables in other catalogs.

Spark catalog be the common denominator of the other catalogs (least
featured) or a super-feature catalog?

I think anything we want to expose in Spark would need to be in an API
implemented by catalogs. I prefer an incremental approach here: as we build
things that Spark can take advantage of, we can add them to the API that
catalog implementations provide.

On Sat, Dec 1, 2018 at 9:10 PM Thakrar, Jayesh jthakrar@conversantmedia.com
<ht...@conversantmedia.com> wrote:

Just curious on the need for a catalog within Spark.
>
>
>
> So Spark interface with other systems – many of which have a catalog of
> their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS,
> filesyststem, etc).
>
> So what is the purpose of having this catalog within Spark for tables
> defined in Spark (which could be a front for other “catalogs”)?
>
> Is it trying to fulfill some void/need…..
>
> Also, would the Spark catalog be the common denominator of the other
> catalogs (least featured) or a super-feature catalog?
>
>
>
> *From: *Xiao Li <ga...@gmail.com>
> *Date: *Saturday, December 1, 2018 at 10:49 PM
> *To: *Ryan Blue <rb...@netflix.com>
> *Cc: *"user@spark.apache.org" <de...@spark.apache.org>
> *Subject: *Re: DataSourceV2 community sync #3
>
>
>
> Hi, Ryan,
>
>
>
> Let us first focus on answering the most fundamental problem before
> discussing various related topics. What is a catalog in Spark SQL?
>
>
>
> My definition of catalog is based on the database catalog. Basically, the
> catalog provides a service that manage the metadata/definitions of database
> objects (e.g., database, views, tables, functions, user roles, and so on).
>
>
>
> In Spark SQL, all the external objects accessed through our data source
> APIs are called "tables". I do not think we will expand the support in the
> near future. That means, the metadata we need from the external data
> sources are for table only.
>
>
>
> These data sources should not use the Catalog identifier to identify. That
> means, in "catalog.database.table", catalog is only used to identify the
> actual catalog instead of data sources.
>
>
>
> For a Spark cluster, we could mount multiple catalogs (e.g.,
> hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
> get the metadata of the tables, database, functions by accessing different
> catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2",
> "glue.db3.tab2". In the future, if Spark has its own catalog
> implementation, we might have something like, "spark_catalog1.db3.tab2".
> The catalog will be used for registering all the external data sources,
> various Spark UDFs and so on.
>
>
>
> At the same time, we should NOT mix the table-level data sources with
> catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1",
> "Hbase3.db1.tab2" will not appear.
>
>
>
> Do you agree on my definition of catalog in Spark SQL?
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午7:25写道：
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
> I’ve tried to open discussions on this for the last 6+ months because we
> need it. I understand that you’d like a comprehensive plan for supporting
> more than one catalog before moving forward, but I think most of us are
> okay with the incremental approach. It’s better to make progress toward the
> goal.
>
> In general, data source API V2 and catalog API should be orthogonal
> I agree with you, and they are. The API that Wenchen is working on for
> reading and writing data and the TableCatalog API are orthogonal efforts.
> As I said, they only overlap with the Table interface, and clearly tables
> loaded from a catalog need to be able to plug into the read/write API.
>
> The reason these two efforts are related is that the community voted to standardize
> logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> Those standard plans have well-defined behavior for operations like CTAS,
> instead of relying on the data source plugin to do … something undefined.
> To implement this, we need a way for Spark to create tables, drop tables,
> etc. That’s why we need a way for sources to plug in Table-related catalog
> operations. (Sorry if this was already clear. I know I talked about it at
> length in the first v2 sync up.)
>
> While the two APIs are orthogonal and serve different purposes,
> implementing common operations requires that we have both.
>
> I would not call it a table catalog. I do not expect the data source
> should/need to implement a catalog. Since you might want an atomic CTAS, we
> can improve the table metadata resolution logic to support it with
> different resolution priorities. For example, try to get the metadata from
> the external data source, if the table metadata is not available in the
> catalog.
>
> It sounds like your definition of a “catalog” is different. I think you’re
> referring to a global catalog? Could you explain what you’re talking about
> here?
>
> I’m talking about an API to interface with an external data source, which
> I think we need for the reasons I outlined above. I don’t care what we call
> it, but your comment seems to hint that there would be an API to look up
> tables in external sources. That’s the thing I’m talking about.
>
> CatalogTableIdentifier: The PR is doing nothing but adding an interface.
>
> Yes. I opened this PR to discuss how Spark should track tables from
> different catalogs and avoid passing those references to code paths that
> don’t support them. The use of table identifiers with a catalog part was
> discussed in the “Multiple catalog support” thread. I’ve also brought it up
> and pointed out how I think it should be used in syncs a couple of times.
>
> Sorry if this discussion isn’t how you would have done it, but it’s a
> fairly simple idea that I don’t think requires its own doc.
>
>
>
> On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
>
>
> -  *CatalogTableIdentifier*: The PR
> https://github.com/apache/spark/pull/21978 is doing nothing but adding an
> interface. In the PR, we did not discuss how to resolve it, any restriction
> on the naming and what is a catalog.This requires more doc for explaining
> it. For example,
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
> Normally, we do not merge a PR without showing how to use it.
>
>
>
> - *TableCatalog*: First, I would not call it a table catalog. I do not
> expect the data source should/need to implement a catalog. Since you might
> want an atomic CTAS, we can improve the table metadata resolution logic to
> support it with different resolution priorities. For example, try to get
> the metadata from the external data source, if the table metadata is not
> available in the catalog. However, the catalog should do what the catalog
> is expected to do. If we follow what our data source API V2 is doing,
> basically, the data source is just a table. It is not related to database,
> view, or function. Mixing catalog with data source API V2 just makes the
> whole things more complex.
>
>
>
> In general, data source API V2 and catalog API should be orthogonal. I
> believe the data source API V2 and catalog APIs are two separate projects.
> Hopefully, you understand my concern. If we really want to mix them
> together, I want to read the design of your multi-catalog support and
> understand more details.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>
> Xiao,
>
>
>
> I do have opinions about how multi-catalog support should work, but I
> don't think we are at a point where there is consensus. That's why I've
> started discussion threads and added the CatalogTableIdentifier PR instead
> of a comprehensive design doc. You have opinions about how users should
> interact with catalogs as well (your "federated catalog") and we should
> discuss our options here.
>
>
>
> But the crucial point is that the user interaction doesn't need to be
> completely decided in order to move forward. A design for multi-catalog
> support isn't what we need right now; we need an API that plugins can
> implement to expose table operations.
>
>
>
> I've proposed that API, TableCatalog, and a way to manage catalog plugins.
> I've made an argument for why I think that API is flexible enough for the
> task and still fairly simple.
>
>
>
> I think that we can add TableCatalog now and work on multi-catalog support
> incrementally, and I have yet to hear your argument for why that is not the
> case.
>
>
>
> rb
>
>
>
> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Ryan,
>
>
>
> Catalog is a really important component for Spark SQL or any analytics
> platform, I have to emphasize. Thus, a careful design is needed to ensure
> it works as expected. Based on my previous discussion with many community
> members, Spark SQL needs a catalog interface so that we can mount multiple
> external physical catalogs and they can be presented as a single logical
> catalog [which is a so-called global federated catalog]. In the future, we
> can use this interface to develop our own catalog (instead of Hive
> metastore) for more efficient metadata management. We can also plug in ACL
> management if needed.
>
>
>
> Based on your previous answers, it sounds like you have many ideas in your
> mind about building a Catalog interface for Spark SQL, but it is not shown
> in the design doc. Could you write them down in a single doc? We can try to
> leave comments in the design doc, instead of discussing various issues in
> PRs, emails and meetings. It can also help the whole community understand
> your proposal and post their comments.
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>
> Xiao,
>
> For the questions in this last email about how catalogs interact and how
> functions and other future features work: we discussed those last night. As
> I said then, I think that the right approach is incremental. We don’t want
> to design all of that in one gigantic proposal up front. To do that is to
> put ourselves into analysis paralysis.
>
> We don’t have a design for how catalogs interact with one another, but I
> think we made a strong case for two points: first, that the proposed
> structure doesn’t preclude any of those future decisions (hence we should
> proceed incrementally). Second, that those situations aren’t that hard to
> think through if you’re concerned about them: functions that can run in
> Spark can be run on any data, functions that run in external sources cannot
> be run on any data.
>
> You’re right that I haven’t completely covered your *new* questions. But
> to the questions in your first email:
>
> ·         You asked how, for example, Glue may be plugged in. That is
> well covered in the PR that adds catalogs as a plugin
> <https://github.com/apache/spark/pull/21306#issue-187572913>, the
> response I sent to Wenchen’s questions, and the earlier discussion thread I
> posted to this list with the subject “[DISCUSS] Multiple catalog support”.
> The short answer is that implementations are configured with Spark config
> properties and loaded with reflection.
>
> ·         You asked how users implement an external catalog without
> adding new data sources. That’s also covered in the “Multiple catalog
> support” proposal, the table catalog PR, and ongoing discussions on the v2
> redesign. The answer is that a catalog returns a table instance that
> implements the various interfaces from Wenchen’s work. A table may
> implement them directly or return other existing implementations. Here’s
> how it worked in the old API
> <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
> .
>
> I hope that you don’t think I expect you to go “without seeing the design”!
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> All the proposal I read is only related to Table metadata. Catalog
> contains the metadata of database, functions, columns, views, and so on.
> When we have multiple catalogs, how these catalogs interact with each
> other? How the global catalog works? How a view, table, function, database
> and column is resolved? Do we have nickname, mapping, wrapper?
>
>
>
> Or I might miss the design docs you send? Could you post the doc?
>
>
>
> Thanks,
>
>
>
> Xiao
>
>
>
>
>
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>
> Xiao,
>
>
>
> Please have a look at the pull requests and documents I've posted over the
> last few months.
>
>
>
> If you still have questions about how you might plug in Glue, let me know
> and I can clarify.
>
>
>
> rb
>
>
>
> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>
> Ryan,
>
>
>
> Thanks for leading the discussion and sending out the memo!
>
>
>
> Xiao suggested that there are restrictions for how tables and functions
> interact. Because of this, he doesn’t think that separate TableCatalog and
> FunctionCatalog APIs are feasible.
>
>
>
> Anything is possible. It depends on how we design the two interfaces. Now,
> most parts are unknown to me without seeing the design.
>
>
>
> I think we need to see the user stories, and high-level design before
> working on a small portion of Catalog federation. We do not need an
> exhaustive design in the current stage, but we need to know how the new
> proposal works. For example, how to plug in a new Hive metastore? How to
> plug in a Glue? How do users implement a new external catalog without
> adding any new data sources? Without knowing more details, it is hard to
> say whether this TableCatalog can satisfy all the requirements.
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
>
>
> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>
> Hi everyone,
>
> Here are my notes from last night’s sync. Some attendees that joined
> during discussion may be missing, since I made the list while we were
> waiting for people to join.
>
> If you have topic suggestions for the next sync, please start sending them
> to me. Thank you!
>
> *Attendees:*
>
> Ryan Blue
> John Zhuge
> Jamison Bennett
> Yuanjian Li
> Xiao Li
> stczwd
> Matt Cheah
> Wenchen Fan
> Genglian Wang
> Kevin Yu
> Maryann Xue
> Cody Koeninger
> Bruce Robbins
> Rohit Karlupia
>
> *Agenda:*
>
> ·         Follow-up issues or discussion on Wenchen’s PR #23086
>
> ·         TableCatalog proposal
>
> ·         CatalogTableIdentifier
>
> *Notes:*
>
> ·         Discussion about PR #23086
>
> o    Where should the catalog API live since it needs to be accessible to
> catalyst rules, but the catalyst module is private?
>
> o    Wenchen suggested creating a sql-api module for v2 API interfaces,
> making catalyst depend on it
>
> o    Consensus was to use Wenchen’s suggestion
>
> ·         In discussion about #23086, Xiao asked how adding catalog to a
> table identifier will work
>
> o    Background from Ryan: existing code paths use TableIdentifier and
> don’t expect a catalog portion. If an identifier with a catalog were passed
> to existing code, that code may use the default catalog not knowing that a
> different one was requested, which would be incorrect behavior.
>
> o    Ryan: The proposal for CatalogTableIdentifier addresses this
> problem. TableIdentifier is used for identifiers that have no catalog set.
> By enforcing that requirement, passing a TableIdentifier to old code
> ensures that no catalogs leak into that code. This is also used when the
> catalog is set from context. For example, the TableCatalog API accepts only
> TableIdentifier because the catalog is already determined.
>
> ·         Xiao asked whether FunctionIdentifier needs to be updated in
> the same way as CatalogTableIdentifier.
>
> o    Ryan: Yes, when a FunctionCatalog API is added
>
> ·         The remaining time was spent discussing whether the plan to
> incrementally replace the current catalog API will work. [Not great notes
> here, feel free to add your take in a reply]
>
> o    Xiao suggested that there are restrictions for how tables and
> functions interact. Because of this, he doesn’t think that separate
> TableCatalog and FunctionCatalog APIs are feasible.
>
> o    Wenchen and Ryan think that functions should be orthogonal to data
> sources
>
> o    Matt and Ryan think that catalog design can be done incrementally as
> new interfaces (i.e. FunctionCatalog) are added and that the proposed
> TableCatalog does not preclude designing for Xiao’s concerns later
>
> o    [I forget who] pointed out that there are restrictions in some
> databases for views from different sources
>
> o    There was some discussion about when functions or views cannot be
> orthogonal. For example, where the code runs is important. Functions pushed
> to sources cannot necessarily be run on other sources and Spark functions
> cannot necessarily be pushed down to sources.
>
> o    Xiao would like a full catalog replacement design, including views,
> databases, and functions and how they interact, before moving forward with
> the proposed TableCatalog API
>
> o    Ryan [and Matt, I think] think that TableCatalog is compatible with
> future decisions and the best path forward is to build incrementally. An
> exhaustive design process blocks progress on v2.
>
>
>
> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>
> Hi everyone,
>
> I just sent out an invite for the next DSv2 community sync for Wednesday,
> 28 Nov at 5PM PST.
>
> We have a few topics left over from last time to cover. A few people
> wanted to cover catalog APIs, so I put two items on the agenda:
>
> ·         The TableCatalog proposal (and other catalog APIs)
>
> ·         Using CatalogTableIdentifier to separate v1 and v2 code paths
> and avoid unintended behavior changes
>
> As I noted in the summary last time, please send topics ahead of time so
> we can get started more quickly.
>
> If you would like to be added to the google hangout invite, please let me
> know and I’ll add you. Thanks!
>
> rb
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
> --
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.

Just curious on the need for a catalog within Spark.

So Spark interface with other systems – many of which have a catalog of their own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS, filesyststem, etc).
So what is the purpose of having this catalog within Spark for tables defined in Spark (which could be a front for other “catalogs”)?
Is it trying to fulfill some void/need…..
Also, would the Spark catalog be the common denominator of the other catalogs (least featured) or a super-feature catalog?

From: Xiao Li <ga...@gmail.com>
Date: Saturday, December 1, 2018 at 10:49 PM
To: Ryan Blue <rb...@netflix.com>
Cc: "user@spark.apache.org" <de...@spark.apache.org>
Subject: Re: DataSourceV2 community sync #3

Hi, Ryan,

Let us first focus on answering the most fundamental problem before discussing various related topics. What is a catalog in Spark SQL?

My definition of catalog is based on the database catalog. Basically, the catalog provides a service that manage the metadata/definitions of database objects (e.g., database, views, tables, functions, user roles, and so on).

In Spark SQL, all the external objects accessed through our data source APIs are called "tables". I do not think we will expand the support in the near future. That means, the metadata we need from the external data sources are for table only.

These data sources should not use the Catalog identifier to identify. That means, in "catalog.database.table", catalog is only used to identify the actual catalog instead of data sources.

For a Spark cluster, we could mount multiple catalogs (e.g., hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could get the metadata of the tables, database, functions by accessing different catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2", "glue.db3.tab2". In the future, if Spark has its own catalog implementation, we might have something like, "spark_catalog1.db3.tab2". The catalog will be used for registering all the external data sources, various Spark UDFs and so on.

At the same time, we should NOT mix the table-level data sources with catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1", "Hbase3.db1.tab2" will not appear.

Do you agree on my definition of catalog in Spark SQL?

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午7:25写道：

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

I’ve tried to open discussions on this for the last 6+ months because we need it. I understand that you’d like a comprehensive plan for supporting more than one catalog before moving forward, but I think most of us are okay with the incremental approach. It’s better to make progress toward the goal.

In general, data source API V2 and catalog API should be orthogonal
I agree with you, and they are. The API that Wenchen is working on for reading and writing data and the TableCatalog API are orthogonal efforts. As I said, they only overlap with the Table interface, and clearly tables loaded from a catalog need to be able to plug into the read/write API.

The reason these two efforts are related is that the community voted to standardize logical plans<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. Those standard plans have well-defined behavior for operations like CTAS, instead of relying on the data source plugin to do … something undefined. To implement this, we need a way for Spark to create tables, drop tables, etc. That’s why we need a way for sources to plug in Table-related catalog operations. (Sorry if this was already clear. I know I talked about it at length in the first v2 sync up.)

While the two APIs are orthogonal and serve different purposes, implementing common operations requires that we have both.

I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog.

It sounds like your definition of a “catalog” is different. I think you’re referring to a global catalog? Could you explain what you’re talking about here?

I’m talking about an API to interface with an external data source, which I think we need for the reasons I outlined above. I don’t care what we call it, but your comment seems to hint that there would be an API to look up tables in external sources. That’s the thing I’m talking about.

CatalogTableIdentifier: The PR is doing nothing but adding an interface.

Yes. I opened this PR to discuss how Spark should track tables from different catalogs and avoid passing those references to code paths that don’t support them. The use of table identifiers with a catalog part was discussed in the “Multiple catalog support” thread. I’ve also brought it up and pointed out how I think it should be used in syncs a couple of times.

Sorry if this discussion isn’t how you would have done it, but it’s a fairly simple idea that I don’t think requires its own doc.

On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

I try to avoid discussing each specific topic about the catalog federation before we deciding the framework of multi-catalog supports.

-  CatalogTableIdentifier: The PR https://github.com/apache/spark/pull/21978 is doing nothing but adding an interface. In the PR, we did not discuss how to resolve it, any restriction on the naming and what is a catalog.This requires more doc for explaining it. For example, https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017 Normally, we do not merge a PR without showing how to use it.

- TableCatalog: First, I would not call it a table catalog. I do not expect the data source should/need to implement a catalog. Since you might want an atomic CTAS, we can improve the table metadata resolution logic to support it with different resolution priorities. For example, try to get the metadata from the external data source, if the table metadata is not available in the catalog. However, the catalog should do what the catalog is expected to do. If we follow what our data source API V2 is doing, basically, the data source is just a table. It is not related to database, view, or function. Mixing catalog with data source API V2 just makes the whole things more complex.

In general, data source API V2 and catalog API should be orthogonal. I believe the data source API V2 and catalog APIs are two separate projects. Hopefully, you understand my concern. If we really want to mix them together, I want to read the design of your multi-catalog support and understand more details.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年12月1日周六 下午3:22写道：
Xiao,

I do have opinions about how multi-catalog support should work, but I don't think we are at a point where there is consensus. That's why I've started discussion threads and added the CatalogTableIdentifier PR instead of a comprehensive design doc. You have opinions about how users should interact with catalogs as well (your "federated catalog") and we should discuss our options here.

But the crucial point is that the user interaction doesn't need to be completely decided in order to move forward. A design for multi-catalog support isn't what we need right now; we need an API that plugins can implement to expose table operations.

I've proposed that API, TableCatalog, and a way to manage catalog plugins. I've made an argument for why I think that API is flexible enough for the task and still fairly simple.

I think that we can add TableCatalog now and work on multi-catalog support incrementally, and I have yet to hear your argument for why that is not the case.

rb

On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com>> wrote:
Hi, Ryan,

Catalog is a really important component for Spark SQL or any analytics platform, I have to emphasize. Thus, a careful design is needed to ensure it works as expected. Based on my previous discussion with many community members, Spark SQL needs a catalog interface so that we can mount multiple external physical catalogs and they can be presented as a single logical catalog [which is a so-called global federated catalog]. In the future, we can use this interface to develop our own catalog (instead of Hive metastore) for more efficient metadata management. We can also plug in ACL management if needed.

Based on your previous answers, it sounds like you have many ideas in your mind about building a Catalog interface for Spark SQL, but it is not shown in the design doc. Could you write them down in a single doc? We can try to leave comments in the design doc, instead of discussing various issues in PRs, emails and meetings. It can also help the whole community understand your proposal and post their comments.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午7:06写道：

Xiao,

For the questions in this last email about how catalogs interact and how functions and other future features work: we discussed those last night. As I said then, I think that the right approach is incremental. We don’t want to design all of that in one gigantic proposal up front. To do that is to put ourselves into analysis paralysis.

We don’t have a design for how catalogs interact with one another, but I think we made a strong case for two points: first, that the proposed structure doesn’t preclude any of those future decisions (hence we should proceed incrementally). Second, that those situations aren’t that hard to think through if you’re concerned about them: functions that can run in Spark can be run on any data, functions that run in external sources cannot be run on any data.

You’re right that I haven’t completely covered your new questions. But to the questions in your first email:
·         You asked how, for example, Glue may be plugged in. That is well covered in the PR that adds catalogs as a plugin<https://github.com/apache/spark/pull/21306#issue-187572913>, the response I sent to Wenchen’s questions, and the earlier discussion thread I posted to this list with the subject “[DISCUSS] Multiple catalog support”. The short answer is that implementations are configured with Spark config properties and loaded with reflection.
·         You asked how users implement an external catalog without adding new data sources. That’s also covered in the “Multiple catalog support” proposal, the table catalog PR, and ongoing discussions on the v2 redesign. The answer is that a catalog returns a table instance that implements the various interfaces from Wenchen’s work. A table may implement them directly or return other existing implementations. Here’s how it worked in the old API<https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>.

I hope that you don’t think I expect you to go “without seeing the design”!

rb

On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

All the proposal I read is only related to Table metadata. Catalog contains the metadata of database, functions, columns, views, and so on. When we have multiple catalogs, how these catalogs interact with each other? How the global catalog works? How a view, table, function, database and column is resolved? Do we have nickname, mapping, wrapper?

Or I might miss the design docs you send? Could you post the doc?

Thanks,

Xiao

Ryan Blue <rb...@netflix.com>> 于2018年11月29日周四 下午3:06写道：
Xiao,

Please have a look at the pull requests and documents I've posted over the last few months.

If you still have questions about how you might plug in Glue, let me know and I can clarify.

rb

On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com>> wrote:
Ryan,

Thanks for leading the discussion and sending out the memo!

Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.

Anything is possible. It depends on how we design the two interfaces. Now, most parts are unknown to me without seeing the design.

I think we need to see the user stories, and high-level design before working on a small portion of Catalog federation. We do not need an exhaustive design in the current stage, but we need to know how the new proposal works. For example, how to plug in a new Hive metastore? How to plug in a Glue? How do users implement a new external catalog without adding any new data sources? Without knowing more details, it is hard to say whether this TableCatalog can satisfy all the requirements.

Cheers,

Xiao

Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：

Hi everyone,

Here are my notes from last night’s sync. Some attendees that joined during discussion may be missing, since I made the list while we were waiting for people to join.

If you have topic suggestions for the next sync, please start sending them to me. Thank you!

Attendees:

Ryan Blue
John Zhuge
Jamison Bennett
Yuanjian Li
Xiao Li
stczwd
Matt Cheah
Wenchen Fan
Genglian Wang
Kevin Yu
Maryann Xue
Cody Koeninger
Bruce Robbins
Rohit Karlupia

Agenda:
·         Follow-up issues or discussion on Wenchen’s PR #23086
·         TableCatalog proposal
·         CatalogTableIdentifier

Notes:
·         Discussion about PR #23086
o    Where should the catalog API live since it needs to be accessible to catalyst rules, but the catalyst module is private?
o    Wenchen suggested creating a sql-api module for v2 API interfaces, making catalyst depend on it
o    Consensus was to use Wenchen’s suggestion
·         In discussion about #23086, Xiao asked how adding catalog to a table identifier will work
o    Background from Ryan: existing code paths use TableIdentifier and don’t expect a catalog portion. If an identifier with a catalog were passed to existing code, that code may use the default catalog not knowing that a different one was requested, which would be incorrect behavior.
o    Ryan: The proposal for CatalogTableIdentifier addresses this problem. TableIdentifier is used for identifiers that have no catalog set. By enforcing that requirement, passing a TableIdentifier to old code ensures that no catalogs leak into that code. This is also used when the catalog is set from context. For example, the TableCatalog API accepts only TableIdentifier because the catalog is already determined.
·         Xiao asked whether FunctionIdentifier needs to be updated in the same way as CatalogTableIdentifier.
o    Ryan: Yes, when a FunctionCatalog API is added
·         The remaining time was spent discussing whether the plan to incrementally replace the current catalog API will work. [Not great notes here, feel free to add your take in a reply]
o    Xiao suggested that there are restrictions for how tables and functions interact. Because of this, he doesn’t think that separate TableCatalog and FunctionCatalog APIs are feasible.
o    Wenchen and Ryan think that functions should be orthogonal to data sources
o    Matt and Ryan think that catalog design can be done incrementally as new interfaces (i.e. FunctionCatalog) are added and that the proposed TableCatalog does not preclude designing for Xiao’s concerns later
o    [I forget who] pointed out that there are restrictions in some databases for views from different sources
o    There was some discussion about when functions or views cannot be orthogonal. For example, where the code runs is important. Functions pushed to sources cannot necessarily be run on other sources and Spark functions cannot necessarily be pushed down to sources.
o    Xiao would like a full catalog replacement design, including views, databases, and functions and how they interact, before moving forward with the proposed TableCatalog API
o    Ryan [and Matt, I think] think that TableCatalog is compatible with future decisions and the best path forward is to build incrementally. An exhaustive design process blocks progress on v2.

On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>> wrote:

Hi everyone,

I just sent out an invite for the next DSv2 community sync for Wednesday, 28 Nov at 5PM PST.

We have a few topics left over from last time to cover. A few people wanted to cover catalog APIs, so I put two items on the agenda:
·         The TableCatalog proposal (and other catalog APIs)
·         Using CatalogTableIdentifier to separate v1 and v2 code paths and avoid unintended behavior changes

As I noted in the summary last time, please send topics ahead of time so we can get started more quickly.

If you would like to be added to the google hangout invite, please let me know and I’ll add you. Thanks!

rb
--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Do you agree on my definition of catalog in Spark SQL?

I think we agree on what a catalog is: A service that can manage the
metadata and definitions of databases, views, tables, functions, roles, etc.

external objects accessed through our data source APIs are called “tables”.
I do not think we will expand the support in the near future. That means,
the metadata we need from the external data sources are for table only

Are you saying that there is no need to for a catalog plugin to expose
non-table objects like functions?

If so, I don’t think that’s correct. There are a couple of reasons why we
want to allow catalog plugins to expose functions:

   1. I have an interest in making it easy to plug in UDF libraries
   2. So that we can move the current session catalog’s behavior behind the
   new API

That second point assumes that we want to use some internal API to expose
catalogs to Spark, and eventually want to it for all catalog operations.
Based on our previous conversations, I think we’re in agreement there? If
we introduce a catalog API so Spark can use multiple catalogs, all catalogs
should eventually use it, right?

You may have meant that some catalogs will only need to expose tables and
not things like views or functions. If that’s what you meant then I agree.
That’s one of the reasons why the proposed API for plugging in a catalog
covers only tables. The idea is that a catalog can implement multiple
interfaces to expose more objects to Spark, like TableCatalog,
FunctionCatalog, etc.

For a Spark cluster, we could mount multiple catalogs (e.g.,
hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
get the metadata of the tables, database, functions by accessing different
catalog: “hive_metastore_1.db1.tab1”, “hive_metastore_2.db2.tab2”,
“glue.db3.tab2”. In the future, if Spark has its own catalog
implementation, we might have something like, “spark_catalog1.db3.tab2”

I agree with this.

Being able to address tables in different catalogs is the purpose of the
CatalogTableIdentifier PR. In that work, I propose a way to add the catalog
portion, but guarantee that classes that expect an identifier without a
source are guaranteed to do the right thing. The idea is to continue to use
TableIdentifier in those code paths, and in any new code path where the
catalog is determined (like passing an identifier to a catalog — that would
be a TableIdentifier).

The catalog will be used for registering all the external data sources,
various Spark UDFs and so on.

I’m not sure what you mean by “external data source” here. I think you mean
data source, like the “parquet” or “jdbc” sources. If so, that statement
doesn’t make sense in the context of what we’ve been working toward in the
v2 source (read/write) API.

The read/write API depends on a Table that exposes newScanBuilder (I’ll use
just the read side here). Whatever loaded the table is what returns the
implementation.

The component loading a table could be a “generic” catalog like we have
today that stores a data source name like “parquet”. That would load the
source like v1 and use the TableProvider API to instantiate a Table.

The loader may also be a “specific” catalog that can load one or only a few
implementations. For example, a catalog plugin that connects to Cassandra
would load only Cassandra tables and should not be expected to track
file-based tables.

A catalog used to “register” external sources sounds like the generic
catalog. But we need the ability to use specific catalogs as well, to do
real operations on external sources. Right now, the generic catalog we have
only allows you to track metadata about external tables. It doesn’t allow
you to interact with other catalogs that are the source of truth. We need
to be able to interact with those sources of truth. For example, to create
a table.

we should NOT mix the table-level data sources with catalog support. That
means, “Cassandra1.db1.tab1”, “Kafka2.db2.tab1”, “Hbase3.db1.tab2” will not
appear.

I’m not sure what you mean. Where would “cassandra1.db1.tab1 not appear?

I agree that catalogs and sources should be orthogonal. Like I said,
catalog plugins expose tables and those tables can be read using Wenchen’s
read/write API.

Do you mean that catalogs shouldn’t load tables?

On Sat, Dec 1, 2018 at 8:49 PM Xiao Li <ga...@gmail.com> wrote:

Hi, Ryan,
>
> Let us first focus on answering the most fundamental problem before
> discussing various related topics. What is a catalog in Spark SQL?
>
> My definition of catalog is based on the database catalog. Basically, the
> catalog provides a service that manage the metadata/definitions of database
> objects (e.g., database, views, tables, functions, user roles, and so on).
>
> In Spark SQL, all the external objects accessed through our data source
> APIs are called "tables". I do not think we will expand the support in the
> near future. That means, the metadata we need from the external data
> sources are for table only.
>
> These data sources should not use the Catalog identifier to identify. That
> means, in "catalog.database.table", catalog is only used to identify the
> actual catalog instead of data sources.
>
> For a Spark cluster, we could mount multiple catalogs (e.g.,
> hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
> get the metadata of the tables, database, functions by accessing different
> catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2",
> "glue.db3.tab2". In the future, if Spark has its own catalog
> implementation, we might have something like, "spark_catalog1.db3.tab2".
> The catalog will be used for registering all the external data sources,
> various Spark UDFs and so on.
>
> At the same time, we should NOT mix the table-level data sources with
> catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1",
> "Hbase3.db1.tab2" will not appear.
>
> Do you agree on my definition of catalog in Spark SQL?
>
> Xiao
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午7:25写道：
>
>> I try to avoid discussing each specific topic about the catalog
>> federation before we deciding the framework of multi-catalog supports.
>>
>> I’ve tried to open discussions on this for the last 6+ months because we
>> need it. I understand that you’d like a comprehensive plan for supporting
>> more than one catalog before moving forward, but I think most of us are
>> okay with the incremental approach. It’s better to make progress toward the
>> goal.
>>
>> In general, data source API V2 and catalog API should be orthogonal
>> I agree with you, and they are. The API that Wenchen is working on for
>> reading and writing data and the TableCatalog API are orthogonal efforts.
>> As I said, they only overlap with the Table interface, and clearly tables
>> loaded from a catalog need to be able to plug into the read/write API.
>>
>> The reason these two efforts are related is that the community voted to standardize
>> logical plans
>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>> Those standard plans have well-defined behavior for operations like CTAS,
>> instead of relying on the data source plugin to do … something undefined.
>> To implement this, we need a way for Spark to create tables, drop tables,
>> etc. That’s why we need a way for sources to plug in Table-related catalog
>> operations. (Sorry if this was already clear. I know I talked about it at
>> length in the first v2 sync up.)
>>
>> While the two APIs are orthogonal and serve different purposes,
>> implementing common operations requires that we have both.
>>
>> I would not call it a table catalog. I do not expect the data source
>> should/need to implement a catalog. Since you might want an atomic CTAS, we
>> can improve the table metadata resolution logic to support it with
>> different resolution priorities. For example, try to get the metadata from
>> the external data source, if the table metadata is not available in the
>> catalog.
>>
>> It sounds like your definition of a “catalog” is different. I think
>> you’re referring to a global catalog? Could you explain what you’re talking
>> about here?
>>
>> I’m talking about an API to interface with an external data source, which
>> I think we need for the reasons I outlined above. I don’t care what we call
>> it, but your comment seems to hint that there would be an API to look up
>> tables in external sources. That’s the thing I’m talking about.
>>
>> CatalogTableIdentifier: The PR is doing nothing but adding an interface.
>>
>> Yes. I opened this PR to discuss how Spark should track tables from
>> different catalogs and avoid passing those references to code paths that
>> don’t support them. The use of table identifiers with a catalog part was
>> discussed in the “Multiple catalog support” thread. I’ve also brought it up
>> and pointed out how I think it should be used in syncs a couple of times.
>>
>> Sorry if this discussion isn’t how you would have done it, but it’s a
>> fairly simple idea that I don’t think requires its own doc.
>>
>> On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:
>>
>>> Hi, Ryan,
>>>
>>> I try to avoid discussing each specific topic about the catalog
>>> federation before we deciding the framework of multi-catalog supports.
>>>
>>> -  *CatalogTableIdentifier*: The PR
>>> https://github.com/apache/spark/pull/21978 is doing nothing but adding
>>> an interface. In the PR, we did not discuss how to resolve it, any
>>> restriction on the naming and what is a catalog.This requires more doc for
>>> explaining it. For example,
>>> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
>>> Normally, we do not merge a PR without showing how to use it.
>>>
>>> - *TableCatalog*: First, I would not call it a table catalog. I do not
>>> expect the data source should/need to implement a catalog. Since you might
>>> want an atomic CTAS, we can improve the table metadata resolution logic to
>>> support it with different resolution priorities. For example, try to get
>>> the metadata from the external data source, if the table metadata is not
>>> available in the catalog. However, the catalog should do what the catalog
>>> is expected to do. If we follow what our data source API V2 is doing,
>>> basically, the data source is just a table. It is not related to database,
>>> view, or function. Mixing catalog with data source API V2 just makes the
>>> whole things more complex.
>>>
>>> In general, data source API V2 and catalog API should be orthogonal. I
>>> believe the data source API V2 and catalog APIs are two separate projects.
>>> Hopefully, you understand my concern. If we really want to mix them
>>> together, I want to read the design of your multi-catalog support and
>>> understand more details.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>>>
>>>> Xiao,
>>>>
>>>> I do have opinions about how multi-catalog support should work, but I
>>>> don't think we are at a point where there is consensus. That's why I've
>>>> started discussion threads and added the CatalogTableIdentifier PR instead
>>>> of a comprehensive design doc. You have opinions about how users should
>>>> interact with catalogs as well (your "federated catalog") and we should
>>>> discuss our options here.
>>>>
>>>> But the crucial point is that the user interaction doesn't need to be
>>>> completely decided in order to move forward. A design for multi-catalog
>>>> support isn't what we need right now; we need an API that plugins can
>>>> implement to expose table operations.
>>>>
>>>> I've proposed that API, TableCatalog, and a way to manage catalog
>>>> plugins. I've made an argument for why I think that API is flexible enough
>>>> for the task and still fairly simple.
>>>>
>>>> I think that we can add TableCatalog now and work on multi-catalog
>>>> support incrementally, and I have yet to hear your argument for why that is
>>>> not the case.
>>>>
>>>> rb
>>>>
>>>> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>>>>
>>>>> Hi, Ryan,
>>>>>
>>>>> Catalog is a really important component for Spark SQL or any analytics
>>>>> platform, I have to emphasize. Thus, a careful design is needed to ensure
>>>>> it works as expected. Based on my previous discussion with many community
>>>>> members, Spark SQL needs a catalog interface so that we can mount multiple
>>>>> external physical catalogs and they can be presented as a single logical
>>>>> catalog [which is a so-called global federated catalog]. In the future, we
>>>>> can use this interface to develop our own catalog (instead of Hive
>>>>> metastore) for more efficient metadata management. We can also plug in ACL
>>>>> management if needed.
>>>>>
>>>>> Based on your previous answers, it sounds like you have many ideas in
>>>>> your mind about building a Catalog interface for Spark SQL, but it is not
>>>>> shown in the design doc. Could you write them down in a single doc? We can
>>>>> try to leave comments in the design doc, instead of discussing various
>>>>> issues in PRs, emails and meetings. It can also help the whole community
>>>>> understand your proposal and post their comments.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>>
>>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>>>>>
>>>>>> Xiao,
>>>>>>
>>>>>> For the questions in this last email about how catalogs interact and
>>>>>> how functions and other future features work: we discussed those last
>>>>>> night. As I said then, I think that the right approach is incremental. We
>>>>>> don’t want to design all of that in one gigantic proposal up front. To do
>>>>>> that is to put ourselves into analysis paralysis.
>>>>>>
>>>>>> We don’t have a design for how catalogs interact with one another,
>>>>>> but I think we made a strong case for two points: first, that the proposed
>>>>>> structure doesn’t preclude any of those future decisions (hence we should
>>>>>> proceed incrementally). Second, that those situations aren’t that hard to
>>>>>> think through if you’re concerned about them: functions that can run in
>>>>>> Spark can be run on any data, functions that run in external sources cannot
>>>>>> be run on any data.
>>>>>>
>>>>>> You’re right that I haven’t completely covered your *new* questions.
>>>>>> But to the questions in your first email:
>>>>>>
>>>>>>    - You asked how, for example, Glue may be plugged in. That is
>>>>>>    well covered in the PR that adds catalogs as a plugin
>>>>>>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>>>>>>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>>>>>>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>>>>>>    The short answer is that implementations are configured with Spark config
>>>>>>    properties and loaded with reflection.
>>>>>>    - You asked how users implement an external catalog without
>>>>>>    adding new data sources. That’s also covered in the “Multiple catalog
>>>>>>    support” proposal, the table catalog PR, and ongoing discussions on the v2
>>>>>>    redesign. The answer is that a catalog returns a table instance that
>>>>>>    implements the various interfaces from Wenchen’s work. A table may
>>>>>>    implement them directly or return other existing implementations. Here’s
>>>>>>    how it worked in the old API
>>>>>>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>>>>>>    .
>>>>>>
>>>>>> I hope that you don’t think I expect you to go “without seeing the
>>>>>> design”!
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>>>>>>
>>>>>>> Ryan,
>>>>>>>
>>>>>>> All the proposal I read is only related to Table metadata. Catalog
>>>>>>> contains the metadata of database, functions, columns, views, and so on.
>>>>>>> When we have multiple catalogs, how these catalogs interact with each
>>>>>>> other? How the global catalog works? How a view, table, function, database
>>>>>>> and column is resolved? Do we have nickname, mapping, wrapper?
>>>>>>>
>>>>>>> Or I might miss the design docs you send? Could you post the doc?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>>>>>>
>>>>>>>> Xiao,
>>>>>>>>
>>>>>>>> Please have a look at the pull requests and documents I've posted
>>>>>>>> over the last few months.
>>>>>>>>
>>>>>>>> If you still have questions about how you might plug in Glue, let
>>>>>>>> me know and I can clarify.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ryan,
>>>>>>>>>
>>>>>>>>> Thanks for leading the discussion and sending out the memo!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Xiao suggested that there are restrictions for how tables and
>>>>>>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>>>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Anything is possible. It depends on how we design the two
>>>>>>>>> interfaces. Now, most parts are unknown to me without seeing the design.
>>>>>>>>>
>>>>>>>>> I think we need to see the user stories, and high-level design
>>>>>>>>> before working on a small portion of Catalog federation. We do not need an
>>>>>>>>> exhaustive design in the current stage, but we need to know how the new
>>>>>>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>>>>>>> plug in a Glue? How do users implement a new external catalog without
>>>>>>>>> adding any new data sources? Without knowing more details, it is hard to
>>>>>>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Xiao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> Here are my notes from last night’s sync. Some attendees that
>>>>>>>>>> joined during discussion may be missing, since I made the list while we
>>>>>>>>>> were waiting for people to join.
>>>>>>>>>>
>>>>>>>>>> If you have topic suggestions for the next sync, please start
>>>>>>>>>> sending them to me. Thank you!
>>>>>>>>>>
>>>>>>>>>> *Attendees:*
>>>>>>>>>>
>>>>>>>>>> Ryan Blue
>>>>>>>>>> John Zhuge
>>>>>>>>>> Jamison Bennett
>>>>>>>>>> Yuanjian Li
>>>>>>>>>> Xiao Li
>>>>>>>>>> stczwd
>>>>>>>>>> Matt Cheah
>>>>>>>>>> Wenchen Fan
>>>>>>>>>> Genglian Wang
>>>>>>>>>> Kevin Yu
>>>>>>>>>> Maryann Xue
>>>>>>>>>> Cody Koeninger
>>>>>>>>>> Bruce Robbins
>>>>>>>>>> Rohit Karlupia
>>>>>>>>>>
>>>>>>>>>> *Agenda:*
>>>>>>>>>>
>>>>>>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>>>>>>    - TableCatalog proposal
>>>>>>>>>>    - CatalogTableIdentifier
>>>>>>>>>>
>>>>>>>>>> *Notes:*
>>>>>>>>>>
>>>>>>>>>>    - Discussion about PR #23086
>>>>>>>>>>       - Where should the catalog API live since it needs to be
>>>>>>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>>>>>>       interfaces, making catalyst depend on it
>>>>>>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>>>>>>    - In discussion about #23086, Xiao asked how adding catalog
>>>>>>>>>>    to a table identifier will work
>>>>>>>>>>       - Background from Ryan: existing code paths use
>>>>>>>>>>       TableIdentifier and don’t expect a catalog portion. If an identifier with a
>>>>>>>>>>       catalog were passed to existing code, that code may use the default catalog
>>>>>>>>>>       not knowing that a different one was requested, which would be incorrect
>>>>>>>>>>       behavior.
>>>>>>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses
>>>>>>>>>>       this problem. TableIdentifier is used for identifiers that have no catalog
>>>>>>>>>>       set. By enforcing that requirement, passing a TableIdentifier to old code
>>>>>>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>>>>>>       TableIdentifier because the catalog is already determined.
>>>>>>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated
>>>>>>>>>>    in the same way as CatalogTableIdentifier.
>>>>>>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>>>>>>    here, feel free to add your take in a reply]
>>>>>>>>>>       - Xiao suggested that there are restrictions for how
>>>>>>>>>>       tables and functions interact. Because of this, he doesn’t think that
>>>>>>>>>>       separate TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>>>>       - Wenchen and Ryan think that functions should be
>>>>>>>>>>       orthogonal to data sources
>>>>>>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>>>>>>       later
>>>>>>>>>>       - [I forget who] pointed out that there are restrictions
>>>>>>>>>>       in some databases for views from different sources
>>>>>>>>>>       - There was some discussion about when functions or views
>>>>>>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>>>>>>       - Xiao would like a full catalog replacement design,
>>>>>>>>>>       including views, databases, and functions and how they interact, before
>>>>>>>>>>       moving forward with the proposed TableCatalog API
>>>>>>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>>>>>>       compatible with future decisions and the best path forward is to build
>>>>>>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>>>>>>
>>>>>>>>>>> We have a few topics left over from last time to cover. A few
>>>>>>>>>>> people wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>>>>>>
>>>>>>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code
>>>>>>>>>>>    paths and avoid unintended behavior changes
>>>>>>>>>>>
>>>>>>>>>>> As I noted in the summary last time, please send topics ahead of
>>>>>>>>>>> time so we can get started more quickly.
>>>>>>>>>>>
>>>>>>>>>>> If you would like to be added to the google hangout invite,
>>>>>>>>>>> please let me know and I’ll add you. Thanks!
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
> --
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Hi, Ryan,

Let us first focus on answering the most fundamental problem before
discussing various related topics. What is a catalog in Spark SQL?

My definition of catalog is based on the database catalog. Basically, the
catalog provides a service that manage the metadata/definitions of database
objects (e.g., database, views, tables, functions, user roles, and so on).

In Spark SQL, all the external objects accessed through our data source
APIs are called "tables". I do not think we will expand the support in the
near future. That means, the metadata we need from the external data
sources are for table only.

These data sources should not use the Catalog identifier to identify. That
means, in "catalog.database.table", catalog is only used to identify the
actual catalog instead of data sources.

For a Spark cluster, we could mount multiple catalogs (e.g.,
hive_metastore_1, hive_metastore_2 and glue_1) at the same time. We could
get the metadata of the tables, database, functions by accessing different
catalog: "hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2",
"glue.db3.tab2". In the future, if Spark has its own catalog
implementation, we might have something like, "spark_catalog1.db3.tab2".
The catalog will be used for registering all the external data sources,
various Spark UDFs and so on.

At the same time, we should NOT mix the table-level data sources with
catalog support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1",
"Hbase3.db1.tab2" will not appear.

Do you agree on my definition of catalog in Spark SQL?

Xiao


Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午7:25写道：

> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
> I’ve tried to open discussions on this for the last 6+ months because we
> need it. I understand that you’d like a comprehensive plan for supporting
> more than one catalog before moving forward, but I think most of us are
> okay with the incremental approach. It’s better to make progress toward the
> goal.
>
> In general, data source API V2 and catalog API should be orthogonal
> I agree with you, and they are. The API that Wenchen is working on for
> reading and writing data and the TableCatalog API are orthogonal efforts.
> As I said, they only overlap with the Table interface, and clearly tables
> loaded from a catalog need to be able to plug into the read/write API.
>
> The reason these two efforts are related is that the community voted to standardize
> logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> Those standard plans have well-defined behavior for operations like CTAS,
> instead of relying on the data source plugin to do … something undefined.
> To implement this, we need a way for Spark to create tables, drop tables,
> etc. That’s why we need a way for sources to plug in Table-related catalog
> operations. (Sorry if this was already clear. I know I talked about it at
> length in the first v2 sync up.)
>
> While the two APIs are orthogonal and serve different purposes,
> implementing common operations requires that we have both.
>
> I would not call it a table catalog. I do not expect the data source
> should/need to implement a catalog. Since you might want an atomic CTAS, we
> can improve the table metadata resolution logic to support it with
> different resolution priorities. For example, try to get the metadata from
> the external data source, if the table metadata is not available in the
> catalog.
>
> It sounds like your definition of a “catalog” is different. I think you’re
> referring to a global catalog? Could you explain what you’re talking about
> here?
>
> I’m talking about an API to interface with an external data source, which
> I think we need for the reasons I outlined above. I don’t care what we call
> it, but your comment seems to hint that there would be an API to look up
> tables in external sources. That’s the thing I’m talking about.
>
> CatalogTableIdentifier: The PR is doing nothing but adding an interface.
>
> Yes. I opened this PR to discuss how Spark should track tables from
> different catalogs and avoid passing those references to code paths that
> don’t support them. The use of table identifiers with a catalog part was
> discussed in the “Multiple catalog support” thread. I’ve also brought it up
> and pointed out how I think it should be used in syncs a couple of times.
>
> Sorry if this discussion isn’t how you would have done it, but it’s a
> fairly simple idea that I don’t think requires its own doc.
>
> On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:
>
>> Hi, Ryan,
>>
>> I try to avoid discussing each specific topic about the catalog
>> federation before we deciding the framework of multi-catalog supports.
>>
>> -  *CatalogTableIdentifier*: The PR
>> https://github.com/apache/spark/pull/21978 is doing nothing but adding
>> an interface. In the PR, we did not discuss how to resolve it, any
>> restriction on the naming and what is a catalog.This requires more doc for
>> explaining it. For example,
>> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
>> Normally, we do not merge a PR without showing how to use it.
>>
>> - *TableCatalog*: First, I would not call it a table catalog. I do not
>> expect the data source should/need to implement a catalog. Since you might
>> want an atomic CTAS, we can improve the table metadata resolution logic to
>> support it with different resolution priorities. For example, try to get
>> the metadata from the external data source, if the table metadata is not
>> available in the catalog. However, the catalog should do what the catalog
>> is expected to do. If we follow what our data source API V2 is doing,
>> basically, the data source is just a table. It is not related to database,
>> view, or function. Mixing catalog with data source API V2 just makes the
>> whole things more complex.
>>
>> In general, data source API V2 and catalog API should be orthogonal. I
>> believe the data source API V2 and catalog APIs are two separate projects.
>> Hopefully, you understand my concern. If we really want to mix them
>> together, I want to read the design of your multi-catalog support and
>> understand more details.
>>
>> Thanks,
>>
>> Xiao
>>
>>
>>
>>
>> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>>
>>> Xiao,
>>>
>>> I do have opinions about how multi-catalog support should work, but I
>>> don't think we are at a point where there is consensus. That's why I've
>>> started discussion threads and added the CatalogTableIdentifier PR instead
>>> of a comprehensive design doc. You have opinions about how users should
>>> interact with catalogs as well (your "federated catalog") and we should
>>> discuss our options here.
>>>
>>> But the crucial point is that the user interaction doesn't need to be
>>> completely decided in order to move forward. A design for multi-catalog
>>> support isn't what we need right now; we need an API that plugins can
>>> implement to expose table operations.
>>>
>>> I've proposed that API, TableCatalog, and a way to manage catalog
>>> plugins. I've made an argument for why I think that API is flexible enough
>>> for the task and still fairly simple.
>>>
>>> I think that we can add TableCatalog now and work on multi-catalog
>>> support incrementally, and I have yet to hear your argument for why that is
>>> not the case.
>>>
>>> rb
>>>
>>> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>>>
>>>> Hi, Ryan,
>>>>
>>>> Catalog is a really important component for Spark SQL or any analytics
>>>> platform, I have to emphasize. Thus, a careful design is needed to ensure
>>>> it works as expected. Based on my previous discussion with many community
>>>> members, Spark SQL needs a catalog interface so that we can mount multiple
>>>> external physical catalogs and they can be presented as a single logical
>>>> catalog [which is a so-called global federated catalog]. In the future, we
>>>> can use this interface to develop our own catalog (instead of Hive
>>>> metastore) for more efficient metadata management. We can also plug in ACL
>>>> management if needed.
>>>>
>>>> Based on your previous answers, it sounds like you have many ideas in
>>>> your mind about building a Catalog interface for Spark SQL, but it is not
>>>> shown in the design doc. Could you write them down in a single doc? We can
>>>> try to leave comments in the design doc, instead of discussing various
>>>> issues in PRs, emails and meetings. It can also help the whole community
>>>> understand your proposal and post their comments.
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>>
>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>>>>
>>>>> Xiao,
>>>>>
>>>>> For the questions in this last email about how catalogs interact and
>>>>> how functions and other future features work: we discussed those last
>>>>> night. As I said then, I think that the right approach is incremental. We
>>>>> don’t want to design all of that in one gigantic proposal up front. To do
>>>>> that is to put ourselves into analysis paralysis.
>>>>>
>>>>> We don’t have a design for how catalogs interact with one another, but
>>>>> I think we made a strong case for two points: first, that the proposed
>>>>> structure doesn’t preclude any of those future decisions (hence we should
>>>>> proceed incrementally). Second, that those situations aren’t that hard to
>>>>> think through if you’re concerned about them: functions that can run in
>>>>> Spark can be run on any data, functions that run in external sources cannot
>>>>> be run on any data.
>>>>>
>>>>> You’re right that I haven’t completely covered your *new* questions.
>>>>> But to the questions in your first email:
>>>>>
>>>>>    - You asked how, for example, Glue may be plugged in. That is well
>>>>>    covered in the PR that adds catalogs as a plugin
>>>>>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>>>>>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>>>>>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>>>>>    The short answer is that implementations are configured with Spark config
>>>>>    properties and loaded with reflection.
>>>>>    - You asked how users implement an external catalog without adding
>>>>>    new data sources. That’s also covered in the “Multiple catalog support”
>>>>>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>>>>>    The answer is that a catalog returns a table instance that implements the
>>>>>    various interfaces from Wenchen’s work. A table may implement them directly
>>>>>    or return other existing implementations. Here’s how it worked in
>>>>>    the old API
>>>>>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>>>>>    .
>>>>>
>>>>> I hope that you don’t think I expect you to go “without seeing the
>>>>> design”!
>>>>>
>>>>> rb
>>>>>
>>>>> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>>>>>
>>>>>> Ryan,
>>>>>>
>>>>>> All the proposal I read is only related to Table metadata. Catalog
>>>>>> contains the metadata of database, functions, columns, views, and so on.
>>>>>> When we have multiple catalogs, how these catalogs interact with each
>>>>>> other? How the global catalog works? How a view, table, function, database
>>>>>> and column is resolved? Do we have nickname, mapping, wrapper?
>>>>>>
>>>>>> Or I might miss the design docs you send? Could you post the doc?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>>>>>
>>>>>>> Xiao,
>>>>>>>
>>>>>>> Please have a look at the pull requests and documents I've posted
>>>>>>> over the last few months.
>>>>>>>
>>>>>>> If you still have questions about how you might plug in Glue, let me
>>>>>>> know and I can clarify.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ryan,
>>>>>>>>
>>>>>>>> Thanks for leading the discussion and sending out the memo!
>>>>>>>>
>>>>>>>>
>>>>>>>>> Xiao suggested that there are restrictions for how tables and
>>>>>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>>
>>>>>>>>
>>>>>>>> Anything is possible. It depends on how we design the two
>>>>>>>> interfaces. Now, most parts are unknown to me without seeing the design.
>>>>>>>>
>>>>>>>> I think we need to see the user stories, and high-level design
>>>>>>>> before working on a small portion of Catalog federation. We do not need an
>>>>>>>> exhaustive design in the current stage, but we need to know how the new
>>>>>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>>>>>> plug in a Glue? How do users implement a new external catalog without
>>>>>>>> adding any new data sources? Without knowing more details, it is hard to
>>>>>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Here are my notes from last night’s sync. Some attendees that
>>>>>>>>> joined during discussion may be missing, since I made the list while we
>>>>>>>>> were waiting for people to join.
>>>>>>>>>
>>>>>>>>> If you have topic suggestions for the next sync, please start
>>>>>>>>> sending them to me. Thank you!
>>>>>>>>>
>>>>>>>>> *Attendees:*
>>>>>>>>>
>>>>>>>>> Ryan Blue
>>>>>>>>> John Zhuge
>>>>>>>>> Jamison Bennett
>>>>>>>>> Yuanjian Li
>>>>>>>>> Xiao Li
>>>>>>>>> stczwd
>>>>>>>>> Matt Cheah
>>>>>>>>> Wenchen Fan
>>>>>>>>> Genglian Wang
>>>>>>>>> Kevin Yu
>>>>>>>>> Maryann Xue
>>>>>>>>> Cody Koeninger
>>>>>>>>> Bruce Robbins
>>>>>>>>> Rohit Karlupia
>>>>>>>>>
>>>>>>>>> *Agenda:*
>>>>>>>>>
>>>>>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>>>>>    - TableCatalog proposal
>>>>>>>>>    - CatalogTableIdentifier
>>>>>>>>>
>>>>>>>>> *Notes:*
>>>>>>>>>
>>>>>>>>>    - Discussion about PR #23086
>>>>>>>>>       - Where should the catalog API live since it needs to be
>>>>>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>>>>>       interfaces, making catalyst depend on it
>>>>>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>>>>>    - In discussion about #23086, Xiao asked how adding catalog to
>>>>>>>>>    a table identifier will work
>>>>>>>>>       - Background from Ryan: existing code paths use
>>>>>>>>>       TableIdentifier and don’t expect a catalog portion. If an identifier with a
>>>>>>>>>       catalog were passed to existing code, that code may use the default catalog
>>>>>>>>>       not knowing that a different one was requested, which would be incorrect
>>>>>>>>>       behavior.
>>>>>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses
>>>>>>>>>       this problem. TableIdentifier is used for identifiers that have no catalog
>>>>>>>>>       set. By enforcing that requirement, passing a TableIdentifier to old code
>>>>>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>>>>>       TableIdentifier because the catalog is already determined.
>>>>>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in
>>>>>>>>>    the same way as CatalogTableIdentifier.
>>>>>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>>>>>    here, feel free to add your take in a reply]
>>>>>>>>>       - Xiao suggested that there are restrictions for how tables
>>>>>>>>>       and functions interact. Because of this, he doesn’t think that separate
>>>>>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>>>       - Wenchen and Ryan think that functions should be
>>>>>>>>>       orthogonal to data sources
>>>>>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>>>>>       later
>>>>>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>>>>>       some databases for views from different sources
>>>>>>>>>       - There was some discussion about when functions or views
>>>>>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>>>>>       - Xiao would like a full catalog replacement design,
>>>>>>>>>       including views, databases, and functions and how they interact, before
>>>>>>>>>       moving forward with the proposed TableCatalog API
>>>>>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>>>>>       compatible with future decisions and the best path forward is to build
>>>>>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>>>>>
>>>>>>>>>> We have a few topics left over from last time to cover. A few
>>>>>>>>>> people wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>>>>>
>>>>>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code
>>>>>>>>>>    paths and avoid unintended behavior changes
>>>>>>>>>>
>>>>>>>>>> As I noted in the summary last time, please send topics ahead of
>>>>>>>>>> time so we can get started more quickly.
>>>>>>>>>>
>>>>>>>>>> If you would like to be added to the google hangout invite,
>>>>>>>>>> please let me know and I’ll add you. Thanks!
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I try to avoid discussing each specific topic about the catalog federation
before we deciding the framework of multi-catalog supports.

I’ve tried to open discussions on this for the last 6+ months because we
need it. I understand that you’d like a comprehensive plan for supporting
more than one catalog before moving forward, but I think most of us are
okay with the incremental approach. It’s better to make progress toward the
goal.

In general, data source API V2 and catalog API should be orthogonal
I agree with you, and they are. The API that Wenchen is working on for
reading and writing data and the TableCatalog API are orthogonal efforts.
As I said, they only overlap with the Table interface, and clearly tables
loaded from a catalog need to be able to plug into the read/write API.

The reason these two efforts are related is that the community voted
to standardize
logical plans
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
Those standard plans have well-defined behavior for operations like CTAS,
instead of relying on the data source plugin to do … something undefined.
To implement this, we need a way for Spark to create tables, drop tables,
etc. That’s why we need a way for sources to plug in Table-related catalog
operations. (Sorry if this was already clear. I know I talked about it at
length in the first v2 sync up.)

While the two APIs are orthogonal and serve different purposes,
implementing common operations requires that we have both.

I would not call it a table catalog. I do not expect the data source
should/need to implement a catalog. Since you might want an atomic CTAS, we
can improve the table metadata resolution logic to support it with
different resolution priorities. For example, try to get the metadata from
the external data source, if the table metadata is not available in the
catalog.

It sounds like your definition of a “catalog” is different. I think you’re
referring to a global catalog? Could you explain what you’re talking about
here?

I’m talking about an API to interface with an external data source, which I
think we need for the reasons I outlined above. I don’t care what we call
it, but your comment seems to hint that there would be an API to look up
tables in external sources. That’s the thing I’m talking about.

CatalogTableIdentifier: The PR is doing nothing but adding an interface.

Yes. I opened this PR to discuss how Spark should track tables from
different catalogs and avoid passing those references to code paths that
don’t support them. The use of table identifiers with a catalog part was
discussed in the “Multiple catalog support” thread. I’ve also brought it up
and pointed out how I think it should be used in syncs a couple of times.

Sorry if this discussion isn’t how you would have done it, but it’s a
fairly simple idea that I don’t think requires its own doc.

On Sat, Dec 1, 2018 at 5:12 PM Xiao Li <ga...@gmail.com> wrote:

> Hi, Ryan,
>
> I try to avoid discussing each specific topic about the catalog federation
> before we deciding the framework of multi-catalog supports.
>
> -  *CatalogTableIdentifier*: The PR
> https://github.com/apache/spark/pull/21978 is doing nothing but adding an
> interface. In the PR, we did not discuss how to resolve it, any restriction
> on the naming and what is a catalog.This requires more doc for explaining
> it. For example,
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
> Normally, we do not merge a PR without showing how to use it.
>
> - *TableCatalog*: First, I would not call it a table catalog. I do not
> expect the data source should/need to implement a catalog. Since you might
> want an atomic CTAS, we can improve the table metadata resolution logic to
> support it with different resolution priorities. For example, try to get
> the metadata from the external data source, if the table metadata is not
> available in the catalog. However, the catalog should do what the catalog
> is expected to do. If we follow what our data source API V2 is doing,
> basically, the data source is just a table. It is not related to database,
> view, or function. Mixing catalog with data source API V2 just makes the
> whole things more complex.
>
> In general, data source API V2 and catalog API should be orthogonal. I
> believe the data source API V2 and catalog APIs are two separate projects.
> Hopefully, you understand my concern. If we really want to mix them
> together, I want to read the design of your multi-catalog support and
> understand more details.
>
> Thanks,
>
> Xiao
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：
>
>> Xiao,
>>
>> I do have opinions about how multi-catalog support should work, but I
>> don't think we are at a point where there is consensus. That's why I've
>> started discussion threads and added the CatalogTableIdentifier PR instead
>> of a comprehensive design doc. You have opinions about how users should
>> interact with catalogs as well (your "federated catalog") and we should
>> discuss our options here.
>>
>> But the crucial point is that the user interaction doesn't need to be
>> completely decided in order to move forward. A design for multi-catalog
>> support isn't what we need right now; we need an API that plugins can
>> implement to expose table operations.
>>
>> I've proposed that API, TableCatalog, and a way to manage catalog
>> plugins. I've made an argument for why I think that API is flexible enough
>> for the task and still fairly simple.
>>
>> I think that we can add TableCatalog now and work on multi-catalog
>> support incrementally, and I have yet to hear your argument for why that is
>> not the case.
>>
>> rb
>>
>> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>>
>>> Hi, Ryan,
>>>
>>> Catalog is a really important component for Spark SQL or any analytics
>>> platform, I have to emphasize. Thus, a careful design is needed to ensure
>>> it works as expected. Based on my previous discussion with many community
>>> members, Spark SQL needs a catalog interface so that we can mount multiple
>>> external physical catalogs and they can be presented as a single logical
>>> catalog [which is a so-called global federated catalog]. In the future, we
>>> can use this interface to develop our own catalog (instead of Hive
>>> metastore) for more efficient metadata management. We can also plug in ACL
>>> management if needed.
>>>
>>> Based on your previous answers, it sounds like you have many ideas in
>>> your mind about building a Catalog interface for Spark SQL, but it is not
>>> shown in the design doc. Could you write them down in a single doc? We can
>>> try to leave comments in the design doc, instead of discussing various
>>> issues in PRs, emails and meetings. It can also help the whole community
>>> understand your proposal and post their comments.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>>
>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>>>
>>>> Xiao,
>>>>
>>>> For the questions in this last email about how catalogs interact and
>>>> how functions and other future features work: we discussed those last
>>>> night. As I said then, I think that the right approach is incremental. We
>>>> don’t want to design all of that in one gigantic proposal up front. To do
>>>> that is to put ourselves into analysis paralysis.
>>>>
>>>> We don’t have a design for how catalogs interact with one another, but
>>>> I think we made a strong case for two points: first, that the proposed
>>>> structure doesn’t preclude any of those future decisions (hence we should
>>>> proceed incrementally). Second, that those situations aren’t that hard to
>>>> think through if you’re concerned about them: functions that can run in
>>>> Spark can be run on any data, functions that run in external sources cannot
>>>> be run on any data.
>>>>
>>>> You’re right that I haven’t completely covered your *new* questions.
>>>> But to the questions in your first email:
>>>>
>>>>    - You asked how, for example, Glue may be plugged in. That is well
>>>>    covered in the PR that adds catalogs as a plugin
>>>>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>>>>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>>>>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>>>>    The short answer is that implementations are configured with Spark config
>>>>    properties and loaded with reflection.
>>>>    - You asked how users implement an external catalog without adding
>>>>    new data sources. That’s also covered in the “Multiple catalog support”
>>>>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>>>>    The answer is that a catalog returns a table instance that implements the
>>>>    various interfaces from Wenchen’s work. A table may implement them directly
>>>>    or return other existing implementations. Here’s how it worked in
>>>>    the old API
>>>>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>>>>    .
>>>>
>>>> I hope that you don’t think I expect you to go “without seeing the
>>>> design”!
>>>>
>>>> rb
>>>>
>>>> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> All the proposal I read is only related to Table metadata. Catalog
>>>>> contains the metadata of database, functions, columns, views, and so on.
>>>>> When we have multiple catalogs, how these catalogs interact with each
>>>>> other? How the global catalog works? How a view, table, function, database
>>>>> and column is resolved? Do we have nickname, mapping, wrapper?
>>>>>
>>>>> Or I might miss the design docs you send? Could you post the doc?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>>>>
>>>>>> Xiao,
>>>>>>
>>>>>> Please have a look at the pull requests and documents I've posted
>>>>>> over the last few months.
>>>>>>
>>>>>> If you still have questions about how you might plug in Glue, let me
>>>>>> know and I can clarify.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>>>>>>
>>>>>>> Ryan,
>>>>>>>
>>>>>>> Thanks for leading the discussion and sending out the memo!
>>>>>>>
>>>>>>>
>>>>>>>> Xiao suggested that there are restrictions for how tables and
>>>>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>
>>>>>>>
>>>>>>> Anything is possible. It depends on how we design the two
>>>>>>> interfaces. Now, most parts are unknown to me without seeing the design.
>>>>>>>
>>>>>>> I think we need to see the user stories, and high-level design
>>>>>>> before working on a small portion of Catalog federation. We do not need an
>>>>>>> exhaustive design in the current stage, but we need to know how the new
>>>>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>>>>> plug in a Glue? How do users implement a new external catalog without
>>>>>>> adding any new data sources? Without knowing more details, it is hard to
>>>>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Here are my notes from last night’s sync. Some attendees that
>>>>>>>> joined during discussion may be missing, since I made the list while we
>>>>>>>> were waiting for people to join.
>>>>>>>>
>>>>>>>> If you have topic suggestions for the next sync, please start
>>>>>>>> sending them to me. Thank you!
>>>>>>>>
>>>>>>>> *Attendees:*
>>>>>>>>
>>>>>>>> Ryan Blue
>>>>>>>> John Zhuge
>>>>>>>> Jamison Bennett
>>>>>>>> Yuanjian Li
>>>>>>>> Xiao Li
>>>>>>>> stczwd
>>>>>>>> Matt Cheah
>>>>>>>> Wenchen Fan
>>>>>>>> Genglian Wang
>>>>>>>> Kevin Yu
>>>>>>>> Maryann Xue
>>>>>>>> Cody Koeninger
>>>>>>>> Bruce Robbins
>>>>>>>> Rohit Karlupia
>>>>>>>>
>>>>>>>> *Agenda:*
>>>>>>>>
>>>>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>>>>    - TableCatalog proposal
>>>>>>>>    - CatalogTableIdentifier
>>>>>>>>
>>>>>>>> *Notes:*
>>>>>>>>
>>>>>>>>    - Discussion about PR #23086
>>>>>>>>       - Where should the catalog API live since it needs to be
>>>>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>>>>       interfaces, making catalyst depend on it
>>>>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>>>>    - In discussion about #23086, Xiao asked how adding catalog to
>>>>>>>>    a table identifier will work
>>>>>>>>       - Background from Ryan: existing code paths use
>>>>>>>>       TableIdentifier and don’t expect a catalog portion. If an identifier with a
>>>>>>>>       catalog were passed to existing code, that code may use the default catalog
>>>>>>>>       not knowing that a different one was requested, which would be incorrect
>>>>>>>>       behavior.
>>>>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses
>>>>>>>>       this problem. TableIdentifier is used for identifiers that have no catalog
>>>>>>>>       set. By enforcing that requirement, passing a TableIdentifier to old code
>>>>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>>>>       TableIdentifier because the catalog is already determined.
>>>>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in
>>>>>>>>    the same way as CatalogTableIdentifier.
>>>>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>>>>    here, feel free to add your take in a reply]
>>>>>>>>       - Xiao suggested that there are restrictions for how tables
>>>>>>>>       and functions interact. Because of this, he doesn’t think that separate
>>>>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>>       - Wenchen and Ryan think that functions should be orthogonal
>>>>>>>>       to data sources
>>>>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>>>>       later
>>>>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>>>>       some databases for views from different sources
>>>>>>>>       - There was some discussion about when functions or views
>>>>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>>>>       - Xiao would like a full catalog replacement design,
>>>>>>>>       including views, databases, and functions and how they interact, before
>>>>>>>>       moving forward with the proposed TableCatalog API
>>>>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>>>>       compatible with future decisions and the best path forward is to build
>>>>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>>>>
>>>>>>>>> We have a few topics left over from last time to cover. A few
>>>>>>>>> people wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>>>>
>>>>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code
>>>>>>>>>    paths and avoid unintended behavior changes
>>>>>>>>>
>>>>>>>>> As I noted in the summary last time, please send topics ahead of
>>>>>>>>> time so we can get started more quickly.
>>>>>>>>>
>>>>>>>>> If you would like to be added to the google hangout invite, please
>>>>>>>>> let me know and I’ll add you. Thanks!
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Hi, Ryan,

I try to avoid discussing each specific topic about the catalog federation
before we deciding the framework of multi-catalog supports.

-  *CatalogTableIdentifier*: The PR
https://github.com/apache/spark/pull/21978 is doing nothing but adding an
interface. In the PR, we did not discuss how to resolve it, any restriction
on the naming and what is a catalog.This requires more doc for explaining
it. For example,
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
Normally, we do not merge a PR without showing how to use it.

- *TableCatalog*: First, I would not call it a table catalog. I do not
expect the data source should/need to implement a catalog. Since you might
want an atomic CTAS, we can improve the table metadata resolution logic to
support it with different resolution priorities. For example, try to get
the metadata from the external data source, if the table metadata is not
available in the catalog. However, the catalog should do what the catalog
is expected to do. If we follow what our data source API V2 is doing,
basically, the data source is just a table. It is not related to database,
view, or function. Mixing catalog with data source API V2 just makes the
whole things more complex.

In general, data source API V2 and catalog API should be orthogonal. I
believe the data source API V2 and catalog APIs are two separate projects.
Hopefully, you understand my concern. If we really want to mix them
together, I want to read the design of your multi-catalog support and
understand more details.

Thanks,

Xiao




Ryan Blue <rb...@netflix.com> 于2018年12月1日周六 下午3:22写道：

> Xiao,
>
> I do have opinions about how multi-catalog support should work, but I
> don't think we are at a point where there is consensus. That's why I've
> started discussion threads and added the CatalogTableIdentifier PR instead
> of a comprehensive design doc. You have opinions about how users should
> interact with catalogs as well (your "federated catalog") and we should
> discuss our options here.
>
> But the crucial point is that the user interaction doesn't need to be
> completely decided in order to move forward. A design for multi-catalog
> support isn't what we need right now; we need an API that plugins can
> implement to expose table operations.
>
> I've proposed that API, TableCatalog, and a way to manage catalog plugins.
> I've made an argument for why I think that API is flexible enough for the
> task and still fairly simple.
>
> I think that we can add TableCatalog now and work on multi-catalog support
> incrementally, and I have yet to hear your argument for why that is not the
> case.
>
> rb
>
> On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:
>
>> Hi, Ryan,
>>
>> Catalog is a really important component for Spark SQL or any analytics
>> platform, I have to emphasize. Thus, a careful design is needed to ensure
>> it works as expected. Based on my previous discussion with many community
>> members, Spark SQL needs a catalog interface so that we can mount multiple
>> external physical catalogs and they can be presented as a single logical
>> catalog [which is a so-called global federated catalog]. In the future, we
>> can use this interface to develop our own catalog (instead of Hive
>> metastore) for more efficient metadata management. We can also plug in ACL
>> management if needed.
>>
>> Based on your previous answers, it sounds like you have many ideas in
>> your mind about building a Catalog interface for Spark SQL, but it is not
>> shown in the design doc. Could you write them down in a single doc? We can
>> try to leave comments in the design doc, instead of discussing various
>> issues in PRs, emails and meetings. It can also help the whole community
>> understand your proposal and post their comments.
>>
>> Thanks,
>>
>> Xiao
>>
>>
>>
>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>>
>>> Xiao,
>>>
>>> For the questions in this last email about how catalogs interact and how
>>> functions and other future features work: we discussed those last night. As
>>> I said then, I think that the right approach is incremental. We don’t want
>>> to design all of that in one gigantic proposal up front. To do that is to
>>> put ourselves into analysis paralysis.
>>>
>>> We don’t have a design for how catalogs interact with one another, but I
>>> think we made a strong case for two points: first, that the proposed
>>> structure doesn’t preclude any of those future decisions (hence we should
>>> proceed incrementally). Second, that those situations aren’t that hard to
>>> think through if you’re concerned about them: functions that can run in
>>> Spark can be run on any data, functions that run in external sources cannot
>>> be run on any data.
>>>
>>> You’re right that I haven’t completely covered your *new* questions.
>>> But to the questions in your first email:
>>>
>>>    - You asked how, for example, Glue may be plugged in. That is well
>>>    covered in the PR that adds catalogs as a plugin
>>>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>>>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>>>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>>>    The short answer is that implementations are configured with Spark config
>>>    properties and loaded with reflection.
>>>    - You asked how users implement an external catalog without adding
>>>    new data sources. That’s also covered in the “Multiple catalog support”
>>>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>>>    The answer is that a catalog returns a table instance that implements the
>>>    various interfaces from Wenchen’s work. A table may implement them directly
>>>    or return other existing implementations. Here’s how it worked in
>>>    the old API
>>>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>>>    .
>>>
>>> I hope that you don’t think I expect you to go “without seeing the
>>> design”!
>>>
>>> rb
>>>
>>> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>>>
>>>> Ryan,
>>>>
>>>> All the proposal I read is only related to Table metadata. Catalog
>>>> contains the metadata of database, functions, columns, views, and so on.
>>>> When we have multiple catalogs, how these catalogs interact with each
>>>> other? How the global catalog works? How a view, table, function, database
>>>> and column is resolved? Do we have nickname, mapping, wrapper?
>>>>
>>>> Or I might miss the design docs you send? Could you post the doc?
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>>
>>>>
>>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>>>
>>>>> Xiao,
>>>>>
>>>>> Please have a look at the pull requests and documents I've posted over
>>>>> the last few months.
>>>>>
>>>>> If you still have questions about how you might plug in Glue, let me
>>>>> know and I can clarify.
>>>>>
>>>>> rb
>>>>>
>>>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>>>>>
>>>>>> Ryan,
>>>>>>
>>>>>> Thanks for leading the discussion and sending out the memo!
>>>>>>
>>>>>>
>>>>>>> Xiao suggested that there are restrictions for how tables and
>>>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>
>>>>>>
>>>>>> Anything is possible. It depends on how we design the two interfaces.
>>>>>> Now, most parts are unknown to me without seeing the design.
>>>>>>
>>>>>> I think we need to see the user stories, and high-level design before
>>>>>> working on a small portion of Catalog federation. We do not need an
>>>>>> exhaustive design in the current stage, but we need to know how the new
>>>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>>>> plug in a Glue? How do users implement a new external catalog without
>>>>>> adding any new data sources? Without knowing more details, it is hard to
>>>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Here are my notes from last night’s sync. Some attendees that joined
>>>>>>> during discussion may be missing, since I made the list while we were
>>>>>>> waiting for people to join.
>>>>>>>
>>>>>>> If you have topic suggestions for the next sync, please start
>>>>>>> sending them to me. Thank you!
>>>>>>>
>>>>>>> *Attendees:*
>>>>>>>
>>>>>>> Ryan Blue
>>>>>>> John Zhuge
>>>>>>> Jamison Bennett
>>>>>>> Yuanjian Li
>>>>>>> Xiao Li
>>>>>>> stczwd
>>>>>>> Matt Cheah
>>>>>>> Wenchen Fan
>>>>>>> Genglian Wang
>>>>>>> Kevin Yu
>>>>>>> Maryann Xue
>>>>>>> Cody Koeninger
>>>>>>> Bruce Robbins
>>>>>>> Rohit Karlupia
>>>>>>>
>>>>>>> *Agenda:*
>>>>>>>
>>>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>>>    - TableCatalog proposal
>>>>>>>    - CatalogTableIdentifier
>>>>>>>
>>>>>>> *Notes:*
>>>>>>>
>>>>>>>    - Discussion about PR #23086
>>>>>>>       - Where should the catalog API live since it needs to be
>>>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>>>       interfaces, making catalyst depend on it
>>>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>>>>>    table identifier will work
>>>>>>>       - Background from Ryan: existing code paths use
>>>>>>>       TableIdentifier and don’t expect a catalog portion. If an identifier with a
>>>>>>>       catalog were passed to existing code, that code may use the default catalog
>>>>>>>       not knowing that a different one was requested, which would be incorrect
>>>>>>>       behavior.
>>>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses
>>>>>>>       this problem. TableIdentifier is used for identifiers that have no catalog
>>>>>>>       set. By enforcing that requirement, passing a TableIdentifier to old code
>>>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>>>       TableIdentifier because the catalog is already determined.
>>>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in
>>>>>>>    the same way as CatalogTableIdentifier.
>>>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>>>    here, feel free to add your take in a reply]
>>>>>>>       - Xiao suggested that there are restrictions for how tables
>>>>>>>       and functions interact. Because of this, he doesn’t think that separate
>>>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>>       - Wenchen and Ryan think that functions should be orthogonal
>>>>>>>       to data sources
>>>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>>>       later
>>>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>>>       some databases for views from different sources
>>>>>>>       - There was some discussion about when functions or views
>>>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>>>       - Xiao would like a full catalog replacement design,
>>>>>>>       including views, databases, and functions and how they interact, before
>>>>>>>       moving forward with the proposed TableCatalog API
>>>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>>>       compatible with future decisions and the best path forward is to build
>>>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>>>
>>>>>>>> We have a few topics left over from last time to cover. A few
>>>>>>>> people wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>>>
>>>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths
>>>>>>>>    and avoid unintended behavior changes
>>>>>>>>
>>>>>>>> As I noted in the summary last time, please send topics ahead of
>>>>>>>> time so we can get started more quickly.
>>>>>>>>
>>>>>>>> If you would like to be added to the google hangout invite, please
>>>>>>>> let me know and I’ll add you. Thanks!
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Xiao,

I do have opinions about how multi-catalog support should work, but I don't
think we are at a point where there is consensus. That's why I've started
discussion threads and added the CatalogTableIdentifier PR instead of a
comprehensive design doc. You have opinions about how users should interact
with catalogs as well (your "federated catalog") and we should discuss our
options here.

But the crucial point is that the user interaction doesn't need to be
completely decided in order to move forward. A design for multi-catalog
support isn't what we need right now; we need an API that plugins can
implement to expose table operations.

I've proposed that API, TableCatalog, and a way to manage catalog plugins.
I've made an argument for why I think that API is flexible enough for the
task and still fairly simple.

I think that we can add TableCatalog now and work on multi-catalog support
incrementally, and I have yet to hear your argument for why that is not the
case.

rb

On Sat, Dec 1, 2018 at 12:36 PM Xiao Li <ga...@gmail.com> wrote:

> Hi, Ryan,
>
> Catalog is a really important component for Spark SQL or any analytics
> platform, I have to emphasize. Thus, a careful design is needed to ensure
> it works as expected. Based on my previous discussion with many community
> members, Spark SQL needs a catalog interface so that we can mount multiple
> external physical catalogs and they can be presented as a single logical
> catalog [which is a so-called global federated catalog]. In the future, we
> can use this interface to develop our own catalog (instead of Hive
> metastore) for more efficient metadata management. We can also plug in ACL
> management if needed.
>
> Based on your previous answers, it sounds like you have many ideas in your
> mind about building a Catalog interface for Spark SQL, but it is not shown
> in the design doc. Could you write them down in a single doc? We can try to
> leave comments in the design doc, instead of discussing various issues in
> PRs, emails and meetings. It can also help the whole community understand
> your proposal and post their comments.
>
> Thanks,
>
> Xiao
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：
>
>> Xiao,
>>
>> For the questions in this last email about how catalogs interact and how
>> functions and other future features work: we discussed those last night. As
>> I said then, I think that the right approach is incremental. We don’t want
>> to design all of that in one gigantic proposal up front. To do that is to
>> put ourselves into analysis paralysis.
>>
>> We don’t have a design for how catalogs interact with one another, but I
>> think we made a strong case for two points: first, that the proposed
>> structure doesn’t preclude any of those future decisions (hence we should
>> proceed incrementally). Second, that those situations aren’t that hard to
>> think through if you’re concerned about them: functions that can run in
>> Spark can be run on any data, functions that run in external sources cannot
>> be run on any data.
>>
>> You’re right that I haven’t completely covered your *new* questions. But
>> to the questions in your first email:
>>
>>    - You asked how, for example, Glue may be plugged in. That is well
>>    covered in the PR that adds catalogs as a plugin
>>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>>    The short answer is that implementations are configured with Spark config
>>    properties and loaded with reflection.
>>    - You asked how users implement an external catalog without adding
>>    new data sources. That’s also covered in the “Multiple catalog support”
>>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>>    The answer is that a catalog returns a table instance that implements the
>>    various interfaces from Wenchen’s work. A table may implement them directly
>>    or return other existing implementations. Here’s how it worked in the
>>    old API
>>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>>    .
>>
>> I hope that you don’t think I expect you to go “without seeing the
>> design”!
>>
>> rb
>>
>> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>>
>>> Ryan,
>>>
>>> All the proposal I read is only related to Table metadata. Catalog
>>> contains the metadata of database, functions, columns, views, and so on.
>>> When we have multiple catalogs, how these catalogs interact with each
>>> other? How the global catalog works? How a view, table, function, database
>>> and column is resolved? Do we have nickname, mapping, wrapper?
>>>
>>> Or I might miss the design docs you send? Could you post the doc?
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>>
>>>> Xiao,
>>>>
>>>> Please have a look at the pull requests and documents I've posted over
>>>> the last few months.
>>>>
>>>> If you still have questions about how you might plug in Glue, let me
>>>> know and I can clarify.
>>>>
>>>> rb
>>>>
>>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> Thanks for leading the discussion and sending out the memo!
>>>>>
>>>>>
>>>>>> Xiao suggested that there are restrictions for how tables and
>>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>>
>>>>>
>>>>> Anything is possible. It depends on how we design the two interfaces.
>>>>> Now, most parts are unknown to me without seeing the design.
>>>>>
>>>>> I think we need to see the user stories, and high-level design before
>>>>> working on a small portion of Catalog federation. We do not need an
>>>>> exhaustive design in the current stage, but we need to know how the new
>>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>>> plug in a Glue? How do users implement a new external catalog without
>>>>> adding any new data sources? Without knowing more details, it is hard to
>>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Here are my notes from last night’s sync. Some attendees that joined
>>>>>> during discussion may be missing, since I made the list while we were
>>>>>> waiting for people to join.
>>>>>>
>>>>>> If you have topic suggestions for the next sync, please start sending
>>>>>> them to me. Thank you!
>>>>>>
>>>>>> *Attendees:*
>>>>>>
>>>>>> Ryan Blue
>>>>>> John Zhuge
>>>>>> Jamison Bennett
>>>>>> Yuanjian Li
>>>>>> Xiao Li
>>>>>> stczwd
>>>>>> Matt Cheah
>>>>>> Wenchen Fan
>>>>>> Genglian Wang
>>>>>> Kevin Yu
>>>>>> Maryann Xue
>>>>>> Cody Koeninger
>>>>>> Bruce Robbins
>>>>>> Rohit Karlupia
>>>>>>
>>>>>> *Agenda:*
>>>>>>
>>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>>    - TableCatalog proposal
>>>>>>    - CatalogTableIdentifier
>>>>>>
>>>>>> *Notes:*
>>>>>>
>>>>>>    - Discussion about PR #23086
>>>>>>       - Where should the catalog API live since it needs to be
>>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>>       interfaces, making catalyst depend on it
>>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>>>>    table identifier will work
>>>>>>       - Background from Ryan: existing code paths use
>>>>>>       TableIdentifier and don’t expect a catalog portion. If an identifier with a
>>>>>>       catalog were passed to existing code, that code may use the default catalog
>>>>>>       not knowing that a different one was requested, which would be incorrect
>>>>>>       behavior.
>>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>>>>>       problem. TableIdentifier is used for identifiers that have no catalog set.
>>>>>>       By enforcing that requirement, passing a TableIdentifier to old code
>>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>>       TableIdentifier because the catalog is already determined.
>>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in
>>>>>>    the same way as CatalogTableIdentifier.
>>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>>    here, feel free to add your take in a reply]
>>>>>>       - Xiao suggested that there are restrictions for how tables
>>>>>>       and functions interact. Because of this, he doesn’t think that separate
>>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>>       - Wenchen and Ryan think that functions should be orthogonal
>>>>>>       to data sources
>>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>>       later
>>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>>       some databases for views from different sources
>>>>>>       - There was some discussion about when functions or views
>>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>>       - Xiao would like a full catalog replacement design, including
>>>>>>       views, databases, and functions and how they interact, before moving
>>>>>>       forward with the proposed TableCatalog API
>>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>>       compatible with future decisions and the best path forward is to build
>>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>>
>>>>>>> We have a few topics left over from last time to cover. A few people
>>>>>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>>
>>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths
>>>>>>>    and avoid unintended behavior changes
>>>>>>>
>>>>>>> As I noted in the summary last time, please send topics ahead of
>>>>>>> time so we can get started more quickly.
>>>>>>>
>>>>>>> If you would like to be added to the google hangout invite, please
>>>>>>> let me know and I’ll add you. Thanks!
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Hi, Ryan,

Catalog is a really important component for Spark SQL or any analytics
platform, I have to emphasize. Thus, a careful design is needed to ensure
it works as expected. Based on my previous discussion with many community
members, Spark SQL needs a catalog interface so that we can mount multiple
external physical catalogs and they can be presented as a single logical
catalog [which is a so-called global federated catalog]. In the future, we
can use this interface to develop our own catalog (instead of Hive
metastore) for more efficient metadata management. We can also plug in ACL
management if needed.

Based on your previous answers, it sounds like you have many ideas in your
mind about building a Catalog interface for Spark SQL, but it is not shown
in the design doc. Could you write them down in a single doc? We can try to
leave comments in the design doc, instead of discussing various issues in
PRs, emails and meetings. It can also help the whole community understand
your proposal and post their comments.

Thanks,

Xiao



Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午7:06写道：

> Xiao,
>
> For the questions in this last email about how catalogs interact and how
> functions and other future features work: we discussed those last night. As
> I said then, I think that the right approach is incremental. We don’t want
> to design all of that in one gigantic proposal up front. To do that is to
> put ourselves into analysis paralysis.
>
> We don’t have a design for how catalogs interact with one another, but I
> think we made a strong case for two points: first, that the proposed
> structure doesn’t preclude any of those future decisions (hence we should
> proceed incrementally). Second, that those situations aren’t that hard to
> think through if you’re concerned about them: functions that can run in
> Spark can be run on any data, functions that run in external sources cannot
> be run on any data.
>
> You’re right that I haven’t completely covered your *new* questions. But
> to the questions in your first email:
>
>    - You asked how, for example, Glue may be plugged in. That is well
>    covered in the PR that adds catalogs as a plugin
>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>    The short answer is that implementations are configured with Spark config
>    properties and loaded with reflection.
>    - You asked how users implement an external catalog without adding new
>    data sources. That’s also covered in the “Multiple catalog support”
>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>    The answer is that a catalog returns a table instance that implements the
>    various interfaces from Wenchen’s work. A table may implement them directly
>    or return other existing implementations. Here’s how it worked in the
>    old API
>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
>    .
>
> I hope that you don’t think I expect you to go “without seeing the design”!
>
> rb
>
> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:
>
>> Ryan,
>>
>> All the proposal I read is only related to Table metadata. Catalog
>> contains the metadata of database, functions, columns, views, and so on.
>> When we have multiple catalogs, how these catalogs interact with each
>> other? How the global catalog works? How a view, table, function, database
>> and column is resolved? Do we have nickname, mapping, wrapper?
>>
>> Or I might miss the design docs you send? Could you post the doc?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>>
>>
>> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>>
>>> Xiao,
>>>
>>> Please have a look at the pull requests and documents I've posted over
>>> the last few months.
>>>
>>> If you still have questions about how you might plug in Glue, let me
>>> know and I can clarify.
>>>
>>> rb
>>>
>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>>>
>>>> Ryan,
>>>>
>>>> Thanks for leading the discussion and sending out the memo!
>>>>
>>>>
>>>>> Xiao suggested that there are restrictions for how tables and
>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>
>>>>
>>>> Anything is possible. It depends on how we design the two interfaces.
>>>> Now, most parts are unknown to me without seeing the design.
>>>>
>>>> I think we need to see the user stories, and high-level design before
>>>> working on a small portion of Catalog federation. We do not need an
>>>> exhaustive design in the current stage, but we need to know how the new
>>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>>> plug in a Glue? How do users implement a new external catalog without
>>>> adding any new data sources? Without knowing more details, it is hard to
>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Here are my notes from last night’s sync. Some attendees that joined
>>>>> during discussion may be missing, since I made the list while we were
>>>>> waiting for people to join.
>>>>>
>>>>> If you have topic suggestions for the next sync, please start sending
>>>>> them to me. Thank you!
>>>>>
>>>>> *Attendees:*
>>>>>
>>>>> Ryan Blue
>>>>> John Zhuge
>>>>> Jamison Bennett
>>>>> Yuanjian Li
>>>>> Xiao Li
>>>>> stczwd
>>>>> Matt Cheah
>>>>> Wenchen Fan
>>>>> Genglian Wang
>>>>> Kevin Yu
>>>>> Maryann Xue
>>>>> Cody Koeninger
>>>>> Bruce Robbins
>>>>> Rohit Karlupia
>>>>>
>>>>> *Agenda:*
>>>>>
>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>    - TableCatalog proposal
>>>>>    - CatalogTableIdentifier
>>>>>
>>>>> *Notes:*
>>>>>
>>>>>    - Discussion about PR #23086
>>>>>       - Where should the catalog API live since it needs to be
>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>       interfaces, making catalyst depend on it
>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>>>    table identifier will work
>>>>>       - Background from Ryan: existing code paths use TableIdentifier
>>>>>       and don’t expect a catalog portion. If an identifier with a catalog were
>>>>>       passed to existing code, that code may use the default catalog not knowing
>>>>>       that a different one was requested, which would be incorrect behavior.
>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>>>>       problem. TableIdentifier is used for identifiers that have no catalog set.
>>>>>       By enforcing that requirement, passing a TableIdentifier to old code
>>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>>       TableIdentifier because the catalog is already determined.
>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>>>>>    same way as CatalogTableIdentifier.
>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>>    here, feel free to add your take in a reply]
>>>>>       - Xiao suggested that there are restrictions for how tables and
>>>>>       functions interact. Because of this, he doesn’t think that separate
>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>       - Wenchen and Ryan think that functions should be orthogonal to
>>>>>       data sources
>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>>       later
>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>       some databases for views from different sources
>>>>>       - There was some discussion about when functions or views
>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>       Functions pushed to sources cannot necessarily be run on other sources and
>>>>>       Spark functions cannot necessarily be pushed down to sources.
>>>>>       - Xiao would like a full catalog replacement design, including
>>>>>       views, databases, and functions and how they interact, before moving
>>>>>       forward with the proposed TableCatalog API
>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>       compatible with future decisions and the best path forward is to build
>>>>>       incrementally. An exhaustive design process blocks progress on v2.
>>>>>
>>>>>
>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>
>>>>>> We have a few topics left over from last time to cover. A few people
>>>>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>
>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths
>>>>>>    and avoid unintended behavior changes
>>>>>>
>>>>>> As I noted in the summary last time, please send topics ahead of time
>>>>>> so we can get started more quickly.
>>>>>>
>>>>>> If you would like to be added to the google hangout invite, please
>>>>>> let me know and I’ll add you. Thanks!
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Xiao,

For the questions in this last email about how catalogs interact and how
functions and other future features work: we discussed those last night. As
I said then, I think that the right approach is incremental. We don’t want
to design all of that in one gigantic proposal up front. To do that is to
put ourselves into analysis paralysis.

We don’t have a design for how catalogs interact with one another, but I
think we made a strong case for two points: first, that the proposed
structure doesn’t preclude any of those future decisions (hence we should
proceed incrementally). Second, that those situations aren’t that hard to
think through if you’re concerned about them: functions that can run in
Spark can be run on any data, functions that run in external sources cannot
be run on any data.

You’re right that I haven’t completely covered your *new* questions. But to
the questions in your first email:

   - You asked how, for example, Glue may be plugged in. That is well
   covered in the PR that adds catalogs as a plugin
   <https://github.com/apache/spark/pull/21306#issue-187572913>, the
   response I sent to Wenchen’s questions, and the earlier discussion thread I
   posted to this list with the subject “[DISCUSS] Multiple catalog support”.
   The short answer is that implementations are configured with Spark config
   properties and loaded with reflection.
   - You asked how users implement an external catalog without adding new
   data sources. That’s also covered in the “Multiple catalog support”
   proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
   The answer is that a catalog returns a table instance that implements the
   various interfaces from Wenchen’s work. A table may implement them directly
   or return other existing implementations. Here’s how it worked in the
   old API
   <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>
   .

I hope that you don’t think I expect you to go “without seeing the design”!

rb

On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <ga...@gmail.com> wrote:

> Ryan,
>
> All the proposal I read is only related to Table metadata. Catalog
> contains the metadata of database, functions, columns, views, and so on.
> When we have multiple catalogs, how these catalogs interact with each
> other? How the global catalog works? How a view, table, function, database
> and column is resolved? Do we have nickname, mapping, wrapper?
>
> Or I might miss the design docs you send? Could you post the doc?
>
> Thanks,
>
> Xiao
>
>
>
>
> Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：
>
>> Xiao,
>>
>> Please have a look at the pull requests and documents I've posted over
>> the last few months.
>>
>> If you still have questions about how you might plug in Glue, let me know
>> and I can clarify.
>>
>> rb
>>
>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>>
>>> Ryan,
>>>
>>> Thanks for leading the discussion and sending out the memo!
>>>
>>>
>>>> Xiao suggested that there are restrictions for how tables and functions
>>>> interact. Because of this, he doesn’t think that separate TableCatalog and
>>>> FunctionCatalog APIs are feasible.
>>>
>>>
>>> Anything is possible. It depends on how we design the two interfaces.
>>> Now, most parts are unknown to me without seeing the design.
>>>
>>> I think we need to see the user stories, and high-level design before
>>> working on a small portion of Catalog federation. We do not need an
>>> exhaustive design in the current stage, but we need to know how the new
>>> proposal works. For example, how to plug in a new Hive metastore? How to
>>> plug in a Glue? How do users implement a new external catalog without
>>> adding any new data sources? Without knowing more details, it is hard to
>>> say whether this TableCatalog can satisfy all the requirements.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>>
>>>> Hi everyone,
>>>>
>>>> Here are my notes from last night’s sync. Some attendees that joined
>>>> during discussion may be missing, since I made the list while we were
>>>> waiting for people to join.
>>>>
>>>> If you have topic suggestions for the next sync, please start sending
>>>> them to me. Thank you!
>>>>
>>>> *Attendees:*
>>>>
>>>> Ryan Blue
>>>> John Zhuge
>>>> Jamison Bennett
>>>> Yuanjian Li
>>>> Xiao Li
>>>> stczwd
>>>> Matt Cheah
>>>> Wenchen Fan
>>>> Genglian Wang
>>>> Kevin Yu
>>>> Maryann Xue
>>>> Cody Koeninger
>>>> Bruce Robbins
>>>> Rohit Karlupia
>>>>
>>>> *Agenda:*
>>>>
>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>    - TableCatalog proposal
>>>>    - CatalogTableIdentifier
>>>>
>>>> *Notes:*
>>>>
>>>>    - Discussion about PR #23086
>>>>       - Where should the catalog API live since it needs to be
>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>       interfaces, making catalyst depend on it
>>>>       - Consensus was to use Wenchen’s suggestion
>>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>>    table identifier will work
>>>>       - Background from Ryan: existing code paths use TableIdentifier
>>>>       and don’t expect a catalog portion. If an identifier with a catalog were
>>>>       passed to existing code, that code may use the default catalog not knowing
>>>>       that a different one was requested, which would be incorrect behavior.
>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>>>       problem. TableIdentifier is used for identifiers that have no catalog set.
>>>>       By enforcing that requirement, passing a TableIdentifier to old code
>>>>       ensures that no catalogs leak into that code. This is also used when the
>>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>>       TableIdentifier because the catalog is already determined.
>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>>>>    same way as CatalogTableIdentifier.
>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>    - The remaining time was spent discussing whether the plan to
>>>>    incrementally replace the current catalog API will work. [Not great notes
>>>>    here, feel free to add your take in a reply]
>>>>       - Xiao suggested that there are restrictions for how tables and
>>>>       functions interact. Because of this, he doesn’t think that separate
>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>       - Wenchen and Ryan think that functions should be orthogonal to
>>>>       data sources
>>>>       - Matt and Ryan think that catalog design can be done
>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>>       later
>>>>       - [I forget who] pointed out that there are restrictions in some
>>>>       databases for views from different sources
>>>>       - There was some discussion about when functions or views cannot
>>>>       be orthogonal. For example, where the code runs is important. Functions
>>>>       pushed to sources cannot necessarily be run on other sources and Spark
>>>>       functions cannot necessarily be pushed down to sources.
>>>>       - Xiao would like a full catalog replacement design, including
>>>>       views, databases, and functions and how they interact, before moving
>>>>       forward with the proposed TableCatalog API
>>>>       - Ryan [and Matt, I think] think that TableCatalog is compatible
>>>>       with future decisions and the best path forward is to build incrementally.
>>>>       An exhaustive design process blocks progress on v2.
>>>>
>>>>
>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>
>>>>> We have a few topics left over from last time to cover. A few people
>>>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>
>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths
>>>>>    and avoid unintended behavior changes
>>>>>
>>>>> As I noted in the summary last time, please send topics ahead of time
>>>>> so we can get started more quickly.
>>>>>
>>>>> If you would like to be added to the google hangout invite, please let
>>>>> me know and I’ll add you. Thanks!
>>>>>
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Ryan,

All the proposal I read is only related to Table metadata. Catalog contains
the metadata of database, functions, columns, views, and so on. When we
have multiple catalogs, how these catalogs interact with each other? How
the global catalog works? How a view, table, function, database and column
is resolved? Do we have nickname, mapping, wrapper?

Or I might miss the design docs you send? Could you post the doc?

Thanks,

Xiao




Ryan Blue <rb...@netflix.com> 于2018年11月29日周四 下午3:06写道：

> Xiao,
>
> Please have a look at the pull requests and documents I've posted over the
> last few months.
>
> If you still have questions about how you might plug in Glue, let me know
> and I can clarify.
>
> rb
>
> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:
>
>> Ryan,
>>
>> Thanks for leading the discussion and sending out the memo!
>>
>>
>>> Xiao suggested that there are restrictions for how tables and functions
>>> interact. Because of this, he doesn’t think that separate TableCatalog and
>>> FunctionCatalog APIs are feasible.
>>
>>
>> Anything is possible. It depends on how we design the two interfaces.
>> Now, most parts are unknown to me without seeing the design.
>>
>> I think we need to see the user stories, and high-level design before
>> working on a small portion of Catalog federation. We do not need an
>> exhaustive design in the current stage, but we need to know how the new
>> proposal works. For example, how to plug in a new Hive metastore? How to
>> plug in a Glue? How do users implement a new external catalog without
>> adding any new data sources? Without knowing more details, it is hard to
>> say whether this TableCatalog can satisfy all the requirements.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>>
>>> Hi everyone,
>>>
>>> Here are my notes from last night’s sync. Some attendees that joined
>>> during discussion may be missing, since I made the list while we were
>>> waiting for people to join.
>>>
>>> If you have topic suggestions for the next sync, please start sending
>>> them to me. Thank you!
>>>
>>> *Attendees:*
>>>
>>> Ryan Blue
>>> John Zhuge
>>> Jamison Bennett
>>> Yuanjian Li
>>> Xiao Li
>>> stczwd
>>> Matt Cheah
>>> Wenchen Fan
>>> Genglian Wang
>>> Kevin Yu
>>> Maryann Xue
>>> Cody Koeninger
>>> Bruce Robbins
>>> Rohit Karlupia
>>>
>>> *Agenda:*
>>>
>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>    - TableCatalog proposal
>>>    - CatalogTableIdentifier
>>>
>>> *Notes:*
>>>
>>>    - Discussion about PR #23086
>>>       - Where should the catalog API live since it needs to be
>>>       accessible to catalyst rules, but the catalyst module is private?
>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>       interfaces, making catalyst depend on it
>>>       - Consensus was to use Wenchen’s suggestion
>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>    table identifier will work
>>>       - Background from Ryan: existing code paths use TableIdentifier
>>>       and don’t expect a catalog portion. If an identifier with a catalog were
>>>       passed to existing code, that code may use the default catalog not knowing
>>>       that a different one was requested, which would be incorrect behavior.
>>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>>       problem. TableIdentifier is used for identifiers that have no catalog set.
>>>       By enforcing that requirement, passing a TableIdentifier to old code
>>>       ensures that no catalogs leak into that code. This is also used when the
>>>       catalog is set from context. For example, the TableCatalog API accepts only
>>>       TableIdentifier because the catalog is already determined.
>>>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>>>    same way as CatalogTableIdentifier.
>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>    - The remaining time was spent discussing whether the plan to
>>>    incrementally replace the current catalog API will work. [Not great notes
>>>    here, feel free to add your take in a reply]
>>>       - Xiao suggested that there are restrictions for how tables and
>>>       functions interact. Because of this, he doesn’t think that separate
>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>       - Wenchen and Ryan think that functions should be orthogonal to
>>>       data sources
>>>       - Matt and Ryan think that catalog design can be done
>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>>       later
>>>       - [I forget who] pointed out that there are restrictions in some
>>>       databases for views from different sources
>>>       - There was some discussion about when functions or views cannot
>>>       be orthogonal. For example, where the code runs is important. Functions
>>>       pushed to sources cannot necessarily be run on other sources and Spark
>>>       functions cannot necessarily be pushed down to sources.
>>>       - Xiao would like a full catalog replacement design, including
>>>       views, databases, and functions and how they interact, before moving
>>>       forward with the proposed TableCatalog API
>>>       - Ryan [and Matt, I think] think that TableCatalog is compatible
>>>       with future decisions and the best path forward is to build incrementally.
>>>       An exhaustive design process blocks progress on v2.
>>>
>>>
>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I just sent out an invite for the next DSv2 community sync for
>>>> Wednesday, 28 Nov at 5PM PST.
>>>>
>>>> We have a few topics left over from last time to cover. A few people
>>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>>
>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths and
>>>>    avoid unintended behavior changes
>>>>
>>>> As I noted in the summary last time, please send topics ahead of time
>>>> so we can get started more quickly.
>>>>
>>>> If you would like to be added to the google hangout invite, please let
>>>> me know and I’ll add you. Thanks!
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Xiao,

Please have a look at the pull requests and documents I've posted over the
last few months.

If you still have questions about how you might plug in Glue, let me know
and I can clarify.

rb

On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <ga...@gmail.com> wrote:

> Ryan,
>
> Thanks for leading the discussion and sending out the memo!
>
>
>> Xiao suggested that there are restrictions for how tables and functions
>> interact. Because of this, he doesn’t think that separate TableCatalog and
>> FunctionCatalog APIs are feasible.
>
>
> Anything is possible. It depends on how we design the two interfaces. Now,
> most parts are unknown to me without seeing the design.
>
> I think we need to see the user stories, and high-level design before
> working on a small portion of Catalog federation. We do not need an
> exhaustive design in the current stage, but we need to know how the new
> proposal works. For example, how to plug in a new Hive metastore? How to
> plug in a Glue? How do users implement a new external catalog without
> adding any new data sources? Without knowing more details, it is hard to
> say whether this TableCatalog can satisfy all the requirements.
>
> Cheers,
>
> Xiao
>
>
> Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：
>
>> Hi everyone,
>>
>> Here are my notes from last night’s sync. Some attendees that joined
>> during discussion may be missing, since I made the list while we were
>> waiting for people to join.
>>
>> If you have topic suggestions for the next sync, please start sending
>> them to me. Thank you!
>>
>> *Attendees:*
>>
>> Ryan Blue
>> John Zhuge
>> Jamison Bennett
>> Yuanjian Li
>> Xiao Li
>> stczwd
>> Matt Cheah
>> Wenchen Fan
>> Genglian Wang
>> Kevin Yu
>> Maryann Xue
>> Cody Koeninger
>> Bruce Robbins
>> Rohit Karlupia
>>
>> *Agenda:*
>>
>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>    - TableCatalog proposal
>>    - CatalogTableIdentifier
>>
>> *Notes:*
>>
>>    - Discussion about PR #23086
>>       - Where should the catalog API live since it needs to be
>>       accessible to catalyst rules, but the catalyst module is private?
>>       - Wenchen suggested creating a sql-api module for v2 API
>>       interfaces, making catalyst depend on it
>>       - Consensus was to use Wenchen’s suggestion
>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>    table identifier will work
>>       - Background from Ryan: existing code paths use TableIdentifier
>>       and don’t expect a catalog portion. If an identifier with a catalog were
>>       passed to existing code, that code may use the default catalog not knowing
>>       that a different one was requested, which would be incorrect behavior.
>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>       problem. TableIdentifier is used for identifiers that have no catalog set.
>>       By enforcing that requirement, passing a TableIdentifier to old code
>>       ensures that no catalogs leak into that code. This is also used when the
>>       catalog is set from context. For example, the TableCatalog API accepts only
>>       TableIdentifier because the catalog is already determined.
>>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>>    same way as CatalogTableIdentifier.
>>       - Ryan: Yes, when a FunctionCatalog API is added
>>    - The remaining time was spent discussing whether the plan to
>>    incrementally replace the current catalog API will work. [Not great notes
>>    here, feel free to add your take in a reply]
>>       - Xiao suggested that there are restrictions for how tables and
>>       functions interact. Because of this, he doesn’t think that separate
>>       TableCatalog and FunctionCatalog APIs are feasible.
>>       - Wenchen and Ryan think that functions should be orthogonal to
>>       data sources
>>       - Matt and Ryan think that catalog design can be done
>>       incrementally as new interfaces (i.e. FunctionCatalog) are added and that
>>       the proposed TableCatalog does not preclude designing for Xiao’s concerns
>>       later
>>       - [I forget who] pointed out that there are restrictions in some
>>       databases for views from different sources
>>       - There was some discussion about when functions or views cannot
>>       be orthogonal. For example, where the code runs is important. Functions
>>       pushed to sources cannot necessarily be run on other sources and Spark
>>       functions cannot necessarily be pushed down to sources.
>>       - Xiao would like a full catalog replacement design, including
>>       views, databases, and functions and how they interact, before moving
>>       forward with the proposed TableCatalog API
>>       - Ryan [and Matt, I think] think that TableCatalog is compatible
>>       with future decisions and the best path forward is to build incrementally.
>>       An exhaustive design process blocks progress on v2.
>>
>>
>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I just sent out an invite for the next DSv2 community sync for
>>> Wednesday, 28 Nov at 5PM PST.
>>>
>>> We have a few topics left over from last time to cover. A few people
>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>
>>>    - The TableCatalog proposal (and other catalog APIs)
>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths and
>>>    avoid unintended behavior changes
>>>
>>> As I noted in the summary last time, please send topics ahead of time so
>>> we can get started more quickly.
>>>
>>> If you would like to be added to the google hangout invite, please let
>>> me know and I’ll add you. Thanks!
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Ryan,

Thanks for leading the discussion and sending out the memo!


> Xiao suggested that there are restrictions for how tables and functions
> interact. Because of this, he doesn’t think that separate TableCatalog and
> FunctionCatalog APIs are feasible.


Anything is possible. It depends on how we design the two interfaces. Now,
most parts are unknown to me without seeing the design.

I think we need to see the user stories, and high-level design before
working on a small portion of Catalog federation. We do not need an
exhaustive design in the current stage, but we need to know how the new
proposal works. For example, how to plug in a new Hive metastore? How to
plug in a Glue? How do users implement a new external catalog without
adding any new data sources? Without knowing more details, it is hard to
say whether this TableCatalog can satisfy all the requirements.

Cheers,

Xiao


Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：

> Hi everyone,
>
> Here are my notes from last night’s sync. Some attendees that joined
> during discussion may be missing, since I made the list while we were
> waiting for people to join.
>
> If you have topic suggestions for the next sync, please start sending them
> to me. Thank you!
>
> *Attendees:*
>
> Ryan Blue
> John Zhuge
> Jamison Bennett
> Yuanjian Li
> Xiao Li
> stczwd
> Matt Cheah
> Wenchen Fan
> Genglian Wang
> Kevin Yu
> Maryann Xue
> Cody Koeninger
> Bruce Robbins
> Rohit Karlupia
>
> *Agenda:*
>
>    - Follow-up issues or discussion on Wenchen’s PR #23086
>    - TableCatalog proposal
>    - CatalogTableIdentifier
>
> *Notes:*
>
>    - Discussion about PR #23086
>       - Where should the catalog API live since it needs to be accessible
>       to catalyst rules, but the catalyst module is private?
>       - Wenchen suggested creating a sql-api module for v2 API
>       interfaces, making catalyst depend on it
>       - Consensus was to use Wenchen’s suggestion
>    - In discussion about #23086, Xiao asked how adding catalog to a table
>    identifier will work
>       - Background from Ryan: existing code paths use TableIdentifier and
>       don’t expect a catalog portion. If an identifier with a catalog were passed
>       to existing code, that code may use the default catalog not knowing that a
>       different one was requested, which would be incorrect behavior.
>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>       problem. TableIdentifier is used for identifiers that have no catalog set.
>       By enforcing that requirement, passing a TableIdentifier to old code
>       ensures that no catalogs leak into that code. This is also used when the
>       catalog is set from context. For example, the TableCatalog API accepts only
>       TableIdentifier because the catalog is already determined.
>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>    same way as CatalogTableIdentifier.
>       - Ryan: Yes, when a FunctionCatalog API is added
>    - The remaining time was spent discussing whether the plan to
>    incrementally replace the current catalog API will work. [Not great notes
>    here, feel free to add your take in a reply]
>       - Xiao suggested that there are restrictions for how tables and
>       functions interact. Because of this, he doesn’t think that separate
>       TableCatalog and FunctionCatalog APIs are feasible.
>       - Wenchen and Ryan think that functions should be orthogonal to
>       data sources
>       - Matt and Ryan think that catalog design can be done incrementally
>       as new interfaces (i.e. FunctionCatalog) are added and that the proposed
>       TableCatalog does not preclude designing for Xiao’s concerns later
>       - [I forget who] pointed out that there are restrictions in some
>       databases for views from different sources
>       - There was some discussion about when functions or views cannot be
>       orthogonal. For example, where the code runs is important. Functions pushed
>       to sources cannot necessarily be run on other sources and Spark functions
>       cannot necessarily be pushed down to sources.
>       - Xiao would like a full catalog replacement design, including
>       views, databases, and functions and how they interact, before moving
>       forward with the proposed TableCatalog API
>       - Ryan [and Matt, I think] think that TableCatalog is compatible
>       with future decisions and the best path forward is to build incrementally.
>       An exhaustive design process blocks progress on v2.
>
>
> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Hi everyone,
>>
>> I just sent out an invite for the next DSv2 community sync for Wednesday,
>> 28 Nov at 5PM PST.
>>
>> We have a few topics left over from last time to cover. A few people
>> wanted to cover catalog APIs, so I put two items on the agenda:
>>
>>    - The TableCatalog proposal (and other catalog APIs)
>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths and
>>    avoid unintended behavior changes
>>
>> As I noted in the summary last time, please send topics ahead of time so
>> we can get started more quickly.
>>
>> If you would like to be added to the google hangout invite, please let me
>> know and I’ll add you. Thanks!
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DataSourceV2 community sync #3

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Hi everyone,

Here are my notes from last night’s sync. Some attendees that joined during
discussion may be missing, since I made the list while we were waiting for
people to join.

If you have topic suggestions for the next sync, please start sending them
to me. Thank you!

*Attendees:*

Ryan Blue
John Zhuge
Jamison Bennett
Yuanjian Li
Xiao Li
stczwd
Matt Cheah
Wenchen Fan
Genglian Wang
Kevin Yu
Maryann Xue
Cody Koeninger
Bruce Robbins
Rohit Karlupia

*Agenda:*

   - Follow-up issues or discussion on Wenchen’s PR #23086
   - TableCatalog proposal
   - CatalogTableIdentifier

*Notes:*

   - Discussion about PR #23086
      - Where should the catalog API live since it needs to be accessible
      to catalyst rules, but the catalyst module is private?
      - Wenchen suggested creating a sql-api module for v2 API interfaces,
      making catalyst depend on it
      - Consensus was to use Wenchen’s suggestion
   - In discussion about #23086, Xiao asked how adding catalog to a table
   identifier will work
      - Background from Ryan: existing code paths use TableIdentifier and
      don’t expect a catalog portion. If an identifier with a catalog
were passed
      to existing code, that code may use the default catalog not
knowing that a
      different one was requested, which would be incorrect behavior.
      - Ryan: The proposal for CatalogTableIdentifier addresses this
      problem. TableIdentifier is used for identifiers that have no
catalog set.
      By enforcing that requirement, passing a TableIdentifier to old code
      ensures that no catalogs leak into that code. This is also used when the
      catalog is set from context. For example, the TableCatalog API
accepts only
      TableIdentifier because the catalog is already determined.
   - Xiao asked whether FunctionIdentifier needs to be updated in the same
   way as CatalogTableIdentifier.
      - Ryan: Yes, when a FunctionCatalog API is added
   - The remaining time was spent discussing whether the plan to
   incrementally replace the current catalog API will work. [Not great notes
   here, feel free to add your take in a reply]
      - Xiao suggested that there are restrictions for how tables and
      functions interact. Because of this, he doesn’t think that separate
      TableCatalog and FunctionCatalog APIs are feasible.
      - Wenchen and Ryan think that functions should be orthogonal to data
      sources
      - Matt and Ryan think that catalog design can be done incrementally
      as new interfaces (i.e. FunctionCatalog) are added and that the proposed
      TableCatalog does not preclude designing for Xiao’s concerns later
      - [I forget who] pointed out that there are restrictions in some
      databases for views from different sources
      - There was some discussion about when functions or views cannot be
      orthogonal. For example, where the code runs is important.
Functions pushed
      to sources cannot necessarily be run on other sources and Spark functions
      cannot necessarily be pushed down to sources.
      - Xiao would like a full catalog replacement design, including views,
      databases, and functions and how they interact, before moving
forward with
      the proposed TableCatalog API
      - Ryan [and Matt, I think] think that TableCatalog is compatible with
      future decisions and the best path forward is to build incrementally. An
      exhaustive design process blocks progress on v2.


On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:

> Hi everyone,
>
> I just sent out an invite for the next DSv2 community sync for Wednesday,
> 28 Nov at 5PM PST.
>
> We have a few topics left over from last time to cover. A few people
> wanted to cover catalog APIs, so I put two items on the agenda:
>
>    - The TableCatalog proposal (and other catalog APIs)
>    - Using CatalogTableIdentifier to separate v1 and v2 code paths and
>    avoid unintended behavior changes
>
> As I noted in the summary last time, please send topics ahead of time so
> we can get started more quickly.
>
> If you would like to be added to the google hangout invite, please let me
> know and I’ll add you. Thanks!
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Xiao Li <ga...@gmail.com>.

Based on my understanding, we are not inventing anything new here.
Basically, we are building a federated database system especially after we
supporting multiple catalog. There are many mature commercial products in
the market. For example,
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.swg.im.iis.db.fed.overview.doc/topics/cfpint25.html
is the link for explaining the federated catalog in DB2.

Could we answer the difference and tradeoff in the new proposal, compared
with these existing system? If different, what is the motivation we do not
want to follow the traditional solution? If the same, do we have some
user-facing APIs?

Cheers,

Xiao



Wenchen Fan <cl...@gmail.com> 于2018年11月28日周三 下午6:41写道：

> Hi Ryan,
>
> Thanks for hosting the discussion! I think the table catalog is super
> useful, but since this is the first time we allow users to extend catalog,
> it's better to write down some details from end-user APIs to internal
> management.
> 1. How would end-users register/unregister catalog with SQL API and
> Scala/Java API?
> 2. How would end-users manage catalogs? like LIST CATALOGS, USE CATALOG
> xyz?
> 3. How to separate the abilities of catalog? Can we create a bunch of
> mixin triats for catalog API like SupportsTable, SupportsFunction,
> SupportsView, etc.?
> 4. How should Spark resolve identifies with catalog name? How to resolve
> ambiguity? What if the catalog doesn't support database? Can users write
> `catalogName.tblName` directly?
> 5. Where does Spark store the catalog list? In an in-memory map?
> 6. How to support atomic CTAS?
> 7. The data/schema of table may change over time, when should Spark
> determine the table content? During analysis or planning?
> 8. ...
>
> Since the catalog API is not only developer facing, but also user-facing,
> I think it's better to have a doc explaining what the developers concern
> and what the end users concern. The doc is also good for future reference,
> and can be used in release notes.
>
> Thanks,
> Wenchen
>
> On Wed, Nov 28, 2018 at 12:54 PM JackyLee <qc...@163.com> wrote:
>
>> +1
>>
>> Please add me to the Google Hangout invite.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>

Catalog API discussion (was: DataSourceV2 community sync #3)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Hi Wenchen,
I’ll add my responses inline. The answers are based on the proposed
TableCatalog API:

   - SPIP: Spark table metadata
   <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
   - PR #21306 <https://github.com/apache/spark/pull/21306>

On Wed, Nov 28, 2018 at 6:41 PM Wenchen Fan cloud0fan@gmail.com
<ht...@gmail.com> wrote:

Thanks for hosting the discussion! I think the table catalog is super
> useful, but since this is the first time we allow users to extend catalog,
> it's better to write down some details from end-user APIs to internal
> management.
> 1. How would end-users register/unregister catalog with SQL API and
> Scala/Java API?
>
In the PR, users or administrators create catalogs by setting properties in
the SQL conf. To create and configure a test catalog implemented by
SomeCatalogClass, it looks like this:

spark.sql.catalog.test = com.example.SomeCatalogClass
spark.sql.catalog.test.config-var = value

For example, we have our own catalog, metacat, and we pass a service URI to
it and a property to tell it to use “prod” or “test” tables.

2. How would end-users manage catalogs? like LIST CATALOGS, USE CATALOG xyz?
>
Users and administrators can configure catalogs using properties like I
mentioned above. We could also implement the SQL statements like you
describe here.

Presto uses SHOW CATALOGS [LIKE prefix].

3. How to separate the abilities of catalog? Can we create a bunch of mixin
> triats for catalog API like SupportsTable, SupportsFunction, SupportsView,
> etc.?
>
What I’ve proposed is a base class, CatalogProvider
<https://github.com/apache/spark/pull/21306/files#diff-81c54123a7549b07a9d627353d9cbf95>,
that all catalogs inherit from. A CatalogProvider can be loaded as I
described above and is passed configuration through an initialize method.

Catalog implementations would also implement interfaces that carry a set of
methods for some task. What I’ve proposed is TableCatalog
<https://github.com/apache/spark/pull/21306/files#diff-a06043294c1e2c49a34aa0356f9e5450>
that exposes methods from the Table metadata APIs SPIP.

When a TableCatalog is used in a DDL statement like DROP TABLE, for
example, an analysis rule matches the raw SQL plan, resolves/loads the
catalog, and checks that it is a TableCatalog. Then passes on a logical
plan with the right catalog type:

case class DropTable(catalog: TableCatalog, table: TableIdentifier,
ifExists: Boolean) extends Command

4. How should Spark resolve identifies with catalog name? How to resolve
> ambiguity? What if the catalog doesn't support database? Can users write
> `catalogName.tblName` directly?
>
In #21978 <https://github.com/apache/spark/pull/21978>, I proposed
CatalogTableIdentifier that passes catalog, database, and table name. The
easiest and safest answer is to fill in a “current” catalog when missing
(just like the “current” database) and always interpret 2 identifiers as
database and table, never catalog and table.

How Spark decides to do this is really orthogonal to the catalog API.

5. Where does Spark store the catalog list? In an in-memory map?
>
SparkSession tracks catalog instances. Each catalog is loaded once (unless
we add some statement to reload) and cached in the session. The session is
how the current global catalog is accessed as well.

Another reason why catalogs are session-specific is that they can hold
important session-specific state. For example, Iceberg’s catalog caches
tables when loaded so that the same snapshot of a table is used for all
reads in a query. Not all table formats support this, so it is optional.

6. How to support atomic CTAS?
>
The plan we’ve discussed is to create tables with “staged” changes (see the
SPIP doc). When the write operation commits, all of the changes are
committed at once. I’m flexible on this and I think we have room for other
options as well. The current proposal only covers non-atomic CTAS.

7. The data/schema of table may change over time, when should Spark
> determine the table content? During analysis or planning?
>
Spark loads the table from a catalog during resolution rules, just like it
does with the global catalog now.

8. ...
>
> Since the catalog API is not only developer facing, but also user-facing,
> I think it's better to have a doc explaining what the developers concern
> and what the end users concern. The doc is also good for future reference,
> and can be used in release notes.
>
If you think the SPIP that I posed to the list in April needs extra
information, please let me know.

rb
-- 
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Posted by Wenchen Fan <cl...@gmail.com>.

Hi Ryan,

Thanks for hosting the discussion! I think the table catalog is super
useful, but since this is the first time we allow users to extend catalog,
it's better to write down some details from end-user APIs to internal
management.
1. How would end-users register/unregister catalog with SQL API and
Scala/Java API?
2. How would end-users manage catalogs? like LIST CATALOGS, USE CATALOG xyz?
3. How to separate the abilities of catalog? Can we create a bunch of mixin
triats for catalog API like SupportsTable, SupportsFunction, SupportsView,
etc.?
4. How should Spark resolve identifies with catalog name? How to resolve
ambiguity? What if the catalog doesn't support database? Can users write
`catalogName.tblName` directly?
5. Where does Spark store the catalog list? In an in-memory map?
6. How to support atomic CTAS?
7. The data/schema of table may change over time, when should Spark
determine the table content? During analysis or planning?
8. ...

Since the catalog API is not only developer facing, but also user-facing, I
think it's better to have a doc explaining what the developers concern and
what the end users concern. The doc is also good for future reference, and
can be used in release notes.

Thanks,
Wenchen

On Wed, Nov 28, 2018 at 12:54 PM JackyLee <qc...@163.com> wrote:

> +1
>
> Please add me to the Google Hangout invite.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: DataSourceV2 community sync #3

Posted by JackyLee <qc...@163.com>.

+1

Please add me to the Google Hangout invite. 



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: DataSourceV2 community sync #3

Posted by Martin Junghanns <ma...@neotechnology.com>.

Hi Ryan,

I would like to be added to the Google Hangout invite. Thank you.

Cheers,

Martin

On 26.11.18 23:54, Ryan Blue wrote:
>
> Hi everyone,
>
> I just sent out an invite for the next DSv2 community sync for 
> Wednesday, 28 Nov at 5PM PST.
>
> We have a few topics left over from last time to cover. A few people 
> wanted to cover catalog APIs, so I put two items on the agenda:
>
>   * The TableCatalog proposal (and other catalog APIs)
>   * Using CatalogTableIdentifier to separate v1 and v2 code paths and
>     avoid unintended behavior changes
>
> As I noted in the summary last time, please send topics ahead of time 
> so we can get started more quickly.
>
> If you would like to be added to the google hangout invite, please let 
> me know and I’ll add you. Thanks!
>
> rb
>
> -- 
> Ryan Blue
> Software Engineer
> Netflix