Posted to dev@spark.apache.org by John Zhuge <jz...@apache.org> on 2020/08/11 22:20:46 UTC

Re: SPIP: Catalog API for view metadata

Hi Spark devs,

I'd like to bring more attention to this SPIP. As Dongjoon indicated in the
email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature can
be considered for 3.2 or even 3.1.

View catalog builds on top of the catalog plugin system introduced in
DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
drop views. A catalog plugin can naturally implement both ViewCatalog and
TableCatalog.
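
To make the shape of such a "dual" catalog concrete, here is a minimal
sketch in Python. It is only an illustration, not the proposed Spark
interface (which is a Java API): the class InMemoryDualCatalog, its method
names, and the exception names are hypothetical stand-ins. It assumes that
tables and views share one namespace, so creating a view fails when a table
with the same identifier exists, and loading a view fails when the
identifier refers to a table.

```python
class ViewAlreadyExistsError(Exception):
    """Hypothetical stand-in for a 'view already exists' error."""


class NoSuchViewError(Exception):
    """Hypothetical stand-in for a 'no such view' error."""


class InMemoryDualCatalog:
    """Toy catalog that stores tables and views in one shared namespace."""

    def __init__(self):
        self._tables = {}  # identifier -> list of column names
        self._views = {}   # identifier -> {"sql": ..., "properties": ...}

    def create_table(self, ident, columns):
        # A table may not reuse an identifier already taken by a view.
        if ident in self._tables or ident in self._views:
            raise ValueError(f"identifier already in use: {ident}")
        self._tables[ident] = list(columns)

    def create_view(self, ident, sql, properties=None):
        # A view may not reuse an identifier already taken by a table.
        if ident in self._views or ident in self._tables:
            raise ViewAlreadyExistsError(ident)
        self._views[ident] = {"sql": sql, "properties": dict(properties or {})}
        return self._views[ident]

    def load_view(self, ident):
        # Only views are returned; a table under this name is "no such view".
        if ident not in self._views:
            raise NoSuchViewError(ident)
        return self._views[ident]

    def alter_view(self, ident, **property_changes):
        view = self.load_view(ident)
        view["properties"].update(property_changes)
        return view

    def drop_view(self, ident):
        # Returns True if a view was dropped, False if none existed.
        return self._views.pop(ident, None) is not None
```

A caller would use identifiers such as ("db", "v") and receive
NoSuchViewError when asking load_view for a name that belongs to a table.
A real plugin would persist this metadata in its own backend instead of
in-memory dictionaries.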

Our internal implementation has been in production for over 8 months.
Recently we extended it to support materialized views, for the read path
initially.

The PR has conflicts that I will resolve shortly.

Thanks,

On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:

> Hi everyone,
>
> In order to disassociate view metadata from Hive Metastore and support
> different storage backends, I am proposing a new view catalog API to load,
> create, alter, and drop views.
>
> Document:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
> WIP PR: https://github.com/apache/spark/pull/28147
>
> As part of a project to support common views across query engines like
> Spark and Presto, my team used the view catalog API in Spark
> implementation. The project has been in production over three months.
>
> Thanks,
> John Zhuge
>


-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Thanks Wenchen. Will do.

On Tue, Aug 18, 2020 at 6:38 AM Wenchen Fan <cl...@gmail.com> wrote:

>> AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>
> Can you update your doc to incorporate the cache idea? Let's make sure we
> don't have perf issues if we go with the new View API.
>
> On Tue, Aug 18, 2020 at 4:25 PM John Zhuge <jz...@apache.org> wrote:
>
>> Thanks Burak and Walaa for the feedback!
>>
>> Here are my perspectives:
>>
>>> We shouldn't be persisting things like the schema for a view
>>
>> This is not related to which option to choose because existing code
>> persists schema as well.
>> When resolving the view, the analyzer always parses the view SQL text; it
>> does not use the schema.
>> AFAIK view schema is only used by DESCRIBE.
>>
>>> Why not use TableCatalog.loadTable to load both tables and views
>>
>>> Also, views can be defined on top of either other views or base tables,
>>> so the less divergence in code paths between views and tables the better.
>>
>> Existing Spark takes this approach and there are quite a few checks like
>> "tableType == CatalogTableType.VIEW".
>> View and table metadata surprisingly have very little in common, thus I'd
>> like to group view related code together, separate from table processing.
>> Views are much closer to CTEs. The SPIP proposes a new rule ViewSubstitution
>> in the same "Substitution" batch as CTESubstitution.
>>
>>> This way you avoid multiple RPCs to a catalog or data source or
>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>> less susceptible to race conditions (which still inherently exist).
>>
>> Valid concern. It can be mitigated by caching RPC calls in the catalog
>> implementation. The window for race conditions can also be narrowed
>> significantly but not totally eliminated.
>>
>>
>> On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> Wenchen, agreed with what you said. I was referring to situations where
>>> the underlying table schema evolves (say by introducing a nested field in a
>>> Struct), and also what you mentioned in cases of SELECT *. The Hive
>>> metastore handling of those does not automatically update view schema (even
>>> though executing the view in Hive results in data that has the most recent
>>> schema when underlying tables evolve -- so newly added nested field data
>>> shows up in the view evaluation query result but not in the view schema).
>>>
>>> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> View should have a fixed schema like a table. It should either be
>>>> inferred from the query when creating the view, or be specified by the user
>>>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter
>>>> view schema manually.
>>>>
>>>> Basically a view is just a named SQL query, which mostly has fixed
>>>> schema unless you do something like SELECT *.
>>>>
>>>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>>>> wa.moustafa@gmail.com> wrote:
>>>>
>>>>> +1 to making views special forms of tables. Sometimes a table can
>>>>> be converted to a view to hide some of the implementation details while not
>>>>> impacting readers (provided that the write path is controlled). Also, views
>>>>> can be defined on top of either other views or base tables, so the less
>>>>> divergence in code paths between views and tables the better.
>>>>>
>>>>> For whether to materialize view schema or infer it, one of the issues
>>>>> we face with the HMS approach of materialization is that when the
>>>>> underlying table schema evolves, HMS will still keep the view schema
>>>>> unchanged. This causes a number of discrepancies that we address
>>>>> out-of-band (e.g., run separate pipeline to ensure view schema freshness,
>>>>> or just re-derive it at read time (example derivation algorithm for
>>>>> view Avro schema
>>>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>>>> )).
>>>>>
>>>>> Also regarding SupportsRead vs SupportsWrite, some views can be
>>>>> updateable (example from MySQL
>>>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>>>> also implementing that requires a few concepts that are more prominent in
>>>>> an RDBMS.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:
>>>>>
>>>>>> My high level comment here is that as a naive person, I would expect
>>>>>> a View to be a special form of Table that SupportsRead but doesn't
>>>>>> SupportsWrite. loadTable in the TableCatalog API should load both tables and
>>>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>>>>> less susceptible to race conditions (which still inherently exist).
>>>>>>
>>>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>>>>> schema for a view.
>>>>>>
>>>>>> What do people think of making Views a special form of Table?
>>>>>>
>>>>>> Best,
>>>>>> Burak
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks Ryan.
>>>>>>>
>>>>>>> ViewCatalog API mimics TableCatalog API including how shared
>>>>>>> namespace is handled:
>>>>>>>
>>>>>>>    - The doc for createView
>>>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>>>>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>>>>>    exists for the identifier."
>>>>>>>    - The doc for loadView
>>>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>>>>>>    "If the catalog supports tables and contains a table for the identifier and
>>>>>>>    not a view, this must throw NoSuchViewException."
>>>>>>>
>>>>>>> Agree it is good to explicitly specify the order of resolution. I
>>>>>>> will add a section in ViewCatalog javadoc to summarize the behavior for
>>>>>>> "shared namespace". The loadView doc will also be updated to spell out the
>>>>>>> order of resolution.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>>>>> behavior. For example, I think that we would agree that CREATE
>>>>>>>> VIEW catalog.schema.name should fail when there is a table named
>>>>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>>>>> documentation for the TableCatalog API
>>>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>>>>> where create should fail if a view exists for the identifier.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think it was simply assumed that we would use the same approach —
>>>>>>>> the API requires that table and view names share a namespace. But it would
>>>>>>>> be good to specifically note either the order in which resolution will
>>>>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Wenchen,
>>>>>>>>>
>>>>>>>>> Thanks for the feedback!
>>>>>>>>>
>>>>>>>>> 1. Add a new View API. How to avoid name conflicts between table
>>>>>>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>>>>>>> view catalog?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>>>>
>>>>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>>>>>>    "dual" catalog.
>>>>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>>>>       name exists should fail.
>>>>>>>>>       -  Creating a table in table catalog when a view of the
>>>>>>>>>       same name exists should fail as well.
>>>>>>>>>
>>>>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>>>>> notes:
>>>>>>>>>
>>>>>>>>>    - We actually started a common view prototype using the single
>>>>>>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>>>>>>    them in table properties became unmanageable, especially for features
>>>>>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>>>>>>    - We'd like to move away from Hive metastore
>>>>>>>>>
>>>>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>>>>> Motivation".
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>>>>> catalog plugin API.
>>>>>>>>>>
>>>>>>>>>> After reading your doc, I have one high-level question: should
>>>>>>>>>> views be a separate API, or just a special type of table?
>>>>>>>>>>
>>>>>>>>>> AFAIK in most databases, tables and views share the same
>>>>>>>>>> namespace. You can't create a view if a same-name table exists. In Hive,
>>>>>>>>>> view is just a special type of table, so they are in the same namespace
>>>>>>>>>> naturally. If we have both table catalog and view catalog, we need a
>>>>>>>>>> mechanism to make sure there are no name conflicts.
>>>>>>>>>>
>>>>>>>>>> On the other hand, the view metadata is simple enough to be
>>>>>>>>>> put in table properties. I'd like to see more thoughts to evaluate these 2
>>>>>>>>>> approaches:
>>>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between
>>>>>>>>>> table and view? When resolving relation, shall we lookup table catalog
>>>>>>>>>> first or view catalog?
>>>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if
>>>>>>>>>> we do want to store table and views separately?
>>>>>>>>>>
>>>>>>>>>> I think a new View API is more flexible. I'd vote for it if we
>>>>>>>>>> can come up with a good mechanism to avoid name conflicts.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature Expectation (Dec.
>>>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>>>>
>>>>>>>>>>> View catalog builds on top of the catalog plugin system
>>>>>>>>>>> introduced in DataSourceV2. It adds the “ViewCatalog” API to load, create,
>>>>>>>>>>> alter, and drop views. A catalog plugin can naturally implement both
>>>>>>>>>>> ViewCatalog and TableCatalog.
>>>>>>>>>>>
>>>>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>>>>> months. Recently we extended it to support materialized views, for the read
>>>>>>>>>>> path initially.
>>>>>>>>>>>
>>>>>>>>>>> The PR has conflicts that I will resolve shortly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>>>>
>>>>>>>>>>>> Document:
>>>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>>>>
>>>>>>>>>>>> As part of a project to support common views across query
>>>>>>>>>>>> engines like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>>>>>> implementation. The project has been in production over three months.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> John Zhuge
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>>
>>
>
> --
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Holden has graciously agreed to shepherd the SPIP. Thanks!

On Thu, Feb 10, 2022 at 9:19 AM John Zhuge <jz...@apache.org> wrote:

> The vote is now closed and the vote passes. Thank you to everyone who took
> the time to review and vote on this SPIP. I’m looking forward to adding
> this feature to the next Spark release. The tracking JIRA is
> https://issues.apache.org/jira/browse/SPARK-31357.
>
> The tally is:
>
> +1s:
>
> Walaa Eldin Moustafa
> Erik Krogen
> Holden Karau (binding)
> Ryan Blue
> Chao Sun
> L C Hsieh (binding)
> Huaxin Gao
> Yufei Gu
> Terry Kim
> Jacky Lee
> Wenchen Fan (binding)
>
> 0s:
>
> -1s:
>
> On Mon, Feb 7, 2022 at 10:04 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> +1 (binding)
>>
>> On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee <qc...@gmail.com> wrote:
>>
>>> +1 (non-binding). Thanks John!
>>> It's great to see ViewCatalog moving on, it's a nice feature.
>>>
>>> On Sat, Feb 5, 2022 at 11:57 AM Terry Kim <yu...@gmail.com> wrote:
>>>
>>>> +1 (non-binding). Thanks John!
>>>>
>>>> Terry
>>>>
>>>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu <fl...@gmail.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>> Best,
>>>>>
>>>>> Yufei
>>>>>
>>>>> `This is not a contribution`
>>>>>
>>>>>
>>>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>>>>>>> >
>>>>>>> > +1 (non-binding). Looking forward to this feature!
>>>>>>> >
>>>>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>> >>
>>>>>>> >> +1 for the SPIP. I think it's well designed and it has worked
>>>>>>> quite well at Netflix for a long time.
>>>>>>> >>
>>>>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Spark community,
>>>>>>> >>>
>>>>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>>>>> (SPIP).
>>>>>>> >>>
>>>>>>> >>> The proposal is to add a ViewCatalog interface that can be used
>>>>>>> to load, create, alter, and drop views in DataSourceV2.
>>>>>>> >>>
>>>>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>>>>> >>>
>>>>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> >>> [ ] +0
>>>>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>>>>> >>>
>>>>>>> >>> Thanks!
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Ryan Blue
>>>>>>> >> Tabular
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>
> --
> John Zhuge
>


-- 
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
The vote is now closed and the vote passes. Thank you to everyone who took
the time to review and vote on this SPIP. I’m looking forward to adding
this feature to the next Spark release. The tracking JIRA is
https://issues.apache.org/jira/browse/SPARK-31357.

The tally is:

+1s:

Walaa Eldin Moustafa
Erik Krogen
Holden Karau (binding)
Ryan Blue
Chao Sun
L C Hsieh (binding)
Huaxin Gao
Yufei Gu
Terry Kim
Jacky Lee
Wenchen Fan (binding)

0s:

-1s:

On Mon, Feb 7, 2022 at 10:04 PM Wenchen Fan <cl...@gmail.com> wrote:

> +1 (binding)
>
> On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee <qc...@gmail.com> wrote:
>
>> +1 (non-binding). Thanks John!
>> It's great to see ViewCatalog moving on, it's a nice feature.
>>
>> On Sat, Feb 5, 2022 at 11:57 AM Terry Kim <yu...@gmail.com> wrote:
>>
>>> +1 (non-binding). Thanks John!
>>>
>>> Terry
>>>
>>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu <fl...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>> `This is not a contribution`
>>>>
>>>>
>>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>>>>>> >
>>>>>> > +1 (non-binding). Looking forward to this feature!
>>>>>> >
>>>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>> >>
>>>>>> >> +1 for the SPIP. I think it's well designed and it has worked
>>>>>> quite well at Netflix for a long time.
>>>>>> >>
>>>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Hi Spark community,
>>>>>> >>>
>>>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>>>> (SPIP).
>>>>>> >>>
>>>>>> >>> The proposal is to add a ViewCatalog interface that can be used
>>>>>> to load, create, alter, and drop views in DataSourceV2.
>>>>>> >>>
>>>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>>>> >>>
>>>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> >>> [ ] +0
>>>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>>>> >>>
>>>>>> >>> Thanks!
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Ryan Blue
>>>>>> >> Tabular
>>>>>>

-- 
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
+1 (binding)

On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee <qc...@gmail.com> wrote:

> +1 (non-binding). Thanks John!
> It's great to see ViewCatalog moving on, it's a nice feature.
>
> On Sat, Feb 5, 2022 at 11:57 AM Terry Kim <yu...@gmail.com> wrote:
>
>> +1 (non-binding). Thanks John!
>>
>> Terry
>>
>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu <fl...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com>
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>>>>> >
>>>>> > +1 (non-binding). Looking forward to this feature!
>>>>> >
>>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>> >>
>>>>> >> +1 for the SPIP. I think it's well designed and it has worked quite
>>>>> well at Netflix for a long time.
>>>>> >>
>>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org>
>>>>> wrote:
>>>>> >>>
>>>>> >>> Hi Spark community,
>>>>> >>>
>>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>>> (SPIP).
>>>>> >>>
>>>>> >>> The proposal is to add a ViewCatalog interface that can be used to
>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>> >>>
>>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>>> >>>
>>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>>> >>> [ ] +0
>>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>>> >>>
>>>>> >>> Thanks!
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Ryan Blue
>>>>> >> Tabular
>>>>>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Jacky Lee <qc...@gmail.com>.
+1 (non-binding). Thanks John!
It's great to see ViewCatalog moving on, it's a nice feature.

On Sat, Feb 5, 2022 at 11:57 AM Terry Kim <yu...@gmail.com> wrote:

> +1 (non-binding). Thanks John!
>
> Terry
>
> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu <fl...@gmail.com> wrote:
>
>> +1 (non-binding)
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>>>> >
>>>> > +1 (non-binding). Looking forward to this feature!
>>>> >
>>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>>> >>
>>>> >> +1 for the SPIP. I think it's well designed and it has worked quite
>>>> well at Netflix for a long time.
>>>> >>
>>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
>>>> >>>
>>>> >>> Hi Spark community,
>>>> >>>
>>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>>> (SPIP).
>>>> >>>
>>>> >>> The proposal is to add a ViewCatalog interface that can be used to
>>>> load, create, alter, and drop views in DataSourceV2.
>>>> >>>
>>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>> >>>
>>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>>> >>> [ ] +0
>>>> >>> [ ] -1: I don’t think this is a good idea because …
>>>> >>>
>>>> >>> Thanks!
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Ryan Blue
>>>> >> Tabular
>>>>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Terry Kim <yu...@gmail.com>.
+1 (non-binding). Thanks John!

Terry

On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu <fl...@gmail.com> wrote:

> +1 (non-binding)
> Best,
>
> Yufei
>
> `This is not a contribution`
>
>
> On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>>> >
>>> > +1 (non-binding). Looking forward to this feature!
>>> >
>>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>> >>
>>> >> +1 for the SPIP. I think it's well designed and it has worked quite
>>> well at Netflix for a long time.
>>> >>
>>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
>>> >>>
>>> >>> Hi Spark community,
>>> >>>
>>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>>> (SPIP).
>>> >>>
>>> >>> The proposal is to add a ViewCatalog interface that can be used to
>>> load, create, alter, and drop views in DataSourceV2.
>>> >>>
>>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>> >>>
>>> >>> [ ] +1: Accept the proposal as an official SPIP
>>> >>> [ ] +0
>>> >>> [ ] -1: I don’t think this is a good idea because …
>>> >>>
>>> >>> Thanks!
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Ryan Blue
>>> >> Tabular
>>>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Yufei Gu <fl...@gmail.com>.
+1 (non-binding)
Best,

Yufei

`This is not a contribution`


On Fri, Feb 4, 2022 at 11:54 AM huaxin gao <hu...@gmail.com> wrote:

> +1 (non-binding)
>
> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:
>
>> +1
>>
>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>> >
>> > +1 (non-binding). Looking forward to this feature!
>> >
>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>> >>
>> >> +1 for the SPIP. I think it's well designed and it has worked quite
>> well at Netflix for a long time.
>> >>
>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
>> >>>
>> >>> Hi Spark community,
>> >>>
>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>> (SPIP).
>> >>>
>> >>> The proposal is to add a ViewCatalog interface that can be used to
>> load, create, alter, and drop views in DataSourceV2.
>> >>>
>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>> >>>
>> >>> [ ] +1: Accept the proposal as an official SPIP
>> >>> [ ] +0
>> >>> [ ] -1: I don’t think this is a good idea because …
>> >>>
>> >>> Thanks!
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Tabular
>>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by huaxin gao <hu...@gmail.com>.
+1 (non-binding)

On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh <vi...@gmail.com> wrote:

> +1
>
> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
> >
> > +1 (non-binding). Looking forward to this feature!
> >
> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
> >>
> >> +1 for the SPIP. I think it's well designed and it has worked quite
> well at Netflix for a long time.
> >>
> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
> >>>
> >>> Hi Spark community,
> >>>
> >>> I’d like to restart the vote for the ViewCatalog design proposal
> (SPIP).
> >>>
> >>> The proposal is to add a ViewCatalog interface that can be used to
> load, create, alter, and drop views in DataSourceV2.
> >>>
> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
> >>>
> >>> [ ] +1: Accept the proposal as an official SPIP
> >>> [ ] +0
> >>> [ ] -1: I don’t think this is a good idea because …
> >>>
> >>> Thanks!
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Tabular
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by "L. C. Hsieh" <vi...@gmail.com>.
+1

On Thu, Feb 3, 2022 at 7:25 PM Chao Sun <su...@apache.org> wrote:
>
> +1 (non-binding). Looking forward to this feature!
>
> On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>> +1 for the SPIP. I think it's well designed and it has worked quite well at Netflix for a long time.
>>
>> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
>>>
>>> Hi Spark community,
>>>
>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load, create, alter, and drop views in DataSourceV2.
>>>
>>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>
>>
>>
>> --
>> Ryan Blue
>> Tabular


Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Chao Sun <su...@apache.org>.
+1 (non-binding). Looking forward to this feature!

On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue <bl...@tabular.io> wrote:

> +1 for the SPIP. I think it's well designed and it has worked quite well
> at Netflix for a long time.
>
> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:
>
>> Hi Spark community,
>>
>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>
>
> --
> Ryan Blue
> Tabular
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Ryan Blue <bl...@tabular.io>.
+1 for the SPIP. I think it's well designed and it has worked quite well at
Netflix for a long time.

On Thu, Feb 3, 2022 at 2:04 PM John Zhuge <jz...@apache.org> wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP until Feb. 9th (Wednesday).
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>


-- 
Ryan Blue
Tabular

[VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP until Feb. 9th (Wednesday).

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Sure Xiao.

Happy Lunar New Year!

On Thu, Feb 3, 2022 at 1:57 PM Xiao Li <ga...@gmail.com> wrote:

> Can we extend the voting window to next Wednesday? This week is a holiday
> week for the lunar new year. AFAIK, many members in Asia are taking the
> whole week off. They might not regularly check the emails.
>
> Also how about starting a separate email thread starting with [VOTE] ?
>
> Happy Lunar New Year!!!
>
> Xiao
>
> On Thu, Feb 3, 2022 at 12:28 PM Holden Karau <ho...@pigscanfly.ca> wrote:
>
>> +1 (binding)
>>
>> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen <xk...@apache.org> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Really looking forward to having this natively supported by Spark, so
>>> that we can get rid of our own hacks to tie in a custom view catalog
>>> implementation. I appreciate the care John has put into various parts of
>>> the design and believe this will provide a robust and flexible solution to
>>> this problem faced by various large-scale Spark users.
>>>
>>> Thanks John!
>>>
>>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>>> wa.moustafa@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi Spark community,
>>>>>
>>>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>>> ).
>>>>>
>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>
>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>> I’ll update the PR <https://github.com/apache/spark/pull/28147> for
>>>>> review.
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>>>> wa.moustafa@gmail.com> wrote:
>>>>>
>>>>>> Considering the API aspect, the ViewCatalog API sounds like a good
>>>>>> idea. A view catalog will enable us to integrate Coral
>>>>>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>>>>>> translation and management layer) very cleanly to Spark. Currently we can
>>>>>> only do it by maintaining our special version of the
>>>>>> HiveExternalCatalog. Considering that views can be expanded
>>>>>> syntactically without necessarily invoking the analyzer, using a dedicated
>>>>>> view API can make performance better if performance is the concern.
>>>>>> Further, a catalog can still be both a table and view provider if it
>>>>>> chooses to based on this design, so I do not think we necessarily lose the
>>>>>> ability of providing both. Looking forward to more discussions on this and
>>>>>> making views a powerful tool in Spark.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>>>>>>
>>>>>>> Looks like we are running in circles. Should we have an online
>>>>>>> meeting to get this sorted out?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John
>>>>>>>
>>>>>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> OK, then I'd vote for TableViewCatalog, because
>>>>>>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>>>>>>> catalog to the v2 API sooner or later.
>>>>>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>>>>>> table/view resolution framework.
>>>>>>>> 3. It's better to avoid name conflicts between table and views at
>>>>>>>> the API level, instead of relying on the catalog implementation.
>>>>>>>> 4. Caching invalidation is always a tricky problem.
>>>>>>>>
>>>>>>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I don't think that it makes sense to discuss a different approach
>>>>>>>>> in the PR rather than in the vote. Let's discuss this now since that's the
>>>>>>>>> purpose of an SPIP.
>>>>>>>>>
>>>>>>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>>>>>>> proposal (SPIP).
>>>>>>>>>>
>>>>>>>>>> The proposal is to add a ViewCatalog interface that can be used
>>>>>>>>>> to load, create, alter, and drop views in DataSourceV2.
>>>>>>>>>>
>>>>>>>>>> The full SPIP doc is here:
>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>>
>>>>>>>>>> Please vote on the SPIP in the next 72 hours. Once it is
>>>>>>>>>> approved, I’ll update the PR for review.
>>>>>>>>>>
>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>> [ ] +0
>>>>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Xiao Li <ga...@gmail.com>.
Can we extend the voting window to next Wednesday? This week is a holiday
week for the Lunar New Year. AFAIK, many members in Asia are taking the
whole week off and might not check their email regularly.

Also, how about starting a separate email thread with a subject starting
with [VOTE]?

Happy Lunar New Year!!!

Xiao

Holden Karau <ho...@pigscanfly.ca> 于2022年2月3日周四 12:28写道:

> +1 (binding)
>
> On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen <xk...@apache.org> wrote:
>
>> +1 (non-binding)
>>
>> Really looking forward to having this natively supported by Spark, so
>> that we can get rid of our own hacks to tie in a custom view catalog
>> implementation. I appreciate the care John has put into various parts of
>> the design and believe this will provide a robust and flexible solution to
>> this problem faced by various large-scale Spark users.
>>
>> Thanks John!
>>
>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Hi Spark community,
>>>>
>>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>> ).
>>>>
>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>> load, create, alter, and drop views in DataSourceV2.
>>>>
>>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>>> update the PR <https://github.com/apache/spark/pull/28147> for review.
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don’t think this is a good idea because …
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>>> wa.moustafa@gmail.com> wrote:
>>>>
>>>>> Considering the API aspect, the ViewCatalog API sounds like a good
>>>>> idea. A view catalog will enable us to integrate Coral
>>>>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>>>>> translation and management layer) very cleanly to Spark. Currently we can
>>>>> only do it by maintaining our special version of the
>>>>> HiveExternalCatalog. Considering that views can be expanded
>>>>> syntactically without necessarily invoking the analyzer, using a dedicated
>>>>> view API can make performance better if performance is the concern.
>>>>> Further, a catalog can still be both a table and view provider if it
>>>>> chooses to based on this design, so I do not think we necessarily lose the
>>>>> ability of providing both. Looking forward to more discussions on this and
>>>>> making views a powerful tool in Spark.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Looks like we are running in circles. Should we have an online
>>>>>> meeting to get this sorted out?
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
>>>>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, then I'd vote for TableViewCatalog, because
>>>>>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>>>>>> catalog to the v2 API sooner or later.
>>>>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>>>>> table/view resolution framework.
>>>>>>> 3. It's better to avoid name conflicts between table and views at
>>>>>>> the API level, instead of relying on the catalog implementation.
>>>>>>> 4. Caching invalidation is always a tricky problem.
>>>>>>>
>>>>>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I don't think that it makes sense to discuss a different approach
>>>>>>>> in the PR rather than in the vote. Let's discuss this now since that's the
>>>>>>>> purpose of an SPIP.
>>>>>>>>
>>>>>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>>>>>> proposal (SPIP).
>>>>>>>>>
>>>>>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>>>>>
>>>>>>>>> The full SPIP doc is here:
>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>>>>>> I’ll update the PR for review.
>>>>>>>>>
>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>> [ ] +0
>>>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Holden Karau <ho...@pigscanfly.ca>.
+1 (binding)

On Thu, Feb 3, 2022 at 2:26 PM Erik Krogen <xk...@apache.org> wrote:

> +1 (non-binding)
>
> Really looking forward to having this natively supported by Spark, so that
> we can get rid of our own hacks to tie in a custom view catalog
> implementation. I appreciate the care John has put into various parts of
> the design and believe this will provide a robust and flexible solution to
> this problem faced by various large-scale Spark users.
>
> Thanks John!
>
> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
> wa.moustafa@gmail.com> wrote:
>
>> +1
>>
>> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Hi Spark community,
>>>
>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>> ).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load,
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> update the PR <https://github.com/apache/spark/pull/28147> for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>>
>>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>>> wa.moustafa@gmail.com> wrote:
>>>
>>>> Considering the API aspect, the ViewCatalog API sounds like a good
>>>> idea. A view catalog will enable us to integrate Coral
>>>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>>>> translation and management layer) very cleanly to Spark. Currently we can
>>>> only do it by maintaining our special version of the
>>>> HiveExternalCatalog. Considering that views can be expanded
>>>> syntactically without necessarily invoking the analyzer, using a dedicated
>>>> view API can make performance better if performance is the concern.
>>>> Further, a catalog can still be both a table and view provider if it
>>>> chooses to based on this design, so I do not think we necessarily lose the
>>>> ability of providing both. Looking forward to more discussions on this and
>>>> making views a powerful tool in Spark.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Looks like we are running in circles. Should we have an online meeting
>>>>> to get this sorted out?
>>>>>
>>>>> Thanks,
>>>>> John
>>>>>
>>>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> OK, then I'd vote for TableViewCatalog, because
>>>>>> 1. This is how Hive catalog works, and we need to migrate Hive
>>>>>> catalog to the v2 API sooner or later.
>>>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>>>> table/view resolution framework.
>>>>>> 3. It's better to avoid name conflicts between table and views at the
>>>>>> API level, instead of relying on the catalog implementation.
>>>>>> 4. Caching invalidation is always a tricky problem.
>>>>>>
>>>>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> I don't think that it makes sense to discuss a different approach in
>>>>>>> the PR rather than in the vote. Let's discuss this now since that's the
>>>>>>> purpose of an SPIP.
>>>>>>>
>>>>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>>>>> proposal (SPIP).
>>>>>>>>
>>>>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>>>>
>>>>>>>> The full SPIP doc is here:
>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>
>>>>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>>>>> I’ll update the PR for review.
>>>>>>>>
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> [ ] +0
>>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Erik Krogen <xk...@apache.org>.
+1 (non-binding)

Really looking forward to having this natively supported by Spark, so that
we can get rid of our own hacks to tie in a custom view catalog
implementation. I appreciate the care John has put into various parts of
the design and believe this will provide a robust and flexible solution to
this problem faced by various large-scale Spark users.

Thanks John!

On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> +1
>
> On Thu, Feb 3, 2022 at 11:19 AM John Zhuge <jz...@apache.org> wrote:
>
>> Hi Spark community,
>>
>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>> ).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>> update the PR <https://github.com/apache/spark/pull/28147> for review.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> Considering the API aspect, the ViewCatalog API sounds like a good idea.
>>> A view catalog will enable us to integrate Coral
>>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>>> translation and management layer) very cleanly to Spark. Currently we can
>>> only do it by maintaining our special version of the HiveExternalCatalog.
>>> Considering that views can be expanded syntactically without necessarily
>>> invoking the analyzer, using a dedicated view API can make performance
>>> better if performance is the concern. Further, a catalog can still be both
>>> a table and view provider if it chooses to based on this design, so I do
>>> not think we necessarily lose the ability of providing both. Looking
>>> forward to more discussions on this and making views a powerful tool in
>>> Spark.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Looks like we are running in circles. Should we have an online meeting
>>>> to get this sorted out?
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> OK, then I'd vote for TableViewCatalog, because
>>>>> 1. This is how Hive catalog works, and we need to migrate Hive catalog
>>>>> to the v2 API sooner or later.
>>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>>> table/view resolution framework.
>>>>> 3. It's better to avoid name conflicts between table and views at the
>>>>> API level, instead of relying on the catalog implementation.
>>>>> 4. Caching invalidation is always a tricky problem.
>>>>>
>>>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I don't think that it makes sense to discuss a different approach in
>>>>>> the PR rather than in the vote. Let's discuss this now since that's the
>>>>>> purpose of an SPIP.
>>>>>>
>>>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>>>> proposal (SPIP).
>>>>>>>
>>>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>>>
>>>>>>> The full SPIP doc is here:
>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>
>>>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>>>> I’ll update the PR for review.
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
+1

On Thu, Feb 3, 2022 at 11:19 AM John Zhuge <jz...@apache.org> wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP
> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
> ).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR <https://github.com/apache/spark/pull/28147> for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
> On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <wa...@gmail.com>
> wrote:
>
>> Considering the API aspect, the ViewCatalog API sounds like a good idea.
>> A view catalog will enable us to integrate Coral
>> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
>> translation and management layer) very cleanly to Spark. Currently we can
>> only do it by maintaining our special version of the HiveExternalCatalog.
>> Considering that views can be expanded syntactically without necessarily
>> invoking the analyzer, using a dedicated view API can make performance
>> better if performance is the concern. Further, a catalog can still be both
>> a table and view provider if it chooses to based on this design, so I do
>> not think we necessarily lose the ability of providing both. Looking
>> forward to more discussions on this and making views a powerful tool in
>> Spark.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Looks like we are running in circles. Should we have an online meeting
>>> to get this sorted out?
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com>
>>> wrote:
>>>
>>>> OK, then I'd vote for TableViewCatalog, because
>>>> 1. This is how Hive catalog works, and we need to migrate Hive catalog
>>>> to the v2 API sooner or later.
>>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>>> table/view resolution framework.
>>>> 3. It's better to avoid name conflicts between table and views at the
>>>> API level, instead of relying on the catalog implementation.
>>>> 4. Caching invalidation is always a tricky problem.
>>>>
>>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I don't think that it makes sense to discuss a different approach in
>>>>> the PR rather than in the vote. Let's discuss this now since that's the
>>>>> purpose of an SPIP.
>>>>>
>>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>>> proposal (SPIP).
>>>>>>
>>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>>
>>>>>> The full SPIP doc is here:
>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>
>>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>>> I’ll update the PR for review.
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
update the PR <https://github.com/apache/spark/pull/28147> for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!

On Fri, Jun 4, 2021 at 1:46 PM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> Considering the API aspect, the ViewCatalog API sounds like a good idea. A
> view catalog will enable us to integrate Coral
> <https://engineering.linkedin.com/blog/2020/coral> (our view SQL
> translation and management layer) very cleanly to Spark. Currently we can
> only do it by maintaining our special version of the HiveExternalCatalog.
> Considering that views can be expanded syntactically without necessarily
> invoking the analyzer, using a dedicated view API can make performance
> better if performance is the concern. Further, a catalog can still be both
> a table and view provider if it chooses to based on this design, so I do
> not think we necessarily lose the ability of providing both. Looking
> forward to more discussions on this and making views a powerful tool in
> Spark.
>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:
>
>> Looks like we are running in circles. Should we have an online meeting to
>> get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> OK, then I'd vote for TableViewCatalog, because
>>> 1. This is how Hive catalog works, and we need to migrate Hive catalog
>>> to the v2 API sooner or later.
>>> 2. Because of 1, TableViewCatalog is easy to support in the current
>>> table/view resolution framework.
>>> 3. It's better to avoid name conflicts between table and views at the
>>> API level, instead of relying on the catalog implementation.
>>> 4. Caching invalidation is always a tricky problem.
>>>
>>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I don't think that it makes sense to discuss a different approach in
>>>> the PR rather than in the vote. Let's discuss this now since that's the
>>>> purpose of an SPIP.
>>>>
>>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>>> proposal (SPIP).
>>>>>
>>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>>> load, create, alter, and drop views in DataSourceV2.
>>>>>
>>>>> The full SPIP doc is here:
>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>
>>>>> Please vote on the SPIP in the next 72 hours. Once it is approved,
>>>>> I’ll update the PR for review.
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
Considering the API aspect, the ViewCatalog API sounds like a good idea. A
view catalog will enable us to integrate Coral
<https://engineering.linkedin.com/blog/2020/coral> (our view SQL
translation and management layer) very cleanly with Spark. Currently we can
only do it by maintaining our own special version of the HiveExternalCatalog.
Since views can be expanded syntactically without necessarily invoking the
analyzer, a dedicated view API can improve performance, if performance is
the concern. Further, under this design a catalog can still be both a table
and a view provider if it chooses to, so I do not think we necessarily lose
the ability to provide both. Looking forward to more discussions on this and
to making views a powerful tool in Spark.

Thanks,
Walaa.


On Wed, May 26, 2021 at 9:54 AM John Zhuge <jz...@apache.org> wrote:

> Looks like we are running in circles. Should we have an online meeting to
> get this sorted out?
>
> Thanks,
> John
>
> On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> OK, then I'd vote for TableViewCatalog, because
>> 1. This is how Hive catalog works, and we need to migrate Hive catalog to
>> the v2 API sooner or later.
>> 2. Because of 1, TableViewCatalog is easy to support in the current
>> table/view resolution framework.
>> 3. It's better to avoid name conflicts between table and views at the API
>> level, instead of relying on the catalog implementation.
>> 4. Caching invalidation is always a tricky problem.
>>
>> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I don't think that it makes sense to discuss a different approach in the
>>> PR rather than in the vote. Let's discuss this now since that's the purpose
>>> of an SPIP.
>>>
>>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>>> proposal (SPIP).
>>>>
>>>> The proposal is to add a ViewCatalog interface that can be used to
>>>> load, create, alter, and drop views in DataSourceV2.
>>>>
>>>> The full SPIP doc is here:
>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>
>>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>>> update the PR for review.
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don’t think this is a good idea because …
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> John Zhuge
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Looks like we are running in circles. Should we have an online meeting to
get this sorted out?

Thanks,
John

On Wed, May 26, 2021 at 12:01 AM Wenchen Fan <cl...@gmail.com> wrote:

> OK, then I'd vote for TableViewCatalog, because
> 1. This is how Hive catalog works, and we need to migrate Hive catalog to
> the v2 API sooner or later.
> 2. Because of 1, TableViewCatalog is easy to support in the current
> table/view resolution framework.
> 3. It's better to avoid name conflicts between table and views at the API
> level, instead of relying on the catalog implementation.
> 4. Caching invalidation is always a tricky problem.
>
> On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I don't think that it makes sense to discuss a different approach in the
>> PR rather than in the vote. Let's discuss this now since that's the purpose
>> of an SPIP.
>>
>> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Hi everyone, I’d like to start a vote for the ViewCatalog design
>>> proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load,
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> The full SPIP doc is here:
>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>
>>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>>> update the PR for review.
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
OK, then I'd vote for TableViewCatalog, because
1. This is how Hive catalog works, and we need to migrate Hive catalog to
the v2 API sooner or later.
2. Because of 1, TableViewCatalog is easy to support in the current
table/view resolution framework.
3. It's better to avoid name conflicts between tables and views at the API
level, instead of relying on the catalog implementation.
4. Cache invalidation is always a tricky problem.
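
To make point 3 concrete, here is a minimal sketch of what "avoiding name
conflicts at the API level" could look like: tables and views share one
namespace, so the conflict check lives in a single place instead of in each
catalog implementation. This is an in-memory Python illustration only. The
class and method names (CombinedTableViewCatalog, create_table, create_view,
load) are assumptions made for this example and are not the TableViewCatalog
API discussed in this thread or anything in Spark.

```python
from typing import Dict, Tuple

# Hypothetical multi-part identifier, e.g. ("db", "name").
Identifier = Tuple[str, ...]


class NameConflictError(Exception):
    """Raised when an identifier is already taken by a table or a view."""


class CombinedTableViewCatalog:
    """Illustrative catalog keeping tables and views in ONE namespace."""

    def __init__(self) -> None:
        # identifier -> (kind, payload), where kind is "table" or "view"
        self._entries: Dict[Identifier, Tuple[str, str]] = {}

    def _create(self, kind: str, ident: Identifier, payload: str) -> None:
        existing = self._entries.get(ident)
        if existing is not None:
            # The shared map makes table/view conflicts impossible to miss.
            raise NameConflictError(
                f"{'.'.join(ident)} already exists as a {existing[0]}"
            )
        self._entries[ident] = (kind, payload)

    def create_table(self, ident: Identifier, schema: str) -> None:
        self._create("table", ident, schema)

    def create_view(self, ident: Identifier, sql: str) -> None:
        self._create("view", ident, sql)

    def load(self, ident: Identifier) -> Tuple[str, str]:
        # One lookup answers both "does it exist?" and "table or view?".
        return self._entries[ident]
```

With two independent catalogs, the same check would need a cross-lookup in
each create call, which is the reliance on the catalog implementation that
point 3 refers to.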

On Tue, May 25, 2021 at 3:09 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I don't think that it makes sense to discuss a different approach in the
> PR rather than in the vote. Let's discuss this now since that's the purpose
> of an SPIP.
>
> On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:
>
>> Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
>> (SPIP).
>>
>> The proposal is to add a ViewCatalog interface that can be used to load,
>> create, alter, and drop views in DataSourceV2.
>>
>> The full SPIP doc is here:
>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>
>> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
>> update the PR for review.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] SPIP: Catalog API for view metadata

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't think that it makes sense to discuss a different approach in the PR
rather than in the vote. Let's discuss this now since that's the purpose of
an SPIP.

On Mon, May 24, 2021 at 11:22 AM John Zhuge <jz...@apache.org> wrote:

> Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
> (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>


-- 
Ryan Blue
Software Engineer
Netflix

[VOTE] SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
(SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

The full SPIP doc is here:
https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
update the PR for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan <cl...@gmail.com> wrote:

> Yea let's move forward first. We can discuss the caching approach
> and TableViewCatalog approach during the PR review.
>
> On Tue, May 25, 2021 at 1:48 AM John Zhuge <jz...@apache.org> wrote:
>
>> Hi everyone,
>>
>> Is there any more discussion before we start a vote on ViewCatalog? With
>> FunctionCatalog merged, I hope this feature can complete the offerings of
>> catalog plugins in 3.2.
>>
>> Once approved, I will refresh the WIP PR. Implementation details can be
>> ironed out during review.
>>
>> Thanks,
>>
>> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> An extra RPC call is a concern for the catalog implementation. It is
>>> simple to cache the result of a call to avoid a second one if the catalog
>>> chooses.
>>>
>>> I don't think that an extra RPC that can be easily avoided is a
>>> reasonable justification to add caches in Spark. For one thing, it doesn't
>>> solve the problem because the proposed API still requires separate lookups
>>> for tables and views.
>>>
>>> The only solution that would help is to use a combined trait, but that
>>> has issues. For one, view substitution is much cleaner when it happens well
>>> before table resolution. And, View and Table are very different objects;
>>> returning Object from this API doesn't make much sense.
>>>
>>> One extra RPC is not unreasonable, and the choice should be left to
>>> sources. That's the easiest place to cache results from the underlying
>>> store.
>>>
>>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> Moving back the discussion to this thread. The current argument is how
>>>> to avoid extra RPC calls for catalogs supporting both table and view. There
>>>> are several options:
>>>> 1. ignore it as extra RPC calls are cheap compared to the query
>>>> execution
>>>> 2. have a per session cache for loaded table/view
>>>> 3. have a per query cache for loaded table/view
>>>> 4. add a new trait TableViewCatalog
>>>>
>>>> I think it's important to avoid perf regression with new APIs. RPC
>>>> calls can be significant for short queries. We may also double the RPC
>>>> traffic which is bad for the metastore service. Normally I would not
>>>> recommend caching as cache invalidation is a hard problem. Personally I
>>>> prefer option 4 as it only affects catalogs that support both table and
>>>> view, and it fits the hive catalog very well.
>>>>
>>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> SPIP
>>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>>> has been updated. Please review.
>>>>>
>>>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>>>
>>>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Any updates here? I agree that a new View API is better, but we need
>>>>>>> a solution to avoid performance regression. We need to elaborate on the
>>>>>>> cache idea.
>>>>>>>
>>>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>>>
>>>>>>>> The main two arguments I’ve heard for combining lookup into a
>>>>>>>> single function are the ones brought up in this thread. First, an
>>>>>>>> identifier in a catalog must be either a view or a table and should not
>>>>>>>> collide. Second, a single lookup is more likely to require a single RPC. I
>>>>>>>> think the RPC concern is well addressed by caching, which we already do in
>>>>>>>> the Spark catalog, so I’ll primarily focus on the first.
>>>>>>>>
>>>>>>>> Table/view name collision is unlikely to be a problem. Metastores
>>>>>>>> that support both today store them in a single namespace, so this is not a
>>>>>>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>>>>>>> know that a new metastore catalog could choose to implement both
>>>>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>>>>> would be a very strange choice: if the metastore itself has different
>>>>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>>>>> them through separate catalogs because Spark will always prefer one over
>>>>>>>> the other.
>>>>>>>>
>>>>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>>>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>>>>>> other representation into a SQL view. It would make little sense to combine
>>>>>>>> that with some other TableCatalog.
>>>>>>>>
>>>>>>>> I also don’t think there is benefit from an API perspective to
>>>>>>>> justify combining the Table and View interfaces. The two share only schema
>>>>>>>> and properties, and are handled very differently internally — a View’s SQL
>>>>>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>>>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>>>>>> SQL also needs additional context to be resolved correctly: the current
>>>>>>>> catalog and namespace from the time the view was created.
>>>>>>>>
>>>>>>>> Query planning is distinct between tables and views, so Spark
>>>>>>>> doesn’t benefit from combining them. I think it has actually caused
>>>>>>>> problems that both were resolved by the same method in v1: the resolution
>>>>>>>> rule grew extremely complicated trying to look up a reference just once
>>>>>>>> because it had to parse a view plan and resolve relations within it using
>>>>>>>> the view’s context (current database). In contrast, John’s new view
>>>>>>>> substitution rules are cleaner and can stay within the substitution batch.
>>>>>>>>
>>>>>>>> People implementing views would also not benefit from combining the
>>>>>>>> two interfaces:
>>>>>>>>
>>>>>>>>    - There is little overlap between View and Table, only schema
>>>>>>>>    and properties
>>>>>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>>>>>    ViewOrTable is more difficult for implementations
>>>>>>>>    - TableCatalog assumes that ViewCatalog will be added
>>>>>>>>    separately like John proposes, so we would have to break or replace that API
>>>>>>>>
>>>>>>>> I understand the initial appeal of combining TableCatalog and
>>>>>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>>>>>> think that Hive chose to do that mostly on the fact that the two were
>>>>>>>> already stored together, and not because it made sense for users of the
>>>>>>>> API, or any other implementer of the API.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>>>
>>>>>>>>>> Correction: Spark adds a new Project at the top of the parsed
>>>>>>>>>> plan from view, based on the stored schema, to make sure the view schema
>>>>>>>>>> doesn't change.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>>>>>> validation done in *checkAnalysis*:
>>>>>>>>>
>>>>>>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>>>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>>>>>>           // If the view's child output can't up cast to the view output,
>>>>>>>>>           // throw an AnalysisException, too.
>>>>>>>>>
>>>>>>>>> The view output comes from the schema:
>>>>>>>>>
>>>>>>>>>       val child = View(
>>>>>>>>>         desc = metadata,
>>>>>>>>>         output = metadata.schema.toAttributes,
>>>>>>>>>         child = parser.parsePlan(viewText))
>>>>>>>>>
>>>>>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have
>>>>>>>>> but not "required" or "should be frozen". Thanks Ryan and Burak for
>>>>>>>>> pointing that out in SPIP. I will add a new paragraph accordingly.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
Yea let's move forward first. We can discuss the caching approach
and TableViewCatalog approach during the PR review.

On Tue, May 25, 2021 at 1:48 AM John Zhuge <jz...@apache.org> wrote:

> Hi everyone,
>
> Is there any more discussion before we start a vote on ViewCatalog? With
> FunctionCatalog merged, I hope this feature can complete the offerings of
> catalog plugins in 3.2.
>
> Once approved, I will refresh the WIP PR. Implementation details can be
> ironed out during review.
>
> Thanks,
>
> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> An extra RPC call is a concern for the catalog implementation. It is
>> simple to cache the result of a call to avoid a second one if the catalog
>> chooses.
>>
>> I don't think that an extra RPC that can be easily avoided is a
>> reasonable justification to add caches in Spark. For one thing, it doesn't
>> solve the problem because the proposed API still requires separate lookups
>> for tables and views.
>>
>> The only solution that would help is to use a combined trait, but that
>> has issues. For one, view substitution is much cleaner when it happens well
>> before table resolution. And, View and Table are very different objects;
>> returning Object from this API doesn't make much sense.
>>
>> One extra RPC is not unreasonable, and the choice should be left to
>> sources. That's the easiest place to cache results from the underlying
>> store.
>>
>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Moving back the discussion to this thread. The current argument is how
>>> to avoid extra RPC calls for catalogs supporting both table and view. There
>>> are several options:
>>> 1. ignore it as extra RPC calls are cheap compared to the query execution
>>> 2. have a per session cache for loaded table/view
>>> 3. have a per query cache for loaded table/view
>>> 4. add a new trait TableViewCatalog
>>>
>>> I think it's important to avoid perf regression with new APIs. RPC calls
>>> can be significant for short queries. We may also double the RPC
>>> traffic which is bad for the metastore service. Normally I would not
>>> recommend caching as cache invalidation is a hard problem. Personally I
>>> prefer option 4 as it only affects catalogs that support both table and
>>> view, and it fits the hive catalog very well.
>>>
>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> SPIP
>>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>>> has been updated. Please review.
>>>>
>>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>>
>>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Any updates here? I agree that a new View API is better, but we need
>>>>>> a solution to avoid performance regression. We need to elaborate on the
>>>>>> cache idea.
>>>>>>
>>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>>
>>>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>>>> function are the ones brought up in this thread. First, an identifier in a
>>>>>>> catalog must be either a view or a table and should not collide. Second, a
>>>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>>>> catalog, so I’ll primarily focus on the first.
>>>>>>>
>>>>>>> Table/view name collision is unlikely to be a problem. Metastores
>>>>>>> that support both today store them in a single namespace, so this is not a
>>>>>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>>>>>> know that a new metastore catalog could choose to implement both
>>>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>>>> would be a very strange choice: if the metastore itself has different
>>>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>>>> them through separate catalogs because Spark will always prefer one over
>>>>>>> the other.
>>>>>>>
>>>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>>>>> other representation into a SQL view. It would make little sense to combine
>>>>>>> that with some other TableCatalog.
>>>>>>>
>>>>>>> I also don’t think there is benefit from an API perspective to
>>>>>>> justify combining the Table and View interfaces. The two share only schema
>>>>>>> and properties, and are handled very differently internally — a View’s SQL
>>>>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>>>>> SQL also needs additional context to be resolved correctly: the current
>>>>>>> catalog and namespace from the time the view was created.
>>>>>>>
>>>>>>> Query planning is distinct between tables and views, so Spark
>>>>>>> doesn’t benefit from combining them. I think it has actually caused
>>>>>>> problems that both were resolved by the same method in v1: the resolution
>>>>>>> rule grew extremely complicated trying to look up a reference just once
>>>>>>> because it had to parse a view plan and resolve relations within it using
>>>>>>> the view’s context (current database). In contrast, John’s new view
>>>>>>> substitution rules are cleaner and can stay within the substitution batch.
>>>>>>>
>>>>>>> People implementing views would also not benefit from combining the
>>>>>>> two interfaces:
>>>>>>>
>>>>>>>    - There is little overlap between View and Table, only schema
>>>>>>>    and properties
>>>>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>>>>    ViewOrTable is more difficult for implementations
>>>>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>>>>    like John proposes, so we would have to break or replace that API
>>>>>>>
>>>>>>> I understand the initial appeal of combining TableCatalog and
>>>>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>>>>> think that Hive chose to do that mostly on the fact that the two were
>>>>>>> already stored together, and not because it made sense for users of the
>>>>>>> API, or any other implementer of the API.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>>
>>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>>>>>> change.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>>>>> validation done in *checkAnalysis*:
>>>>>>>>
>>>>>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>>>>>           // If the view's child output can't up cast to the view output,
>>>>>>>>           // throw an AnalysisException, too.
>>>>>>>>
>>>>>>>> The view output comes from the schema:
>>>>>>>>
>>>>>>>>       val child = View(
>>>>>>>>         desc = metadata,
>>>>>>>>         output = metadata.schema.toAttributes,
>>>>>>>>         child = parser.parsePlan(viewText))
>>>>>>>>
>>>>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have
>>>>>>>> but not "required" or "should be frozen". Thanks Ryan and Burak for
>>>>>>>> pointing that out in SPIP. I will add a new paragraph accordingly.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With
FunctionCatalog merged, I hope this feature can complete the offerings of
catalog plugins in 3.2.

Once approved, I will refresh the WIP PR. Implementation details can be
ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> An extra RPC call is a concern for the catalog implementation. It is
> simple to cache the result of a call to avoid a second one if the catalog
> chooses.
>
> I don't think that an extra RPC that can be easily avoided is a reasonable
> justification to add caches in Spark. For one thing, it doesn't solve the
> problem because the proposed API still requires separate lookups for tables
> and views.
>
> The only solution that would help is to use a combined trait, but that has
> issues. For one, view substitution is much cleaner when it happens well
> before table resolution. And, View and Table are very different objects;
> returning Object from this API doesn't make much sense.
>
> One extra RPC is not unreasonable, and the choice should be left to
> sources. That's the easiest place to cache results from the underlying
> store.
>
> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Moving back the discussion to this thread. The current argument is how to
>> avoid extra RPC calls for catalogs supporting both table and view. There
>> are several options:
>> 1. ignore it as extra RPC calls are cheap compared to the query execution
>> 2. have a per session cache for loaded table/view
>> 3. have a per query cache for loaded table/view
>> 4. add a new trait TableViewCatalog
>>
>> I think it's important to avoid perf regression with new APIs. RPC calls
>> can be significant for short queries. We may also double the RPC
>> traffic which is bad for the metastore service. Normally I would not
>> recommend caching as cache invalidation is a hard problem. Personally I
>> prefer option 4 as it only affects catalogs that support both table and
>> view, and it fits the hive catalog very well.
>>
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jz...@apache.org> wrote:
>>
>>> SPIP
>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>>> has been updated. Please review.
>>>
>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>
>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>>
>>>>> Any updates here? I agree that a new View API is better, but we need a
>>>>> solution to avoid performance regression. We need to elaborate on the cache
>>>>> idea.
>>>>>
>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>
>>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>>> function are the ones brought up in this thread. First, an identifier in a
>>>>>> catalog must be either a view or a table and should not collide. Second, a
>>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>>> catalog, so I’ll primarily focus on the first.
>>>>>>
>>>>>> Table/view name collision is unlikely to be a problem. Metastores
>>>>>> that support both today store them in a single namespace, so this is not a
>>>>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>>>>> know that a new metastore catalog could choose to implement both
>>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>>> would be a very strange choice: if the metastore itself has different
>>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>>> them through separate catalogs because Spark will always prefer one over
>>>>>> the other.
>>>>>>
>>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>>>> other representation into a SQL view. It would make little sense to combine
>>>>>> that with some other TableCatalog.
>>>>>>
>>>>>> I also don’t think there is benefit from an API perspective to
>>>>>> justify combining the Table and View interfaces. The two share only schema
>>>>>> and properties, and are handled very differently internally — a View’s SQL
>>>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>>>> SQL also needs additional context to be resolved correctly: the current
>>>>>> catalog and namespace from the time the view was created.
>>>>>>
>>>>>> Query planning is distinct between tables and views, so Spark doesn’t
>>>>>> benefit from combining them. I think it has actually caused problems that
>>>>>> both were resolved by the same method in v1: the resolution rule grew
>>>>>> extremely complicated trying to look up a reference just once because it
>>>>>> had to parse a view plan and resolve relations within it using the view’s
>>>>>> context (current database). In contrast, John’s new view substitution rules
>>>>>> are cleaner and can stay within the substitution batch.
>>>>>>
>>>>>> People implementing views would also not benefit from combining the
>>>>>> two interfaces:
>>>>>>
>>>>>>    - There is little overlap between View and Table, only schema and
>>>>>>    properties
>>>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>>>    ViewOrTable is more difficult for implementations
>>>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>>>    like John proposes, so we would have to break or replace that API
>>>>>>
>>>>>> I understand the initial appeal of combining TableCatalog and
>>>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>>>> think that Hive chose to do that mostly on the fact that the two were
>>>>>> already stored together, and not because it made sense for users of the
>>>>>> API, or any other implementer of the API.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>
>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>>>>> change.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>>>> validation done in *checkAnalysis*:
>>>>>>>
>>>>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>>>>           // If the view's child output can't up cast to the view output,
>>>>>>>           // throw an AnalysisException, too.
>>>>>>>
>>>>>>> The view output comes from the schema:
>>>>>>>
>>>>>>>       val child = View(
>>>>>>>         desc = metadata,
>>>>>>>         output = metadata.schema.toAttributes,
>>>>>>>         child = parser.parsePlan(viewText))
>>>>>>>
>>>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
An extra RPC call is a concern for the catalog implementation. It is simple
to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable
justification to add caches in Spark. For one thing, it doesn't solve the
problem because the proposed API still requires separate lookups for tables
and views.

The only solution that would help is to use a combined trait, but that has
issues. For one, view substitution is much cleaner when it happens well
before table resolution. And, View and Table are very different objects;
returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to
sources. That's the easiest place to cache results from the underlying
store.
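
As a rough sketch of caching on the catalog side, assuming a metastore that
returns either a table or a view from a single call: the names below (Entry,
TableEntry, ViewEntry, CountingMetastore, CachingCatalog) are hypothetical
stand-ins, not Spark APIs, and the cache has no expiry and is not
thread-safe. It only illustrates that a view lookup that misses, followed by
a table lookup for the same name, can cost one round trip.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for metastore results; not Spark classes.
sealed trait Entry
final case class TableEntry(name: String) extends Entry
final case class ViewEntry(name: String, sql: String) extends Entry

// Simulates a remote metastore and counts round trips.
final class CountingMetastore(entries: Map[String, Entry]) {
  var rpcCalls: Int = 0

  def fetch(name: String): Option[Entry] = {
    rpcCalls += 1
    entries.get(name)
  }
}

// Serves views and tables from the same store and remembers each fetch,
// so separate view and table lookups for one name share one round trip.
final class CachingCatalog(store: CountingMetastore) {
  private val cache = mutable.Map.empty[String, Option[Entry]]

  private def fetchOnce(name: String): Option[Entry] =
    cache.getOrElseUpdate(name, store.fetch(name))

  def loadView(name: String): Option[ViewEntry] =
    fetchOnce(name).collect { case v: ViewEntry => v }

  def loadTable(name: String): Option[TableEntry] =
    fetchOnce(name).collect { case t: TableEntry => t }

  // When to invalidate is left to the implementation.
  def invalidate(name: String): Unit = {
    cache.remove(name)
    ()
  }
}
```

When and how entries are invalidated is exactly the kind of decision that
stays with the source in this approach.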

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cl...@gmail.com> wrote:

> Moving back the discussion to this thread. The current argument is how to
> avoid extra RPC calls for catalogs supporting both table and view. There
> are several options:
> 1. ignore it as extra RPC calls are cheap compared to the query execution
> 2. have a per session cache for loaded table/view
> 3. have a per query cache for loaded table/view
> 4. add a new trait TableViewCatalog
>
> I think it's important to avoid perf regression with new APIs. RPC calls
> can be significant for short queries. We may also double the RPC
> traffic which is bad for the metastore service. Normally I would not
> recommend caching as cache invalidation is a hard problem. Personally I
> prefer option 4 as it only affects catalogs that support both table and
> view, and it fits the hive catalog very well.
>
> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jz...@apache.org> wrote:
>
>> SPIP
>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>> has been updated. Please review.
>>
>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>
>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> Any updates here? I agree that a new View API is better, but we need a
>>>> solution to avoid performance regression. We need to elaborate on the cache
>>>> idea.
>>>>
>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I think it is a good idea to keep tables and views separate.
>>>>>
>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>> function are the ones brought up in this thread. First, an identifier in a
>>>>> catalog must be either a view or a table and should not collide. Second, a
>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>> catalog, so I’ll primarily focus on the first.
>>>>>
>>>>> Table/view name collision is unlikely to be a problem. Metastores that
>>>>> support both today store them in a single namespace, so this is not a
>>>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>>>> know that a new metastore catalog could choose to implement both
>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>> would be a very strange choice: if the metastore itself has different
>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>> them through separate catalogs because Spark will always prefer one over
>>>>> the other.
>>>>>
>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>>> other representation into a SQL view. It would make little sense to combine
>>>>> that with some other TableCatalog.
>>>>>
>>>>> I also don’t think there is benefit from an API perspective to justify
>>>>> combining the Table and View interfaces. The two share only schema and
>>>>> properties, and are handled very differently internally — a View’s SQL
>>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>>> SQL also needs additional context to be resolved correctly: the current
>>>>> catalog and namespace from the time the view was created.
>>>>>
>>>>> Query planning is distinct between tables and views, so Spark doesn’t
>>>>> benefit from combining them. I think it has actually caused problems that
>>>>> both were resolved by the same method in v1: the resolution rule grew
>>>>> extremely complicated trying to look up a reference just once because it
>>>>> had to parse a view plan and resolve relations within it using the view’s
>>>>> context (current database). In contrast, John’s new view substitution rules
>>>>> are cleaner and can stay within the substitution batch.
>>>>>
>>>>> People implementing views would also not benefit from combining the
>>>>> two interfaces:
>>>>>
>>>>>    - There is little overlap between View and Table, only schema and
>>>>>    properties
>>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>>    ViewOrTable is more difficult for implementations
>>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>>    like John proposes, so we would have to break or replace that API
>>>>>
>>>>> I understand the initial appeal of combining TableCatalog and
>>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>>> think that Hive chose to do that mostly on the fact that the two were
>>>>> already stored together, and not because it made sense for users of the
>>>>> API, or any other implementer of the API.
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>
>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>>>> change.
>>>>>>>
>>>>>>
>>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>>> validation done in *checkAnalysis*:
>>>>>>
>>>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>>>           // If the view's child output can't up cast to the view output,
>>>>>>           // throw an AnalysisException, too.
>>>>>>
>>>>>> The view output comes from the schema:
>>>>>>
>>>>>>       val child = View(
>>>>>>         desc = metadata,
>>>>>>         output = metadata.schema.toAttributes,
>>>>>>         child = parser.parsePlan(viewText))
>>>>>>
>>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
Moving the discussion back to this thread. The open question is how to
avoid extra RPC calls for catalogs that support both tables and views. There
are several options:
1. ignore it, as extra RPC calls are cheap compared to the query execution
2. have a per-session cache for loaded tables/views
3. have a per-query cache for loaded tables/views
4. add a new trait TableViewCatalog

I think it's important to avoid perf regressions with new APIs. RPC calls
can be significant for short queries. We may also double the RPC
traffic, which is bad for the metastore service. Normally I would not
recommend caching, as cache invalidation is a hard problem. Personally I
prefer option 4 as it only affects catalogs that support both tables and
views, and it fits the Hive catalog very well.
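
To make option 4 concrete, here is a minimal sketch of what a combined
lookup could look like. All names below (Ident, CatalogEntry,
loadTableOrView, InMemoryCatalog) are hypothetical stand-ins for
illustration and are not Spark's actual connector interfaces:

```scala
// Hypothetical stand-ins; these are NOT Spark's actual connector interfaces.
final case class Ident(namespace: Seq[String], name: String)

sealed trait CatalogEntry
final case class TableEntry(ident: Ident, columns: Seq[String]) extends CatalogEntry
final case class ViewEntry(ident: Ident, sql: String) extends CatalogEntry

// Option 4: a single lookup that returns either kind of entry, so a catalog
// backed by one metastore needs only one round trip per identifier.
trait TableViewCatalog {
  def loadTableOrView(ident: Ident): Option[CatalogEntry]
}

// In-memory catalog that counts lookups to make the "RPC" cost visible.
final class InMemoryCatalog(entries: Map[Ident, CatalogEntry]) extends TableViewCatalog {
  var rpcCalls: Int = 0

  override def loadTableOrView(ident: Ident): Option[CatalogEntry] = {
    rpcCalls += 1
    entries.get(ident)
  }
}

object TableViewCatalogDemo {
  // Resolve an identifier with one call and dispatch on what came back.
  def describe(catalog: TableViewCatalog, ident: Ident): String =
    catalog.loadTableOrView(ident) match {
      case Some(TableEntry(_, cols)) =>
        val joined = cols.mkString(",")
        s"table($joined)"
      case Some(ViewEntry(_, sql)) => s"view($sql)"
      case None                    => "not found"
    }
}
```

Catalogs that expose only tables or only views would not need to implement
this trait, which is why the option only affects the combined case.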

On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jz...@apache.org> wrote:

> SPIP
> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
> has been updated. Please review.
>
> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:
>
>> Wenchen, sorry for the delay, I will post an update shortly.
>>
>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Any updates here? I agree that a new View API is better, but we need a
>>> solution to avoid performance regression. We need to elaborate on the cache
>>> idea.
>>>
>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I think it is a good idea to keep tables and views separate.
>>>>
>>>> The main two arguments I’ve heard for combining lookup into a single
>>>> function are the ones brought up in this thread. First, an identifier in a
>>>> catalog must be either a view or a table and should not collide. Second, a
>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>> concern is well addressed by caching, which we already do in the Spark
>>>> catalog, so I’ll primarily focus on the first.
>>>>
>>>> Table/view name collision is unlikely to be a problem. Metastores that
>>>> support both today store them in a single namespace, so this is not a
>>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>>> know that a new metastore catalog could choose to implement both
>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>> would be a very strange choice: if the metastore itself has different
>>>> namespaces for tables and views, then it makes much more sense to expose
>>>> them through separate catalogs because Spark will always prefer one over
>>>> the other.
>>>>
>>>> In a similar line of reasoning, catalogs that expose both views and
>>>> tables are much more rare than catalogs that only expose one. For example,
>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>> other representation into a SQL view. It would make little sense to combine
>>>> that with some other TableCatalog.
>>>>
>>>> I also don’t think there is benefit from an API perspective to justify
>>>> combining the Table and View interfaces. The two share only schema and
>>>> properties, and are handled very differently internally — a View’s SQL
>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>> SQL also needs additional context to be resolved correctly: the current
>>>> catalog and namespace from the time the view was created.
>>>>
>>>> Query planning is distinct between tables and views, so Spark doesn’t
>>>> benefit from combining them. I think it has actually caused problems that
>>>> both were resolved by the same method in v1: the resolution rule grew
>>>> extremely complicated trying to look up a reference just once because it
>>>> had to parse a view plan and resolve relations within it using the view’s
>>>> context (current database). In contrast, John’s new view substitution rules
>>>> are cleaner and can stay within the substitution batch.
>>>>
>>>> People implementing views would also not benefit from combining the two
>>>> interfaces:
>>>>
>>>>    - There is little overlap between View and Table, only schema and
>>>>    properties
>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>    ViewOrTable is more difficult for implementations
>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>    like John proposes, so we would have to break or replace that API
>>>>
>>>> I understand the initial appeal of combining TableCatalog and
>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>> think that Hive chose to do that mostly on the fact that the two were
>>>> already stored together, and not because it made sense for users of the
>>>> API, or any other implementer of the API.
>>>>
>>>> rb
>>>>
>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>
>>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>>> change.
>>>>>>
>>>>>
>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>> validation done in *checkAnalysis*:
>>>>>
>>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>>           // If the view's child output can't up cast to the view output,
>>>>>           // throw an AnalysisException, too.
>>>>>
>>>>> The view output comes from the schema:
>>>>>
>>>>>       val child = View(
>>>>>         desc = metadata,
>>>>>         output = metadata.schema.toAttributes,
>>>>>         child = parser.parsePlan(viewText))
>>>>>
>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> John Zhuge
>

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
has been updated. Please review.

On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jz...@apache.org> wrote:

> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Any updates here? I agree that a new View API is better, but we need a
>> solution to avoid performance regression. We need to elaborate on the cache
>> idea.
>>
>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I think it is a good idea to keep tables and views separate.
>>>
>>> The main two arguments I’ve heard for combining lookup into a single
>>> function are the ones brought up in this thread. First, an identifier in a
>>> catalog must be either a view or a table and should not collide. Second, a
>>> single lookup is more likely to require a single RPC. I think the RPC
>>> concern is well addressed by caching, which we already do in the Spark
>>> catalog, so I’ll primarily focus on the first.
>>>
>>> Table/view name collision is unlikely to be a problem. Metastores that
>>> support both today store them in a single namespace, so this is not a
>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>> know that a new metastore catalog could choose to implement both
>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>> would be a very strange choice: if the metastore itself has different
>>> namespaces for tables and views, then it makes much more sense to expose
>>> them through separate catalogs because Spark will always prefer one over
>>> the other.
>>>
>>> In a similar line of reasoning, catalogs that expose both views and
>>> tables are much more rare than catalogs that only expose one. For example,
>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>> and implementing ViewCatalog would make little sense. Exposing new data
>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>> likely to be the same. Say I have a way to convert Pig statements or some
>>> other representation into a SQL view. It would make little sense to combine
>>> that with some other TableCatalog.
>>>
>>> I also don’t think there is benefit from an API perspective to justify
>>> combining the Table and View interfaces. The two share only schema and
>>> properties, and are handled very differently internally — a View’s SQL
>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>> SQL also needs additional context to be resolved correctly: the current
>>> catalog and namespace from the time the view was created.
>>>
>>> Query planning is distinct between tables and views, so Spark doesn’t
>>> benefit from combining them. I think it has actually caused problems that
>>> both were resolved by the same method in v1: the resolution rule grew
>>> extremely complicated trying to look up a reference just once because it
>>> had to parse a view plan and resolve relations within it using the view’s
>>> context (current database). In contrast, John’s new view substitution rules
>>> are cleaner and can stay within the substitution batch.
>>>
>>> People implementing views would also not benefit from combining the two
>>> interfaces:
>>>
>>>    - There is little overlap between View and Table, only schema and
>>>    properties
>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>    ViewOrTable is more difficult for implementations
>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>    like John proposes, so we would have to break or replace that API
>>>
>>> I understand the initial appeal of combining TableCatalog and
>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>> think that Hive chose to do that mostly on the fact that the two were
>>> already stored together, and not because it made sense for users of the
>>> API, or any other implementer of the API.
>>>
>>> rb
>>>
>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>>
>>>>
>>>>
>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>
>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>> change.
>>>>>
>>>>
>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>> validation done in *checkAnalysis*:
>>>>
>>>>           // If the view output doesn't have the same number of columns neither with the child
>>>>           // output, nor with the query column names, throw an AnalysisException.
>>>>           // If the view's child output can't up cast to the view output,
>>>>           // throw an AnalysisException, too.
>>>>
>>>> The view output comes from the schema:
>>>>
>>>>       val child = View(
>>>>         desc = metadata,
>>>>         output = metadata.schema.toAttributes,
>>>>         child = parser.parsePlan(viewText))
>>>>
>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> John Zhuge
>


-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Wenchen, sorry for the delay, I will post an update shortly.

On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cl...@gmail.com> wrote:

> Any updates here? I agree that a new View API is better, but we need a
> solution to avoid performance regression. We need to elaborate on the cache
> idea.
>
> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> I think it is a good idea to keep tables and views separate.
>>
>> The main two arguments I’ve heard for combining lookup into a single
>> function are the ones brought up in this thread. First, an identifier in a
>> catalog must be either a view or a table and should not collide. Second, a
>> single lookup is more likely to require a single RPC. I think the RPC
>> concern is well addressed by caching, which we already do in the Spark
>> catalog, so I’ll primarily focus on the first.
>>
>> Table/view name collision is unlikely to be a problem. Metastores that
>> support both today store them in a single namespace, so this is not a
>> concern for even a naive implementation that talks to the Hive MetaStore. I
>> know that a new metastore catalog could choose to implement both
>> ViewCatalog and TableCatalog and store the two sets separately, but that
>> would be a very strange choice: if the metastore itself has different
>> namespaces for tables and views, then it makes much more sense to expose
>> them through separate catalogs because Spark will always prefer one over
>> the other.
>>
>> In a similar line of reasoning, catalogs that expose both views and
>> tables are much more rare than catalogs that only expose one. For example,
>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>> and implementing ViewCatalog would make little sense. Exposing new data
>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>> likely to be the same. Say I have a way to convert Pig statements or some
>> other representation into a SQL view. It would make little sense to combine
>> that with some other TableCatalog.
>>
>> I also don’t think there is benefit from an API perspective to justify
>> combining the Table and View interfaces. The two share only schema and
>> properties, and are handled very differently internally — a View’s SQL
>> query is parsed and substituted into the plan, while a Table is wrapped in
>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>> SQL also needs additional context to be resolved correctly: the current
>> catalog and namespace from the time the view was created.
>>
>> Query planning is distinct between tables and views, so Spark doesn’t
>> benefit from combining them. I think it has actually caused problems that
>> both were resolved by the same method in v1: the resolution rule grew
>> extremely complicated trying to look up a reference just once because it
>> had to parse a view plan and resolve relations within it using the view’s
>> context (current database). In contrast, John’s new view substitution rules
>> are cleaner and can stay within the substitution batch.
>>
>> People implementing views would also not benefit from combining the two
>> interfaces:
>>
>>    - There is little overlap between View and Table, only schema and
>>    properties
>>    - Most catalogs won’t implement both interfaces, so returning a
>>    ViewOrTable is more difficult for implementations
>>    - TableCatalog assumes that ViewCatalog will be added separately like
>>    John proposes, so we would have to break or replace that API
>>
>> I understand the initial appeal of combining TableCatalog and ViewCatalog
>> since it is done that way in the existing interfaces. But I think that Hive
>> chose to do that mostly on the fact that the two were already stored
>> together, and not because it made sense for users of the API, or any other
>> implementer of the API.
>>
>> rb
>>
>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>>
>>>
>>>
>>>
>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>
>>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>>> view, based on the stored schema, to make sure the view schema doesn't
>>>> change.
>>>>
>>>
>>> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
>>> done in *checkAnalysis*:
>>>
>>>           // If the view output doesn't have the same number of columns neither with the child
>>>           // output, nor with the query column names, throw an AnalysisException.
>>>           // If the view's child output can't up cast to the view output,
>>>           // throw an AnalysisException, too.
>>>
>>> The view output comes from the schema:
>>>
>>>       val child = View(
>>>         desc = metadata,
>>>         output = metadata.schema.toAttributes,
>>>         child = parser.parsePlan(viewText))
>>>
>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but not
>>> "required" or "should be frozen". Thanks Ryan and Burak for pointing that
>>> out in SPIP. I will add a new paragraph accordingly.
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
Any updates here? I agree that a new View API is better, but we need a
solution to avoid performance regression. We need to elaborate on the cache
idea.

On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:

> I think it is a good idea to keep tables and views separate.
>
> The main two arguments I’ve heard for combining lookup into a single
> function are the ones brought up in this thread. First, an identifier in a
> catalog must be either a view or a table and should not collide. Second, a
> single lookup is more likely to require a single RPC. I think the RPC
> concern is well addressed by caching, which we already do in the Spark
> catalog, so I’ll primarily focus on the first.
>
> Table/view name collision is unlikely to be a problem. Metastores that
> support both today store them in a single namespace, so this is not a
> concern for even a naive implementation that talks to the Hive MetaStore. I
> know that a new metastore catalog could choose to implement both
> ViewCatalog and TableCatalog and store the two sets separately, but that
> would be a very strange choice: if the metastore itself has different
> namespaces for tables and views, then it makes much more sense to expose
> them through separate catalogs because Spark will always prefer one over
> the other.
>
> In a similar line of reasoning, catalogs that expose both views and tables
> are much more rare than catalogs that only expose one. For example, v2
> catalogs for JDBC and Cassandra expose data through the Table interface and
> implementing ViewCatalog would make little sense. Exposing new data sources
> to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely
> to be the same. Say I have a way to convert Pig statements or some other
> representation into a SQL view. It would make little sense to combine that
> with some other TableCatalog.
>
> I also don’t think there is benefit from an API perspective to justify
> combining the Table and View interfaces. The two share only schema and
> properties, and are handled very differently internally — a View’s SQL
> query is parsed and substituted into the plan, while a Table is wrapped in
> a relation that eventually becomes a Scan node using SupportsRead. A view’s
> SQL also needs additional context to be resolved correctly: the current
> catalog and namespace from the time the view was created.
>
> Query planning is distinct between tables and views, so Spark doesn’t
> benefit from combining them. I think it has actually caused problems that
> both were resolved by the same method in v1: the resolution rule grew
> extremely complicated trying to look up a reference just once because it
> had to parse a view plan and resolve relations within it using the view’s
> context (current database). In contrast, John’s new view substitution rules
> are cleaner and can stay within the substitution batch.
>
> People implementing views would also not benefit from combining the two
> interfaces:
>
>    - There is little overlap between View and Table, only schema and
>    properties
>    - Most catalogs won’t implement both interfaces, so returning a
>    ViewOrTable is more difficult for implementations
>    - TableCatalog assumes that ViewCatalog will be added separately like
>    John proposes, so we would have to break or replace that API
>
> I understand the initial appeal of combining TableCatalog and ViewCatalog
> since it is done that way in the existing interfaces. But I think that Hive
> chose to do that mostly on the fact that the two were already stored
> together, and not because it made sense for users of the API, or any other
> implementer of the API.
>
> rb
>
> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:
>
>>
>>
>>
>>> > AFAIK view schema is only used by DESCRIBE.
>>>
>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>> view, based on the stored schema, to make sure the view schema doesn't
>>> change.
>>>
>>
>> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
>> done in *checkAnalysis*:
>>
>>           // If the view output doesn't have the same number of columns neither with the child
>>           // output, nor with the query column names, throw an AnalysisException.
>>           // If the view's child output can't up cast to the view output,
>>           // throw an AnalysisException, too.
>>
>> The view output comes from the schema:
>>
>>       val child = View(
>>         desc = metadata,
>>         output = metadata.schema.toAttributes,
>>         child = parser.parsePlan(viewText))
>>
>> So it is a validation (here) or cache (in DESCRIBE) nice to have but not
>> "required" or "should be frozen". Thanks Ryan and Burak for pointing that
>> out in SPIP. I will add a new paragraph accordingly.
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: SPIP: Catalog API for view metadata

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I think it is a good idea to keep tables and views separate.

The two main arguments I’ve heard for combining lookup into a single
function are the ones brought up in this thread. First, an identifier in a
catalog must be either a view or a table and should not collide. Second, a
single lookup is more likely to require a single RPC. I think the RPC
concern is well addressed by caching, which we already do in the Spark
catalog, so I’ll primarily focus on the first.
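
As a rough illustration of the caching idea, the sketch below shows a
catalog that serves both view and table lookups from one cached metastore
fetch. MetastoreObject, CachingCatalog, and the fetch function are
hypothetical names made up for this example, not an existing Spark API:

```scala
import scala.collection.mutable

// Hypothetical metastore record: one namespace holds both tables and views.
final case class MetastoreObject(name: String, isView: Boolean, viewSql: Option[String])

// Hypothetical catalog that answers loadView and loadTable from one cached fetch.
final class CachingCatalog(fetch: String => Option[MetastoreObject]) {
  private val cache = mutable.Map.empty[String, Option[MetastoreObject]]

  // Misses are cached as None too, so a failed lookup is not repeated remotely.
  private def lookup(name: String): Option[MetastoreObject] =
    cache.getOrElseUpdate(name, fetch(name))

  def loadView(name: String): Option[MetastoreObject] = lookup(name).filter(_.isView)

  def loadTable(name: String): Option[MetastoreObject] = lookup(name).filter(!_.isView)

  // The caller clears the cache at a query (or session) boundary.
  def invalidate(): Unit = cache.clear()
}
```

With this shape, asking for a view and then for a table with the same name
costs one remote call rather than two, at the price of having to decide
when to invalidate.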

Table/view name collision is unlikely to be a problem. Metastores that
support both today store them in a single namespace, so this is not a
concern for even a naive implementation that talks to the Hive MetaStore. I
know that a new metastore catalog could choose to implement both
ViewCatalog and TableCatalog and store the two sets separately, but that
would be a very strange choice: if the metastore itself has different
namespaces for tables and views, then it makes much more sense to expose
them through separate catalogs because Spark will always prefer one over
the other.

In a similar line of reasoning, catalogs that expose both views and tables
are much rarer than catalogs that only expose one. For example, v2
catalogs for JDBC and Cassandra expose data through the Table interface and
implementing ViewCatalog would make little sense. Exposing new data sources
to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely
to be the same. Say I have a way to convert Pig statements or some other
representation into a SQL view. It would make little sense to combine that
with some other TableCatalog.

I also don’t think there is benefit from an API perspective to justify
combining the Table and View interfaces. The two share only schema and
properties, and are handled very differently internally — a View’s SQL
query is parsed and substituted into the plan, while a Table is wrapped in
a relation that eventually becomes a Scan node using SupportsRead. A view’s
SQL also needs additional context to be resolved correctly: the current
catalog and namespace from the time the view was created.
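
To illustrate why that extra context matters, here is a small sketch,
assuming a hypothetical ViewMetadata class and a deliberately simplified
qualification rule (real multi-part name resolution is more involved):

```scala
// Hypothetical view metadata: schema and properties are shared with tables,
// while sql, currentCatalog and currentNamespace are view-specific.
final case class ViewMetadata(
    sql: String,
    schema: Seq[(String, String)],
    properties: Map[String, String],
    currentCatalog: String,
    currentNamespace: Seq[String])

object ViewContext {
  // Qualify a relation name found in the view's SQL using the context captured
  // when the view was created, not the context of the session reading it.
  // Simplification: only single-part names are qualified; others pass through.
  def qualify(view: ViewMetadata, nameParts: Seq[String]): Seq[String] =
    nameParts match {
      case Seq(name) => (view.currentCatalog +: view.currentNamespace) :+ name
      case parts     => parts
    }
}
```

A table has no equivalent of this step, since it carries no SQL text to
re-resolve.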

Query planning is distinct between tables and views, so Spark doesn’t
benefit from combining them. I think it has actually caused problems that
both were resolved by the same method in v1: the resolution rule grew
extremely complicated trying to look up a reference just once because it
had to parse a view plan and resolve relations within it using the view’s
context (current database). In contrast, John’s new view substitution rules
are cleaner and can stay within the substitution batch.

People implementing views would also not benefit from combining the two
interfaces:

   - There is little overlap between View and Table, only schema and
   properties
   - Most catalogs won’t implement both interfaces, so returning a
   ViewOrTable is more difficult for implementations
   - TableCatalog assumes that ViewCatalog will be added separately like
   John proposes, so we would have to break or replace that API

I understand the initial appeal of combining TableCatalog and ViewCatalog
since it is done that way in the existing interfaces. But I think that Hive
chose to do that mostly because the two were already stored
together, and not because it made sense for users of the API, or any other
implementer of the API.

rb

On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jz...@apache.org> wrote:

>
>
>
>> > AFAIK view schema is only used by DESCRIBE.
>>
>> Correction: Spark adds a new Project at the top of the parsed plan from
>> view, based on the stored schema, to make sure the view schema doesn't
>> change.
>>
>
> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
> done in *checkAnalysis*:
>
>           // If the view output doesn't have the same number of columns neither with the child
>           // output, nor with the query column names, throw an AnalysisException.
>           // If the view's child output can't up cast to the view output,
>           // throw an AnalysisException, too.
>
> The view output comes from the schema:
>
>       val child = View(
>         desc = metadata,
>         output = metadata.schema.toAttributes,
>         child = parser.parsePlan(viewText))
>
> So it is a validation (here) or cache (in DESCRIBE) nice to have but not
> "required" or "should be frozen". Thanks Ryan and Burak for pointing that
> out in SPIP. I will add a new paragraph accordingly.
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>

Thanks Wenchen! I thought I forgot something :) Yes it is the validation
done in *checkAnalysis*:

          // If the view output doesn't have the same number of columns neither with the child
          // output, nor with the query column names, throw an AnalysisException.
          // If the view's child output can't up cast to the view output,
          // throw an AnalysisException, too.

The view output comes from the schema:

      val child = View(
        desc = metadata,
        output = metadata.schema.toAttributes,
        child = parser.parsePlan(viewText))

So the stored schema serves as a validation (here) or a cache (in DESCRIBE):
nice to have, but not "required" and not something that "should be frozen".
Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new
paragraph accordingly.
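
For readers unfamiliar with that check, the following is a simplified
sketch of the validation described in the comment above. Column,
ViewSchemaCheck, and the tiny widening table are hypothetical and much
narrower than Spark's real up-cast rules:

```scala
// Hypothetical column description; types are plain strings for brevity.
final case class Column(name: String, dataType: String)

object ViewSchemaCheck {
  // Assumed, deliberately tiny widening table (not Spark's actual rules).
  private val widening: Map[String, Set[String]] = Map(
    "int"   -> Set("bigint"),
    "float" -> Set("double"))

  def canUpCast(from: String, to: String): Boolean =
    from == to || widening.getOrElse(from, Set.empty[String]).contains(to)

  // Compare the stored view schema with the output of the parsed view query:
  // the column counts must match and each child column must up cast to the
  // stored column type.
  def validate(stored: Seq[Column], child: Seq[Column]): Either[String, Unit] =
    if (stored.length != child.length) {
      Left(s"expected ${stored.length} columns but the view query produces ${child.length}")
    } else {
      val firstError = stored.zip(child).collectFirst {
        case (st, ch) if !canUpCast(ch.dataType, st.dataType) =>
          s"cannot up cast ${ch.name} from ${ch.dataType} to ${st.dataType}"
      }
      firstError match {
        case Some(err) => Left(err)
        case None      => Right(())
      }
    }
}
```

The stored schema acts only as an expectation here; the view's SQL text
remains the source of truth for what the query returns.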

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
> AFAIK view schema is only used by DESCRIBE.

Correction: Spark adds a new Project at the top of the parsed plan from
view, based on the stored schema, to make sure the view schema doesn't
change.

Can you update your doc to incorporate the cache idea? Let's make sure we
don't have perf issues if we go with the new View API.

On Tue, Aug 18, 2020 at 4:25 PM John Zhuge <jz...@apache.org> wrote:

> Thanks Burak and Walaa for the feedback!
>
> Here are my perspectives:
>
> We shouldn't be persisting things like the schema for a view
>
>
> This is not related to which option to choose because existing code
> persists schema as well.
> When resolving the view, the analyzer always parses the view sql text, it
> does not use the schema.
>
>> AFAIK view schema is only used by DESCRIBE.
>
>
>> Why not use TableCatalog.loadTable to load both tables and views
>>
> Also, views can be defined on top of either other views or base tables, so
>> the less divergence in code paths between views and tables the better.
>
>
> Existing Spark takes this approach and there are quite a few checks like
> "tableType == CatalogTableType.VIEW".
> View and table metadata surprisingly have very little in common, thus I'd
> like to group view related code together, separate from table processing.
> Views are much closer to CTEs. SPIP proposed a new rule ViewSubstitution
> in the same "Substitution" batch as CTESubstitution.
>
> This way you avoid multiple RPCs to a catalog or data source or metastore,
>> and you avoid namespace/name conflicts. Also you make yourself less
>> susceptible to race conditions (which still inherently exist).
>>
>
> Valid concern. Can be mitigated by caching RPC calls in the catalog
> implementation. The window for race condition can also be narrowed
> significantly but not totally eliminated.
>
>
> On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <
> wa.moustafa@gmail.com> wrote:
>
>> Wenchen, agreed with what you said. I was referring to situations where
>> the underlying table schema evolves (say by introducing a nested field in a
>> Struct), and also what you mentioned in cases of SELECT *. The Hive
>> metastore handling of those does not automatically update view schema (even
>> though executing the view in Hive results in data that has the most recent
>> schema when underlying tables evolve -- so newly added nested field data
>> shows up in the view evaluation query result but not in the view schema).
>>
>> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> View should have a fixed schema like a table. It should either be
>>> inferred from the query when creating the view, or be specified by the user
>>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter
>>> view schema manually.
>>>
>>> Basically a view is just a named SQL query, which mostly has fixed
>>> schema unless you do something like SELECT *.
>>>
>>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>>> wa.moustafa@gmail.com> wrote:
>>>
>>>> +1 to making views as special forms of tables. Sometimes a table can be
>>>> converted to a view to hide some of the implementation details while not
>>>> impacting readers (provided that the write path is controlled). Also, views
>>>> can be defined on top of either other views or base tables, so the less
>>>> divergence in code paths between views and tables the better.
>>>>
>>>> For whether to materialize view schema or infer it, one of the issues
>>>> we face with the HMS approach of materialization is that when the
>>>> underlying table schema evolves, HMS will still keep the view schema
>>>> unchanged. This causes a number of discrepancies that we address
>>>> out-of-band (e.g., run separate pipeline to ensure view schema freshness,
>>>> or just re-derive it at read time (example derivation algorithm for
>>>> view Avro schema
>>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>>> )).
>>>>
>>>> Also regarding SupportsRead vs SupportWrite, some views can be
>>>> updateable (example from MySQL
>>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>>> also implementing that requires a few concepts that are more prominent in
>>>> an RDBMS.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:
>>>>
>>>>> My high level comment here is that as a naive person, I would expect a
>>>>> View to be a special form of Table that SupportsRead but doesn't
>>>>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>>>> less susceptible to race conditions (which still inherently exist).
>>>>>
>>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>>>> schema for a view.
>>>>>
>>>>> What do people think of making Views a special form of Table?
>>>>>
>>>>> Best,
>>>>> Burak
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Thanks Ryan.
>>>>>>
>>>>>> ViewCatalog API mimics TableCatalog API including how shared
>>>>>> namespace is handled:
>>>>>>
>>>>>>    - The doc for createView
>>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>>>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>>>>    exists for the identifier."
>>>>>>    - The doc for loadView
>>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>>>>>    "If the catalog supports tables and contains a table for the identifier and
>>>>>>    not a view, this must throw NoSuchViewException."
>>>>>>
>>>>>> Agree it is good to explicitly specify the order of resolution. I
>>>>>> will add a section in ViewCatalog javadoc to summarize the behavior for
>>>>>> "shared namespace". The loadView doc will also be updated to spell out the
>>>>>> order of resolution.
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>>>> behavior. For example, I think that we would agree that CREATE VIEW
>>>>>>> catalog.schema.name should fail when there is a table named
>>>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>>>> documentation for the TableCatalog API
>>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>>>> where create should fail if a view exists for the identifier.
>>>>>>>
>>>>>>> I think it was simply assumed that we would use the same approach —
>>>>>>> the API requires that table and view names share a namespace. But it would
>>>>>>> be good to specifically note either the order in which resolution will
>>>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>>>
>>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Wenchen,
>>>>>>>>
>>>>>>>> Thanks for the feedback!
>>>>>>>>
>>>>>>>> 1. Add a new View API. How to avoid name conflicts between table
>>>>>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>>>>>> view catalog?
>>>>>>>>
>>>>>>>>
>>>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>>>
>>>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>>>>>    "dual" catalog.
>>>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>>>       name exists should fail.
>>>>>>>>       -  Creating a table in table catalog when a view of the same
>>>>>>>>       name exists should fail as well.
>>>>>>>>
>>>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>>>> notes:
>>>>>>>>
>>>>>>>>    - We actually started a common view prototype using the single
>>>>>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>>>>>    them in table properties became not manageable, especially for the feature
>>>>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>>>>>    - We'd like to move away from Hive metastore
>>>>>>>>
>>>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>>>> Motivation".
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> John
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>>>> catalog plugin API.
>>>>>>>>>
>>>>>>>>> After reading your doc, I have one high-level question: should
>>>>>>>>> view be a separated API or it's just a special type of table?
>>>>>>>>>
>>>>>>>>> AFAIK in most databases, tables and views share the same
>>>>>>>>> namespace. You can't create a view if a same-name table exists. In Hive,
>>>>>>>>> view is just a special type of table, so they are in the same namespace
>>>>>>>>> naturally. If we have both table catalog and view catalog, we need a
>>>>>>>>> mechanism to make sure there are no name conflicts.
>>>>>>>>>
>>>>>>>>> On the other hand, the view metadata is very simple that can be
>>>>>>>>> put in table properties. I'd like to see more thoughts to evaluate these 2
>>>>>>>>> approaches:
>>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between
>>>>>>>>> table and view? When resolving relation, shall we lookup table catalog
>>>>>>>>> first or view catalog?
>>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we
>>>>>>>>> do want to store table and views separately?
>>>>>>>>>
>>>>>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>
>>>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature Expectation (Dec.
>>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>>>
>>>>>>>>>> View catalog builds on top of the catalog plugin system
>>>>>>>>>> introduced in DataSourceV2. It adds the “ViewCatalog” API to load, create,
>>>>>>>>>> alter, and drop views. A catalog plugin can naturally implement both
>>>>>>>>>> ViewCatalog and TableCatalog.
>>>>>>>>>>
>>>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>>>> months. Recently we extended it to support materialized views, for the read
>>>>>>>>>> path initially.
>>>>>>>>>>
>>>>>>>>>> The PR has conflicts that I will resolve shortly.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>>>
>>>>>>>>>>> Document:
>>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>>>
>>>>>>>>>>> As part of a project to support common views across query
>>>>>>>>>>> engines like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>>>>> implementation. The project has been in production over three months.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> John Zhuge
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>
> --
> John Zhuge
>

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Thanks Burak and Walaa for the feedback!

Here are my perspectives:

> We shouldn't be persisting things like the schema for a view

This is not related to which option to choose, because the existing code
persists the schema as well.
When resolving the view, the analyzer always parses the view SQL text; it
does not use the schema.
AFAIK view schema is only used by DESCRIBE.


> Why not use TableCatalog.loadTable to load both tables and views
>
> Also, views can be defined on top of either other views or base tables, so
> the less divergence in code paths between views and tables the better.


Existing Spark takes this approach, and there are quite a few checks like
"tableType == CatalogTableType.VIEW".
View and table metadata surprisingly have very little in common, so I'd
like to group view-related code together, separate from table processing.
Views are much closer to CTEs. The SPIP proposes a new rule, ViewSubstitution,
in the same "Substitution" batch as CTESubstitution.
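
A rough sketch of the separation being argued for, assuming a single "dual"
catalog that keeps views and tables in separate maps but in one shared
namespace. It is plain Python, not the proposed ViewCatalog API: the class and
exception names are hypothetical, and the view-first lookup order is the one
suggested earlier in this thread.

```python
class ViewAlreadyExistsError(Exception):
    pass


class TableAlreadyExistsError(Exception):
    pass


class NoSuchViewError(Exception):
    pass


class DualCatalog:
    """Toy catalog holding tables and views separately in one namespace."""

    def __init__(self):
        self._tables = {}  # identifier -> list of column names
        self._views = {}   # identifier -> view SQL text

    def create_table(self, ident, columns):
        # A table may not reuse the name of an existing table or view.
        if ident in self._tables or ident in self._views:
            raise TableAlreadyExistsError(ident)
        self._tables[ident] = columns

    def create_view(self, ident, sql):
        # A view may not reuse the name of an existing view or table.
        if ident in self._views or ident in self._tables:
            raise ViewAlreadyExistsError(ident)
        self._views[ident] = sql

    def load_view(self, ident):
        # A table with this identifier does not count as a view.
        if ident not in self._views:
            raise NoSuchViewError(ident)
        return self._views[ident]

    def resolve(self, ident):
        """Look up views first, then tables."""
        if ident in self._views:
            return ("view", self._views[ident])
        if ident in self._tables:
            return ("table", self._tables[ident])
        raise KeyError(ident)


catalog = DualCatalog()
catalog.create_table("db.t", ["id"])
catalog.create_view("db.v", "SELECT id FROM db.t")
```

With view code grouped like this, the table path never has to ask whether an
entry is "really" a view; the only shared concern is the name check at create
time.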

> This way you avoid multiple RPCs to a catalog or data source or metastore,
> and you avoid namespace/name conflicts. Also you make yourself less
> susceptible to race conditions (which still inherently exist).

Valid concern. It can be mitigated by caching RPC calls in the catalog
implementation. The window for race conditions can also be narrowed
significantly, but not totally eliminated.
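
To make the caching idea concrete, here is a minimal Python sketch, assuming a
hypothetical CachingViewLoader that wraps whatever call actually reaches the
metastore. None of these names come from Spark or from the SPIP; the backend
is just a dict standing in for the remote store.

```python
from typing import Callable, Dict, Optional


class CachingViewLoader:
    """Wraps a remote lookup and remembers results, including misses."""

    def __init__(self, remote_load: Callable[[str], Optional[str]]) -> None:
        self._remote_load = remote_load
        self._cache: Dict[str, Optional[str]] = {}
        self.rpc_count = 0  # number of lookups that actually reached the backend

    def load(self, ident: str) -> Optional[str]:
        if ident not in self._cache:
            self.rpc_count += 1
            self._cache[ident] = self._remote_load(ident)
        return self._cache[ident]

    def invalidate(self, ident: str) -> None:
        # Drop a cached entry so the next load goes back to the backend.
        self._cache.pop(ident, None)


backend = {"db.v": "SELECT 1"}  # stand-in for the remote metastore
loader = CachingViewLoader(backend.get)

loader.load("db.v")
loader.load("db.v")        # served from the cache, no second RPC
loader.load("db.missing")  # "not a view" is cached as well
loader.load("db.missing")
```

Caching the misses is what saves the extra round trip when an identifier turns
out to be a table. The price is that a cached entry can go stale until it is
invalidated, which is the part of the race window that remains.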


On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> Wenchen, agreed with what you said. I was referring to situations where
> the underlying table schema evolves (say by introducing a nested field in a
> Struct), and also what you mentioned in cases of SELECT *. The Hive
> metastore handling of those does not automatically update view schema (even
> though executing the view in Hive results in data that has the most recent
> schema when underlying tables evolve -- so newly added nested field data
> shows up in the view evaluation query result but not in the view schema).
>
> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> View should have a fixed schema like a table. It should either be
>> inferred from the query when creating the view, or be specified by the user
>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter
>> view schema manually.
>>
>> Basically a view is just a named SQL query, which mostly has fixed schema
>> unless you do something like SELECT *.
>>
>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> +1 to making views as special forms of tables. Sometimes a table can be
>>> converted to a view to hide some of the implementation details while not
>>> impacting readers (provided that the write path is controlled). Also, views
>>> can be defined on top of either other views or base tables, so the less
>>> divergence in code paths between views and tables the better.
>>>
>>> For whether to materialize view schema or infer it, one of the issues we
>>> face with the HMS approach of materialization is that when the underlying
>>> table schema evolves, HMS will still keep the view schema unchanged. This
>>> causes a number of discrepancies that we address out-of-band (e.g., run
>>> separate pipeline to ensure view schema freshness, or just re-derive it at
>>> read time (example derivation algorithm for view Avro schema
>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>> )).
>>>
>>> Also regarding SupportsRead vs SupportWrite, some views can be
>>> updateable (example from MySQL
>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>> also implementing that requires a few concepts that are more prominent in
>>> an RDBMS.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:
>>>
>>>> My high level comment here is that as a naive person, I would expect a
>>>> View to be a special form of Table that SupportsRead but doesn't
>>>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>>> less susceptible to race conditions (which still inherently exist).
>>>>
>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>>> schema for a view.
>>>>
>>>> What do people think of making Views a special form of Table?
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Thanks Ryan.
>>>>>
>>>>> ViewCatalog API mimics TableCatalog API including how shared namespace
>>>>> is handled:
>>>>>
>>>>>    - The doc for createView
>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>>>    exists for the identifier."
>>>>>    - The doc for loadView
>>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>>>>    "If the catalog supports tables and contains a table for the identifier and
>>>>>    not a view, this must throw NoSuchViewException."
>>>>>
>>>>> Agree it is good to explicitly specify the order of resolution. I will
>>>>> add a section in ViewCatalog javadoc to summarize the behavior for "shared
>>>>> namespace". The loadView doc will also be updated to spell out the order of
>>>>> resolution.
>>>>>
>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>>> behavior. For example, I think that we would agree that CREATE VIEW
>>>>>> catalog.schema.name should fail when there is a table named
>>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>>> documentation for the TableCatalog API
>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>>> where create should fail if a view exists for the identifier.
>>>>>>
>>>>>> I think it was simply assumed that we would use the same approach —
>>>>>> the API requires that table and view names share a namespace. But it would
>>>>>> be good to specifically note either the order in which resolution will
>>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Wenchen,
>>>>>>>
>>>>>>> Thanks for the feedback!
>>>>>>>
>>>>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>>>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>>>>>> catalog?
>>>>>>>
>>>>>>>
>>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>>
>>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>>>>    "dual" catalog.
>>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>>       name exists should fail.
>>>>>>>       -  Creating a table in table catalog when a view of the same
>>>>>>>       name exists should fail as well.
>>>>>>>
>>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>>> notes:
>>>>>>>
>>>>>>>    - We actually started a common view prototype using the single
>>>>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>>>>    them in table properties became not manageable, especially for the feature
>>>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>>>>    - We'd like to move away from Hive metastore
>>>>>>>
>>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>>> Motivation".
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John
>>>>>>>
>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>>> catalog plugin API.
>>>>>>>>
>>>>>>>> After reading your doc, I have one high-level question: should view
>>>>>>>> be a separated API or it's just a special type of table?
>>>>>>>>
>>>>>>>> AFAIK in most databases, tables and views share the same namespace.
>>>>>>>> You can't create a view if a same-name table exists. In Hive, view is just
>>>>>>>> a special type of table, so they are in the same namespace naturally. If we
>>>>>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>>>>>> there are no name conflicts.
>>>>>>>>
>>>>>>>> On the other hand, the view metadata is very simple that can be put
>>>>>>>> in table properties. I'd like to see more thoughts to evaluate these 2
>>>>>>>> approaches:
>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between table
>>>>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>>>>> view catalog?
>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we
>>>>>>>> do want to store table and views separately?
>>>>>>>>
>>>>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Spark devs,
>>>>>>>>>
>>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature Expectation (Dec.
>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>>
>>>>>>>>> View catalog builds on top of the catalog plugin system introduced
>>>>>>>>> in DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>>>>>> TableCatalog.
>>>>>>>>>
>>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>>> months. Recently we extended it to support materialized views, for the read
>>>>>>>>> path initially.
>>>>>>>>>
>>>>>>>>> The PR has conflicts that I will resolve shortly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>>
>>>>>>>>>> Document:
>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>>
>>>>>>>>>> As part of a project to support common views across query engines
>>>>>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>>>> implementation. The project has been in production over three months.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>

-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
Wenchen, agreed with what you said. I was referring to situations where the
underlying table schema evolves (say by introducing a nested field in a
Struct), and also what you mentioned in cases of SELECT *. The Hive
metastore handling of those does not automatically update view schema (even
though executing the view in Hive results in data that has the most recent
schema when underlying tables evolve -- so newly added nested field data
shows up in the view evaluation query result but not in the view schema).
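
A small Python sketch of the drift described above, assuming a view defined as
SELECT * over one base table. The function and the dotted field names are
hypothetical and only model the bookkeeping, not Hive or Spark behavior.

```python
from typing import List


def derive_select_star_schema(base_schema: List[str]) -> List[str]:
    """Schema a SELECT * view would produce over the given base schema."""
    return list(base_schema)


base_schema = ["id", "profile.name"]

# Captured once when the view is created, then stored with the view.
stored_view_schema = derive_select_star_schema(base_schema)

# The underlying table evolves: a nested field is added to the struct.
base_schema.append("profile.email")

# Re-deriving at read time sees the new field; the stored schema does not.
fresh_view_schema = derive_select_star_schema(base_schema)
missing_from_stored = [f for f in fresh_view_schema if f not in stored_view_schema]
```

The query result follows fresh_view_schema while the stored metadata still
says stored_view_schema, which is the mismatch that has to be reconciled
out-of-band or by re-deriving at read time.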

On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cl...@gmail.com> wrote:

> View should have a fixed schema like a table. It should either be inferred
> from the query when creating the view, or be specified by the user manually
> like CREATE VIEW v(a, b) AS SELECT.... Users can still alter view schema
> manually.
>
> Basically a view is just a named SQL query, which mostly has fixed schema
> unless you do something like SELECT *.
>
> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
> wa.moustafa@gmail.com> wrote:
>
>> +1 to making views as special forms of tables. Sometimes a table can be
>> converted to a view to hide some of the implementation details while not
>> impacting readers (provided that the write path is controlled). Also, views
>> can be defined on top of either other views or base tables, so the less
>> divergence in code paths between views and tables the better.
>>
>> For whether to materialize view schema or infer it, one of the issues we
>> face with the HMS approach of materialization is that when the underlying
>> table schema evolves, HMS will still keep the view schema unchanged. This
>> causes a number of discrepancies that we address out-of-band (e.g., run
>> separate pipeline to ensure view schema freshness, or just re-derive it at
>> read time (example derivation algorithm for view Avro schema
>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>> )).
>>
>> Also regarding SupportsRead vs SupportWrite, some views can be updateable
>> (example from MySQL
>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>> also implementing that requires a few concepts that are more prominent in
>> an RDBMS.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:
>>
>>> My high level comment here is that as a naive person, I would expect a
>>> View to be a special form of Table that SupportsRead but doesn't
>>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>> less susceptible to race conditions (which still inherently exist).
>>>
>>> In addition, I'm not a SQL expert, but I thought that views are
>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>> schema for a view.
>>>
>>> What do people think of making Views a special form of Table?
>>>
>>> Best,
>>> Burak
>>>
>>>
>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> ViewCatalog API mimics TableCatalog API including how shared namespace
>>>> is handled:
>>>>
>>>>    - The doc for createView
>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>>    exists for the identifier."
>>>>    - The doc for loadView
>>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>>>    "If the catalog supports tables and contains a table for the identifier and
>>>>    not a view, this must throw NoSuchViewException."
>>>>
>>>> Agree it is good to explicitly specify the order of resolution. I will
>>>> add a section in ViewCatalog javadoc to summarize the behavior for "shared
>>>> namespace". The loadView doc will also be updated to spell out the order of
>>>> resolution.
>>>>
>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>> behavior. For example, I think that we would agree that CREATE VIEW
>>>>> catalog.schema.name should fail when there is a table named
>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>> documentation for the TableCatalog API
>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>> where create should fail if a view exists for the identifier.
>>>>>
>>>>> I think it was simply assumed that we would use the same approach —
>>>>> the API requires that table and view names share a namespace. But it would
>>>>> be good to specifically note either the order in which resolution will
>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>
>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Hi Wenchen,
>>>>>>
>>>>>> Thanks for the feedback!
>>>>>>
>>>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>>>>> catalog?
>>>>>>
>>>>>>
>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>
>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>>>    "dual" catalog.
>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>       name exists should fail.
>>>>>>       -  Creating a table in table catalog when a view of the same
>>>>>>       name exists should fail as well.
>>>>>>
>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>> notes:
>>>>>>
>>>>>>    - We actually started a common view prototype using the single
>>>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>>>    them in table properties became not manageable, especially for the feature
>>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>>>    - We'd like to move away from Hive metastore
>>>>>>
>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>> Motivation".
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>> catalog plugin API.
>>>>>>>
>>>>>>> After reading your doc, I have one high-level question: should view
>>>>>>> be a separated API or it's just a special type of table?
>>>>>>>
>>>>>>> AFAIK in most databases, tables and views share the same namespace.
>>>>>>> You can't create a view if a same-name table exists. In Hive, view is just
>>>>>>> a special type of table, so they are in the same namespace naturally. If we
>>>>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>>>>> there are no name conflicts.
>>>>>>>
>>>>>>> On the other hand, the view metadata is very simple that can be put
>>>>>>> in table properties. I'd like to see more thoughts to evaluate these 2
>>>>>>> approaches:
>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between table
>>>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>>>> view catalog?
>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we
>>>>>>> do want to store table and views separately?
>>>>>>>
>>>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>>>
>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Spark devs,
>>>>>>>>
>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature Expectation (Dec.
>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>
>>>>>>>> View catalog builds on top of the catalog plugin system introduced
>>>>>>>> in DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>>>>> TableCatalog.
>>>>>>>>
>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>> months. Recently we extended it to support materialized views, for the read
>>>>>>>> path initially.
>>>>>>>>
>>>>>>>> The PR has conflicts that I will resolve shortly.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>
>>>>>>>>> Document:
>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>
>>>>>>>>> As part of a project to support common views across query engines
>>>>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>>> implementation. The project has been in production over three months.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> John Zhuge
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
A view should have a fixed schema, like a table. The schema should either be
inferred from the query when the view is created, or be specified manually by
the user, as in CREATE VIEW v(a, b) AS SELECT.... Users can still alter the
view schema manually.

Basically, a view is just a named SQL query, which mostly has a fixed schema
unless you do something like SELECT *.
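
As a rough illustration of the two options above (schema inferred from the
query, or given by an explicit column list), here is a minimal Python sketch.
It is not Spark code: ViewDefinition, create_view, and the query_columns
argument (standing in for the output columns of the analyzed query) are
hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ViewDefinition:
    """Hypothetical view record: the schema is fixed when the view is created."""
    name: str
    sql: str
    columns: Tuple[str, ...]


def create_view(name: str, sql: str, query_columns: Sequence[str],
                column_names: Optional[Sequence[str]] = None) -> ViewDefinition:
    # query_columns stands in for the output columns of the analyzed query.
    if column_names is None:
        # No column list given: infer the schema from the query.
        columns = tuple(query_columns)
    elif len(column_names) != len(query_columns):
        raise ValueError("column list does not match the query output")
    else:
        # Explicit column list, as in CREATE VIEW v(a, b) AS SELECT ...
        columns = tuple(column_names)
    return ViewDefinition(name, sql, columns)


inferred = create_view("v", "SELECT x, y FROM t", ["x", "y"])
explicit = create_view("v", "SELECT x, y FROM t", ["x", "y"], ["a", "b"])
```

Either way the column names are stored with the view, so readers see the same
schema until someone alters the view.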

On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> +1 to making views as special forms of tables. Sometimes a table can be
> converted to a view to hide some of the implementation details while not
> impacting readers (provided that the write path is controlled). Also, views
> can be defined on top of either other views or base tables, so the less
> divergence in code paths between views and tables the better.
>
> For whether to materialize view schema or infer it, one of the issues we
> face with the HMS approach of materialization is that when the underlying
> table schema evolves, HMS will still keep the view schema unchanged. This
> causes a number of discrepancies that we address out-of-band (e.g., run
> separate pipeline to ensure view schema freshness, or just re-derive it at
> read time (example derivation algorithm for view Avro schema
> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
> )).
>
> Also regarding SupportsRead vs SupportWrite, some views can be updateable
> (example from MySQL
> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but also
> implementing that requires a few concepts that are more prominent in an
> RDBMS.
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:
>
>> My high level comment here is that as a naive person, I would expect a
>> View to be a special form of Table that SupportsRead but doesn't
>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>> views. This way you avoid multiple RPCs to a catalog or data source or
>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>> less susceptible to race conditions (which still inherently exist).
>>
>> In addition, I'm not a SQL expert, but I thought that views are evaluated
>> at runtime, therefore we shouldn't be persisting things like the schema for
>> a view.
>>
>> What do people think of making Views a special form of Table?
>>
>> Best,
>> Burak
>>
>>
>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>>
>>> Thanks Ryan.
>>>
>>> ViewCatalog API mimics TableCatalog API including how shared namespace
>>> is handled:
>>>
>>>    - The doc for createView
>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>    exists for the identifier."
>>>    - The doc for loadView
>>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>>    "If the catalog supports tables and contains a table for the identifier and
>>>    not a view, this must throw NoSuchViewException."
>>>
>>> Agree it is good to explicitly specify the order of resolution. I will
>>> add a section in ViewCatalog javadoc to summarize the behavior for "shared
>>> namespace". The loadView doc will also be updated to spell out the order of
>>> resolution.
>>>
>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I agree with Wenchen that we need to be clear about resolution and
>>>> behavior. For example, I think that we would agree that CREATE VIEW
>>>> catalog.schema.name should fail when there is a table named
>>>> catalog.schema.name. We’ve already included this behavior in the
>>>> documentation for the TableCatalog API
>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>> where create should fail if a view exists for the identifier.
>>>>
>>>> I think it was simply assumed that we would use the same approach — the
>>>> API requires that table and view names share a namespace. But it would be
>>>> good to specifically note either the order in which resolution will happen
>>>> (views are resolved first) or note that it is not allowed and behavior is
>>>> not guaranteed. I prefer the first option.
>>>>
>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi Wenchen,
>>>>>
>>>>> Thanks for the feedback!
>>>>>
>>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>>>> catalog?
>>>>>
>>>>>
>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>
>>>>>    - The proposed new view substitution rule and the changes to
>>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>>    "dual" catalog.
>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>       name exists should fail.
>>>>>       -  Creating a table in table catalog when a view of the same
>>>>>       name exists should fail as well.
>>>>>
>>>>> Agree with you that a new View API is more flexible. A couple of notes:
>>>>>
>>>>>    - We actually started a common view prototype using the single
>>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>>    them in table properties became not manageable, especially for the feature
>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>>    - We'd like to move away from Hive metastore
>>>>>
>>>>> For more details and discussion, see SPIP section "Background and
>>>>> Motivation".
>>>>>
>>>>> Thanks,
>>>>> John
>>>>>
>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> Thanks for working on this! View support is very important to the
>>>>>> catalog plugin API.
>>>>>>
>>>>>> After reading your doc, I have one high-level question: should view
>>>>>> be a separated API or it's just a special type of table?
>>>>>>
>>>>>> AFAIK in most databases, tables and views share the same namespace.
>>>>>> You can't create a view if a same-name table exists. In Hive, view is just
>>>>>> a special type of table, so they are in the same namespace naturally. If we
>>>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>>>> there are no name conflicts.
>>>>>>
>>>>>> On the other hand, the view metadata is very simple that can be put
>>>>>> in table properties. I'd like to see more thoughts to evaluate these 2
>>>>>> approaches:
>>>>>> 1. *Add a new View API*. How to avoid name conflicts between table
>>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>>> view catalog?
>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>>>>> want to store table and views separately?
>>>>>>
>>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated
>>>>>>> in the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this
>>>>>>> feature can be considered for 3.2 or even 3.1.
>>>>>>>
>>>>>>> View catalog builds on top of the catalog plugin system introduced
>>>>>>> in DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>>>> TableCatalog.
>>>>>>>
>>>>>>> Our internal implementation has been in production for over 8
>>>>>>> months. Recently we extended it to support materialized views, for the read
>>>>>>> path initially.
>>>>>>>
>>>>>>> The PR has conflicts that I will resolve them shortly.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>
>>>>>>>> Document:
>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>
>>>>>>>> As part of a project to support common views across query engines
>>>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>> implementation. The project has been in production over three months.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>

Re: SPIP: Catalog API for view metadata

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
+1 to making views a special form of table. Sometimes a table can be
converted to a view to hide some of the implementation details while not
impacting readers (provided that the write path is controlled). Also, views
can be defined on top of either other views or base tables, so the less
divergence in code paths between views and tables the better.

As for whether to materialize the view schema or infer it, one of the issues
we face with the HMS approach of materialization is that when the underlying
table schema evolves, HMS still keeps the view schema unchanged. This
causes a number of discrepancies that we address out-of-band, e.g., by running
a separate pipeline to ensure view schema freshness, or by re-deriving the
schema at read time (example derivation algorithm for view Avro schema
<https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
).
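
To make the staleness problem concrete, here is a small Python sketch. It is
only a toy model, not the HMS or Coral logic: derive_schema and is_stale are
hypothetical helpers, and a view is reduced to a select list over a single
table whose schema is a list of column names.

```python
from typing import List, Sequence


def derive_schema(select_list: Sequence[str], table_schema: Sequence[str]) -> List[str]:
    """Re-derive the view's columns, expanding '*' against the current table schema."""
    columns: List[str] = []
    for item in select_list:
        if item == "*":
            columns.extend(table_schema)
        else:
            columns.append(item)
    return columns


def is_stale(stored_schema: Sequence[str], select_list: Sequence[str],
             table_schema: Sequence[str]) -> bool:
    """True when the schema stored with the view no longer matches a fresh derivation."""
    return list(stored_schema) != derive_schema(select_list, table_schema)


table_v1 = ["id", "name"]
stored = derive_schema(["*"], table_v1)   # schema materialized at creation time
table_v2 = table_v1 + ["email"]           # the underlying table evolves

star_view_stale = is_stale(stored, ["*"], table_v2)
named_view_stale = is_stale(stored, ["id", "name"], table_v2)
```

In this toy model only the SELECT * view drifts when a column is added; a view
with an explicit select list keeps matching its stored schema.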

Also, regarding SupportsRead vs SupportsWrite, some views can be updatable
(example from MySQL
https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
implementing that requires a few concepts that are more prominent in an
RDBMS.

Thanks,
Walaa.


On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <br...@gmail.com> wrote:

> My high level comment here is that as a naive person, I would expect a
> View to be a special form of Table that SupportsRead but doesn't
> SupportWrite. loadTable in the TableCatalog API should load both tables and
> views. This way you avoid multiple RPCs to a catalog or data source or
> metastore, and you avoid namespace/name conflicts. Also you make yourself
> less susceptible to race conditions (which still inherently exist).
>
> In addition, I'm not a SQL expert, but I thought that views are evaluated
> at runtime, therefore we shouldn't be persisting things like the schema for
> a view.
>
> What do people think of making Views a special form of Table?
>
> Best,
> Burak
>
>
> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:
>
>> Thanks Ryan.
>>
>> ViewCatalog API mimics TableCatalog API including how shared namespace is
>> handled:
>>
>>    - The doc for createView
>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>>    "it will throw ViewAlreadyExistsException when a view or table already
>>    exists for the identifier."
>>    - The doc for loadView
>>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>>    "If the catalog supports tables and contains a table for the identifier and
>>    not a view, this must throw NoSuchViewException."
>>
>> Agree it is good to explicitly specify the order of resolution. I will
>> add a section in ViewCatalog javadoc to summarize the behavior for "shared
>> namespace". The loadView doc will also be updated to spell out the order of
>> resolution.
>>
>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I agree with Wenchen that we need to be clear about resolution and
>>> behavior. For example, I think that we would agree that CREATE VIEW
>>> catalog.schema.name should fail when there is a table named
>>> catalog.schema.name. We’ve already included this behavior in the
>>> documentation for the TableCatalog API
>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>> where create should fail if a view exists for the identifier.
>>>
>>> I think it was simply assumed that we would use the same approach — the
>>> API requires that table and view names share a namespace. But it would be
>>> good to specifically note either the order in which resolution will happen
>>> (views are resolved first) or note that it is not allowed and behavior is
>>> not guaranteed. I prefer the first option.
>>>
>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Hi Wenchen,
>>>>
>>>> Thanks for the feedback!
>>>>
>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>>> catalog?
>>>>
>>>>
>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>
>>>>    - The proposed new view substitution rule and the changes to
>>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>>    "dual" catalog.
>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>       -  Creating a view in view catalog when a table of the same name
>>>>       exists should fail.
>>>>       -  Creating a table in table catalog when a view of the same
>>>>       name exists should fail as well.
>>>>
>>>> Agree with you that a new View API is more flexible. A couple of notes:
>>>>
>>>>    - We actually started a common view prototype using the single
>>>>    catalog approach, but once we added more and more view metadata, storing
>>>>    them in table properties became not manageable, especially for the feature
>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>>    - We'd like to move away from Hive metastore
>>>>
>>>> For more details and discussion, see SPIP section "Background and
>>>> Motivation".
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> Thanks for working on this! View support is very important to the
>>>>> catalog plugin API.
>>>>>
>>>>> After reading your doc, I have one high-level question: should view be
>>>>> a separated API or it's just a special type of table?
>>>>>
>>>>> AFAIK in most databases, tables and views share the same namespace.
>>>>> You can't create a view if a same-name table exists. In Hive, view is just
>>>>> a special type of table, so they are in the same namespace naturally. If we
>>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>>> there are no name conflicts.
>>>>>
>>>>> On the other hand, the view metadata is very simple that can be put in
>>>>> table properties. I'd like to see more thoughts to evaluate these 2
>>>>> approaches:
>>>>> 1. *Add a new View API*. How to avoid name conflicts between table
>>>>> and view? When resolving relation, shall we lookup table catalog first or
>>>>> view catalog?
>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>>>> want to store table and views separately?
>>>>>
>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>
>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated
>>>>>> in the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this
>>>>>> feature can be considered for 3.2 or even 3.1.
>>>>>>
>>>>>> View catalog builds on top of the catalog plugin system introduced in
>>>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>>> TableCatalog.
>>>>>>
>>>>>> Our internal implementation has been in production for over 8 months.
>>>>>> Recently we extended it to support materialized views, for the read path
>>>>>> initially.
>>>>>>
>>>>>> The PR has conflicts that I will resolve them shortly.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>>> to load, create, alter, and drop views.
>>>>>>>
>>>>>>> Document:
>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>
>>>>>>> As part of a project to support common views across query engines
>>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>>> implementation. The project has been in production over three months.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>

Re: SPIP: Catalog API for view metadata

Posted by Burak Yavuz <br...@gmail.com>.
My high-level comment here is that, as a naive person, I would expect a View
to be a special form of Table that implements SupportsRead but not
SupportsWrite. loadTable in the TableCatalog API should load both tables and
views. This way you avoid multiple RPCs to a catalog, data source, or
metastore, and you avoid namespace/name conflicts. You also make yourself less
susceptible to race conditions (which still inherently exist).
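
The single-lookup idea can be sketched as follows. This is a Python toy model
under stated assumptions, not the Spark connector API: Relation, SingleCatalog,
load_table, and check_writable are hypothetical names, and capabilities are
plain strings rather than the SupportsRead/SupportsWrite mix-in interfaces.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional

READ, WRITE = "read", "write"


@dataclass(frozen=True)
class Relation:
    """Hypothetical catalog entry: a view is a read-only relation that carries SQL text."""
    name: str
    capabilities: FrozenSet[str]
    view_sql: Optional[str] = None


class SingleCatalog:
    def __init__(self) -> None:
        self._relations: Dict[str, Relation] = {}
        self.lookups = 0  # counts catalog round trips

    def register(self, relation: Relation) -> None:
        # One map for tables and views, so names cannot collide.
        if relation.name in self._relations:
            raise ValueError(f"{relation.name} already exists")
        self._relations[relation.name] = relation

    def load_table(self, name: str) -> Relation:
        # One lookup returns either a table or a view.
        self.lookups += 1
        return self._relations[name]


def check_writable(relation: Relation) -> None:
    if WRITE not in relation.capabilities:
        raise PermissionError(f"{relation.name} does not support writes")


catalog = SingleCatalog()
catalog.register(Relation("t", frozenset({READ, WRITE})))
catalog.register(Relation("v", frozenset({READ}), view_sql="SELECT * FROM t"))
loaded_view = catalog.load_table("v")
```

Because both kinds of relation live in one map, a name conflict is caught at
registration time and resolving a name costs a single lookup.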

In addition, I'm not a SQL expert, but I thought that views are evaluated
at runtime, so we shouldn't be persisting things like the schema for
a view.

What do people think of making Views a special form of Table?

Best,
Burak


On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jz...@apache.org> wrote:

> Thanks Ryan.
>
> ViewCatalog API mimics TableCatalog API including how shared namespace is
> handled:
>
>    - The doc for createView
>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> states
>    "it will throw ViewAlreadyExistsException when a view or table already
>    exists for the identifier."
>    - The doc for loadView
>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> states
>    "If the catalog supports tables and contains a table for the identifier and
>    not a view, this must throw NoSuchViewException."
>
> Agree it is good to explicitly specify the order of resolution. I will add
> a section in ViewCatalog javadoc to summarize the behavior for "shared
> namespace". The loadView doc will also be updated to spell out the order of
> resolution.
>
> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I agree with Wenchen that we need to be clear about resolution and
>> behavior. For example, I think that we would agree that CREATE VIEW
>> catalog.schema.name should fail when there is a table named
>> catalog.schema.name. We’ve already included this behavior in the
>> documentation for the TableCatalog API
>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>> where create should fail if a view exists for the identifier.
>>
>> I think it was simply assumed that we would use the same approach — the
>> API requires that table and view names share a namespace. But it would be
>> good to specifically note either the order in which resolution will happen
>> (views are resolved first) or note that it is not allowed and behavior is
>> not guaranteed. I prefer the first option.
>>
>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>>
>>> Hi Wenchen,
>>>
>>> Thanks for the feedback!
>>>
>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>> catalog?
>>>
>>>
>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>
>>>    - The proposed new view substitution rule and the changes to
>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>    "dual" catalog.
>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>       -  Creating a view in view catalog when a table of the same name
>>>       exists should fail.
>>>       -  Creating a table in table catalog when a view of the same name
>>>       exists should fail as well.
>>>
>>> Agree with you that a new View API is more flexible. A couple of notes:
>>>
>>>    - We actually started a common view prototype using the single
>>>    catalog approach, but once we added more and more view metadata, storing
>>>    them in table properties became not manageable, especially for the feature
>>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>>    - We'd like to move away from Hive metastore
>>>
>>> For more details and discussion, see SPIP section "Background and
>>> Motivation".
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com>
>>> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks for working on this! View support is very important to the
>>>> catalog plugin API.
>>>>
>>>> After reading your doc, I have one high-level question: should view be
>>>> a separated API or it's just a special type of table?
>>>>
>>>> AFAIK in most databases, tables and views share the same namespace. You
>>>> can't create a view if a same-name table exists. In Hive, view is just a
>>>> special type of table, so they are in the same namespace naturally. If we
>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>> there are no name conflicts.
>>>>
>>>> On the other hand, the view metadata is very simple that can be put in
>>>> table properties. I'd like to see more thoughts to evaluate these 2
>>>> approaches:
>>>> 1. *Add a new View API*. How to avoid name conflicts between table and
>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>> catalog?
>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>>> want to store table and views separately?
>>>>
>>>> I think a new View API is more flexible. I'd vote for it if we can come
>>>> up with a good mechanism to avoid name conflicts.
>>>>
>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated
>>>>> in the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this
>>>>> feature can be considered for 3.2 or even 3.1.
>>>>>
>>>>> View catalog builds on top of the catalog plugin system introduced in
>>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>> TableCatalog.
>>>>>
>>>>> Our internal implementation has been in production for over 8 months.
>>>>> Recently we extended it to support materialized views, for the read path
>>>>> initially.
>>>>>
>>>>> The PR has conflicts that I will resolve them shortly.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>> to load, create, alter, and drop views.
>>>>>>
>>>>>> Document:
>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>
>>>>>> As part of a project to support common views across query engines
>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>> implementation. The project has been in production over three months.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Thanks Ryan.

The ViewCatalog API mimics the TableCatalog API, including how the shared
namespace is handled:

   - The doc for createView
   <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
   states "it will throw ViewAlreadyExistsException when a view or table
   already exists for the identifier."
   - The doc for loadView
   <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
   states "If the catalog supports tables and contains a table for the
   identifier and not a view, this must throw NoSuchViewException."

I agree it is good to explicitly specify the order of resolution. I will add
a section to the ViewCatalog javadoc summarizing the behavior for a "shared
namespace". The loadView doc will also be updated to spell out the order of
resolution.
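
The shared-namespace behavior described above can be sketched in a few lines
of Python. This is a toy model, not the code in the PR: DualCatalog and its
methods are hypothetical, the exception classes are local Python stand-ins
named after the exceptions quoted from the docs, and the views-first order in
resolve is the order under discussion in this thread rather than settled
behavior.

```python
from typing import Dict, Tuple


class ViewAlreadyExistsException(Exception):
    pass


class TableAlreadyExistsException(Exception):
    pass


class NoSuchViewException(Exception):
    pass


class NoSuchTableException(Exception):
    pass


class DualCatalog:
    """Hypothetical catalog that serves both tables and views in one namespace."""

    def __init__(self) -> None:
        self._tables: Dict[str, str] = {}  # identifier -> schema description
        self._views: Dict[str, str] = {}   # identifier -> view SQL text

    def create_table(self, ident: str, schema: str) -> None:
        # Fails if a table or a view already uses the identifier.
        if ident in self._tables or ident in self._views:
            raise TableAlreadyExistsException(ident)
        self._tables[ident] = schema

    def create_view(self, ident: str, sql: str) -> None:
        # Fails if a view or a table already uses the identifier.
        if ident in self._views or ident in self._tables:
            raise ViewAlreadyExistsException(ident)
        self._views[ident] = sql

    def load_view(self, ident: str) -> str:
        # A table with this identifier is not a view, so it is "no such view".
        if ident not in self._views:
            raise NoSuchViewException(ident)
        return self._views[ident]

    def load_table(self, ident: str) -> str:
        if ident not in self._tables:
            raise NoSuchTableException(ident)
        return self._tables[ident]

    def resolve(self, ident: str) -> Tuple[str, str]:
        # Assumed resolution order: views first, then tables.
        try:
            return ("view", self.load_view(ident))
        except NoSuchViewException:
            return ("table", self.load_table(ident))


catalog = DualCatalog()
catalog.create_table("db.t", "id INT")
catalog.create_view("db.v", "SELECT id FROM db.t")
```

With both checks in place an identifier can never name a table and a view at
the same time, so the lookup order only affects how many lookups are made, not
which object is found.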

On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I agree with Wenchen that we need to be clear about resolution and
> behavior. For example, I think that we would agree that CREATE VIEW
> catalog.schema.name should fail when there is a table named
> catalog.schema.name. We’ve already included this behavior in the
> documentation for the TableCatalog API
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
> where create should fail if a view exists for the identifier.
>
> I think it was simply assumed that we would use the same approach — the
> API requires that table and view names share a namespace. But it would be
> good to specifically note either the order in which resolution will happen
> (views are resolved first) or note that it is not allowed and behavior is
> not guaranteed. I prefer the first option.
>
> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:
>
>> Hi Wenchen,
>>
>> Thanks for the feedback!
>>
>> 1. Add a new View API. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>
>>
>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>
>>    - The proposed new view substitution rule and the changes to
>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>    "dual" catalog.
>>    - The implementation for a "dual" catalog plugin should ensure:
>>       -  Creating a view in view catalog when a table of the same name
>>       exists should fail.
>>       -  Creating a table in table catalog when a view of the same name
>>       exists should fail as well.
>>
>> Agree with you that a new View API is more flexible. A couple of notes:
>>
>>    - We actually started a common view prototype using the single
>>    catalog approach, but once we added more and more view metadata, storing
>>    them in table properties became not manageable, especially for the feature
>>    like "versioning". Eventually we opted for a view backend of S3 JSON files.
>>    - We'd like to move away from Hive metastore
>>
>> For more details and discussion, see SPIP section "Background and
>> Motivation".
>>
>> Thanks,
>> John
>>
>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> Hi John,
>>>
>>> Thanks for working on this! View support is very important to the
>>> catalog plugin API.
>>>
>>> After reading your doc, I have one high-level question: should view be a
>>> separated API or it's just a special type of table?
>>>
>>> AFAIK in most databases, tables and views share the same namespace. You
>>> can't create a view if a same-name table exists. In Hive, view is just a
>>> special type of table, so they are in the same namespace naturally. If we
>>> have both table catalog and view catalog, we need a mechanism to make sure
>>> there are no name conflicts.
>>>
>>> On the other hand, the view metadata is very simple that can be put in
>>> table properties. I'd like to see more thoughts to evaluate these 2
>>> approaches:
>>> 1. *Add a new View API*. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>> want to store table and views separately?
>>>
>>> I think a new View API is more flexible. I'd vote for it if we can come
>>> up with a good mechanism to avoid name conflicts.
>>>
>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Hi Spark devs,
>>>>
>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
>>>> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
>>>> can be considered for 3.2 or even 3.1.
>>>>
>>>> View catalog builds on top of the catalog plugin system introduced in
>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>> TableCatalog.
>>>>
>>>> Our internal implementation has been in production for over 8 months.
>>>> Recently we extended it to support materialized views, for the read path
>>>> initially.
>>>>
>>>> The PR has conflicts that I will resolve them shortly.
>>>>
>>>> Thanks,
>>>>
>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> In order to disassociate view metadata from Hive Metastore and support
>>>>> different storage backends, I am proposing a new view catalog API to load,
>>>>> create, alter, and drop views.
>>>>>
>>>>> Document:
>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>
>>>>> As part of a project to support common views across query engines like
>>>>> Spark and Presto, my team used the view catalog API in Spark
>>>>> implementation. The project has been in production over three months.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
>>>>>
>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree with Wenchen that we need to be clear about resolution and
behavior. For example, I think that we would agree that CREATE VIEW
catalog.schema.name should fail when there is a table named
catalog.schema.name. We’ve already included this behavior in the
documentation for the TableCatalog API
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
where create should fail if a view exists for the identifier.

I think it was simply assumed that we would use the same approach: the API
requires that table and view names share a namespace. But it would be good
to specifically note either the order in which resolution will happen
(views are resolved first) or that a table and a view with the same name
are not allowed and that behavior in that case is not guaranteed. I prefer
the first option.
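
As a rough illustration of the first option, the sketch below resolves an
identifier against a view store before a table store. It is a toy Python
model, not Spark's analyzer or catalog API; the names `views`, `tables`, and
`resolve`, and the sample entries, are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Toy stand-ins for a catalog's view and table stores (hypothetical names,
# not Spark's ViewCatalog/TableCatalog interfaces).
views: Dict[str, str] = {"db.sales_summary": "SELECT region FROM db.sales"}
tables: Dict[str, str] = {"db.sales": "parquet"}


def resolve(identifier: str) -> Optional[Tuple[str, str]]:
    """Resolve an identifier, checking views before tables."""
    if identifier in views:
        return ("view", views[identifier])
    if identifier in tables:
        return ("table", tables[identifier])
    return None  # neither a view nor a table with this name
```

With a fixed order like this, a name that somehow exists in both stores
always resolves to the view, so the behavior is at least deterministic.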

On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jz...@apache.org> wrote:

> Hi Wenchen,
>
> Thanks for the feedback!
>
> 1. Add a new View API. How to avoid name conflicts between table and view?
>> When resolving relation, shall we lookup table catalog first or view
>> catalog?
>
>
>  See clarification in SPIP section "Proposed Changes - Namespace":
>
>    - The proposed new view substitution rule and the changes to
>    ResolveCatalogs should ensure the view catalog is looked up first for a
>    "dual" catalog.
>    - The implementation for a "dual" catalog plugin should ensure:
>       -  Creating a view in view catalog when a table of the same name
>       exists should fail.
>       -  Creating a table in table catalog when a view of the same name
>       exists should fail as well.
>
> Agree with you that a new View API is more flexible. A couple of notes:
>
>    - We actually started a common view prototype using the single catalog
>    approach, but once we added more and more view metadata, storing them in
>    table properties became not manageable, especially for the feature like
>    "versioning". Eventually we opted for a view backend of S3 JSON files.
>    - We'd like to move away from Hive metastore
>
> For more details and discussion, see SPIP section "Background and
> Motivation".
>
> Thanks,
> John
>
> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> Hi John,
>>
>> Thanks for working on this! View support is very important to the catalog
>> plugin API.
>>
>> After reading your doc, I have one high-level question: should view be a
>> separated API or it's just a special type of table?
>>
>> AFAIK in most databases, tables and views share the same namespace. You
>> can't create a view if a same-name table exists. In Hive, view is just a
>> special type of table, so they are in the same namespace naturally. If we
>> have both table catalog and view catalog, we need a mechanism to make sure
>> there are no name conflicts.
>>
>> On the other hand, the view metadata is very simple that can be put in
>> table properties. I'd like to see more thoughts to evaluate these 2
>> approaches:
>> 1. *Add a new View API*. How to avoid name conflicts between table and
>> view? When resolving relation, shall we lookup table catalog first or view
>> catalog?
>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>> want to store table and views separately?
>>
>> I think a new View API is more flexible. I'd vote for it if we can come
>> up with a good mechanism to avoid name conflicts.
>>
>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Hi Spark devs,
>>>
>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
>>> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
>>> can be considered for 3.2 or even 3.1.
>>>
>>> View catalog builds on top of the catalog plugin system introduced in
>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>> TableCatalog.
>>>
>>> Our internal implementation has been in production for over 8 months.
>>> Recently we extended it to support materialized views, for the read path
>>> initially.
>>>
>>> The PR has conflicts that I will resolve them shortly.
>>>
>>> Thanks,
>>>
>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> In order to disassociate view metadata from Hive Metastore and support
>>>> different storage backends, I am proposing a new view catalog API to load,
>>>> create, alter, and drop views.
>>>>
>>>> Document:
>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>
>>>> As part of a project to support common views across query engines like
>>>> Spark and Presto, my team used the view catalog API in Spark
>>>> implementation. The project has been in production over three months.
>>>>
>>>> Thanks,
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: SPIP: Catalog API for view metadata

Posted by John Zhuge <jz...@apache.org>.
Hi Wenchen,

Thanks for the feedback!

> 1. Add a new View API. How to avoid name conflicts between table and view?
> When resolving relation, shall we lookup table catalog first or view
> catalog?


See the clarification in the SPIP section "Proposed Changes - Namespace":

   - The proposed new view substitution rule and the changes to
   ResolveCatalogs should ensure the view catalog is looked up first for a
   "dual" catalog.
   - The implementation of a "dual" catalog plugin should ensure that:
      - Creating a view in the view catalog fails when a table of the same
      name exists.
      - Creating a table in the table catalog fails when a view of the same
      name exists.
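
The two creation checks in the list above can be sketched as follows. This is
a minimal Python model, assuming a "dual" catalog is a single plugin that
holds both views and tables; `DualCatalog`, `AlreadyExistsError`, and the
method names are hypothetical and are not the Spark ViewCatalog or
TableCatalog interfaces.

```python
class AlreadyExistsError(Exception):
    """Raised when a table or view already uses the requested name."""


class DualCatalog:
    """Toy catalog that keeps views and tables in one shared namespace."""

    def __init__(self):
        self._views = {}   # name -> view SQL text
        self._tables = {}  # name -> table schema (opaque string here)

    def create_view(self, name, sql):
        # A view must not shadow an existing table or view.
        if name in self._tables or name in self._views:
            raise AlreadyExistsError(name)
        self._views[name] = sql

    def create_table(self, name, schema):
        # A table must not shadow an existing view or table.
        if name in self._views or name in self._tables:
            raise AlreadyExistsError(name)
        self._tables[name] = schema
```

A real plugin would also need to make the check and the write atomic in its
backing store; this sketch ignores concurrency.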

I agree with you that a new View API is more flexible. A couple of notes:

   - We actually started a common view prototype using the single catalog
   approach, but as we added more and more view metadata, storing it in
   table properties became unmanageable, especially for features like
   "versioning". Eventually we opted for a view backend of JSON files in S3.
   - We'd like to move away from Hive Metastore.
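
To make the versioning point concrete, here is a hypothetical sketch of view
metadata kept as one JSON document with a version history. The field names
(`versions`, `version_id`, `current_version_id`, `sql`) and the helper
`add_version` are invented for illustration; they are not taken from the SPIP
or from the implementation described above.

```python
import json


def add_version(view_doc, sql):
    """Return a new view document with `sql` appended as the current version."""
    next_id = len(view_doc["versions"]) + 1
    versions = view_doc["versions"] + [{"version_id": next_id, "sql": sql}]
    return {**view_doc, "versions": versions, "current_version_id": next_id}


view_doc = {"name": "db.sales_summary", "versions": [], "current_version_id": None}
view_doc = add_version(view_doc, "SELECT region FROM db.sales")
view_doc = add_version(view_doc, "SELECT region, amount FROM db.sales")

# A nested history like this maps directly onto a single JSON file, whereas
# flat string-to-string table properties would need an ad hoc encoding.
serialized = json.dumps(view_doc, indent=2)
```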

For more details and discussion, see the SPIP section "Background and
Motivation".

Thanks,
John

On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cl...@gmail.com> wrote:

> Hi John,
>
> Thanks for working on this! View support is very important to the catalog
> plugin API.
>
> After reading your doc, I have one high-level question: should view be a
> separated API or it's just a special type of table?
>
> AFAIK in most databases, tables and views share the same namespace. You
> can't create a view if a same-name table exists. In Hive, view is just a
> special type of table, so they are in the same namespace naturally. If we
> have both table catalog and view catalog, we need a mechanism to make sure
> there are no name conflicts.
>
> On the other hand, the view metadata is very simple that can be put in
> table properties. I'd like to see more thoughts to evaluate these 2
> approaches:
> 1. *Add a new View API*. How to avoid name conflicts between table and
> view? When resolving relation, shall we lookup table catalog first or view
> catalog?
> 2. *Reuse the Table API*. How to indicate it's a view? What if we do want
> to store table and views separately?
>
> I think a new View API is more flexible. I'd vote for it if we can come up
> with a good mechanism to avoid name conflicts.
>
> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:
>
>> Hi Spark devs,
>>
>> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
>> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
>> can be considered for 3.2 or even 3.1.
>>
>> View catalog builds on top of the catalog plugin system introduced in
>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>> TableCatalog.
>>
>> Our internal implementation has been in production for over 8 months.
>> Recently we extended it to support materialized views, for the read path
>> initially.
>>
>> The PR has conflicts that I will resolve them shortly.
>>
>> Thanks,
>>
>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> In order to disassociate view metadata from Hive Metastore and support
>>> different storage backends, I am proposing a new view catalog API to load,
>>> create, alter, and drop views.
>>>
>>> Document:
>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>
>>> As part of a project to support common views across query engines like
>>> Spark and Presto, my team used the view catalog API in Spark
>>> implementation. The project has been in production over three months.
>>>
>>> Thanks,
>>> John Zhuge
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge

Re: SPIP: Catalog API for view metadata

Posted by Wenchen Fan <cl...@gmail.com>.
Hi John,

Thanks for working on this! View support is very important to the catalog
plugin API.

After reading your doc, I have one high-level question: should a view be a
separate API, or is it just a special type of table?

AFAIK in most databases, tables and views share the same namespace. You
can't create a view if a table with the same name exists. In Hive, a view is
just a special type of table, so they are in the same namespace naturally.
If we have both a table catalog and a view catalog, we need a mechanism to
make sure there are no name conflicts.

On the other hand, the view metadata is simple enough that it can be put in
table properties. I'd like to see more thoughts to evaluate these 2
approaches:
1. *Add a new View API*. How do we avoid name conflicts between tables and
views? When resolving a relation, shall we look up the table catalog first
or the view catalog?
2. *Reuse the Table API*. How do we indicate that it's a view? What if we do
want to store tables and views separately?
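
For the second approach, one conceivable way to indicate a view is a reserved
table property. The sketch below is a toy Python model under that assumption;
the property keys `is_view` and `view_sql` and both helper functions are made
up for illustration and are not defined by Spark or by the SPIP.

```python
from typing import Dict

# Hypothetical reserved property keys; neither is defined by Spark or the SPIP.
VIEW_MARKER_KEY = "is_view"
VIEW_SQL_KEY = "view_sql"


def make_view_properties(sql: str) -> Dict[str, str]:
    """Build the table properties that would represent a view."""
    return {VIEW_MARKER_KEY: "true", VIEW_SQL_KEY: sql}


def is_view(properties: Dict[str, str]) -> bool:
    """Treat a table as a view only if the marker property is set to "true"."""
    return properties.get(VIEW_MARKER_KEY) == "true"
```

Because tables and views then live in the same store, name conflicts cannot
occur, but storing them separately is no longer possible.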

I think a new View API is more flexible. I'd vote for it if we can come up
with a good mechanism to avoid name conflicts.

On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jz...@apache.org> wrote:

> Hi Spark devs,
>
> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
> can be considered for 3.2 or even 3.1.
>
> View catalog builds on top of the catalog plugin system introduced in
> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
> drop views. A catalog plugin can naturally implement both ViewCatalog and
> TableCatalog.
>
> Our internal implementation has been in production for over 8 months.
> Recently we extended it to support materialized views, for the read path
> initially.
>
> The PR has conflicts that I will resolve them shortly.
>
> Thanks,
>
> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jz...@apache.org> wrote:
>
>> Hi everyone,
>>
>> In order to disassociate view metadata from Hive Metastore and support
>> different storage backends, I am proposing a new view catalog API to load,
>> create, alter, and drop views.
>>
>> Document:
>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>> WIP PR: https://github.com/apache/spark/pull/28147
>>
>> As part of a project to support common views across query engines like
>> Spark and Presto, my team used the view catalog API in Spark
>> implementation. The project has been in production over three months.
>>
>> Thanks,
>> John Zhuge
>>
>
>
> --
> John Zhuge
>