Posted to dev@spark.apache.org by Wenchen Fan <cl...@gmail.com> on 2019/10/14 10:16:56 UTC

Re: [DISCUSS] ViewCatalog interface for DSv2

I'm fine with the view definition proposed here, but my major concern is
how to make sure tables and views share the same namespace. According to
the SQL spec, if there is a view named "a", we can't create a table named
"a" anymore.

We can document this requirement and ask implementations to guarantee it,
but it would be better if the API itself guaranteed it.
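
To illustrate what an API-level guarantee might look like, here is a
minimal Java sketch in which one catalog object keeps tables and views in
a single name map, so creating a table named "a" fails once a view "a"
exists. SharedNamespaceCatalog, RelationKind, and the method names are
hypothetical; they are not part of the DSv2 API or of John's proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are Spark's actual DSv2 API.
enum RelationKind { TABLE, VIEW }

/** A single catalog that owns one name space for both tables and views. */
final class SharedNamespaceCatalog {
    // One map keyed by name, so a table and a view can never coexist
    // under the same name.
    private final Map<String, RelationKind> relations = new HashMap<>();

    void createTable(String name) { create(name, RelationKind.TABLE); }

    void createView(String name) { create(name, RelationKind.VIEW); }

    private void create(String name, RelationKind kind) {
        // putIfAbsent returns the previous mapping, or null if the name was free.
        RelationKind existing = relations.putIfAbsent(name, kind);
        if (existing != null) {
            throw new IllegalStateException(
                "Cannot create " + kind + " '" + name + "': a " + existing
                    + " with that name already exists");
        }
    }

    /** Returns the kind registered under the name, or null if none. */
    RelationKind kindOf(String name) { return relations.get(name); }
}
```

With this shape, separate TableCatalog and ViewCatalog implementations
would not need to coordinate, because the conflict check lives in one place.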

On Wed, Aug 14, 2019 at 1:46 AM John Zhuge <jz...@apache.org> wrote:

> Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if
> that makes sense.
>
> I couldn't find much about view resolution and validation in SQL Spec
> Part 1. Anyone with deeper knowledge of the SQL spec, please chime in.
>
> Here is my understanding based on online manuals, docs, and other
> resources:
>
>    - A view has a name in the database schema so that other queries can
>    use it like a table.
>    - A view's schema is frozen at the time the view is created;
>    subsequent changes to underlying tables (e.g. adding a column) will not be
>    reflected in the view's schema. If an underlying table is dropped or
>    changed in an incompatible fashion, subsequent attempts to query the
>    invalid view will fail.
>
> In Presto, view columns are used for validation only (see
> StatementAnalyzer.Visitor#isViewStale):
>
>    - view column names must match the visible fields of analyzed view sql
>    - the visible fields can be coerced to view column types
>
> In Spark 2.2+, view columns are also used for validation (see
> CheckAnalysis#checkAnalysis case View):
>
>    - view column names must match the output fields of the view sql
>    - view column types must be able to UpCast to output field types
>
> Rule EliminateView adds a Project to viewQueryColumnNames if it exists.
>
> As for `softwareVersion`, the purpose is to track which software version
> was used to create the view, in preparation for different versions of the
> same software or even different software, such as Presto vs Spark.
>
>
> On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Thanks for working on this, John!
>>
>> I'd like to see a more complete write-up of what you're proposing.
>> Without that, I don't think we can have a productive discussion about this.
>>
>> For example, I think you're proposing to keep the view columns to ensure
>> that the same columns are produced by the view every time, based on
>> requirements from the SQL spec. Let's start by stating what those behavior
>> requirements are, so that everyone has the context to understand why your
>> proposal includes the view columns. Similarly, I'd like to know why you're
>> proposing `softwareVersion` in the view definition.
>>
>> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> Catalog support has been added to DSv2 along with a table catalog
>>> interface. Here I'd like to propose a view catalog interface, for the
>>> following benefits:
>>>
>>>    - Abstraction for view management thus allowing different view
>>>    backends
>>>    - Disassociation of view definition storage from Hive Metastore
>>>
>>> A catalog plugin can be both a TableCatalog and a ViewCatalog. An
>>> identifier is resolved as a view first, then as a table.
>>>
>>> More details in SPIP and PR if we decide to proceed. Here is a quick
>>> glance at the API:
>>>
>>> ViewCatalog interface:
>>>
>>>    - loadView
>>>    - listViews
>>>    - createView
>>>    - deleteView
>>>
>>> View interface:
>>>
>>>    - name
>>>    - originalSql
>>>    - defaultCatalog
>>>    - defaultNamespace
>>>    - viewColumns
>>>    - owner
>>>    - createTime
>>>    - softwareVersion
>>>    - options (map)
>>>
>>> ViewColumn interface:
>>>
>>>    - name
>>>    - type
>>>
>>>
>>> Thanks,
>>> John Zhuge
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>
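
For reference, the column checks John lists above (names must match, types
must be up-castable) could be sketched roughly as follows in Java. ColType,
Column, ViewValidator, and the widening rule are hypothetical, and so is
the direction of the cast (query output up-cast to the stored view column
type); this is not the Spark or Presto implementation.

```java
import java.util.List;

// Hypothetical sketch; these types are illustrative, not Spark or Presto classes.
enum ColType {
    INT, LONG, DOUBLE, STRING;

    // Assumed widening rule: a numeric type may widen to a later numeric
    // type in declaration order; STRING is only compatible with itself.
    boolean canUpCastTo(ColType target) {
        if (this == target) return true;
        if (this == STRING || target == STRING) return false;
        return this.ordinal() < target.ordinal();
    }
}

final class Column {
    final String name;
    final ColType type;

    Column(String name, ColType type) {
        this.name = name;
        this.type = type;
    }
}

final class ViewValidator {
    /**
     * Returns null when the stored view columns are still compatible with
     * the output of the analyzed view query, otherwise a short reason.
     */
    static String staleReason(List<Column> viewColumns, List<Column> queryOutput) {
        if (viewColumns.size() != queryOutput.size()) {
            return "column count changed: " + viewColumns.size()
                + " vs " + queryOutput.size();
        }
        for (int i = 0; i < viewColumns.size(); i++) {
            Column stored = viewColumns.get(i);
            Column actual = queryOutput.get(i);
            if (!stored.name.equals(actual.name)) {
                return "column " + i + " name changed: " + stored.name
                    + " vs " + actual.name;
            }
            if (!actual.type.canUpCastTo(stored.type)) {
                return "column " + stored.name + ": cannot up-cast "
                    + actual.type + " to " + stored.type;
            }
        }
        return null;
    }
}
```

A view stored as (id LONG, name STRING) would still validate if the
underlying query now yields (id INT, name STRING), but not if `id` became
a STRING or a column was renamed.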

Re: [DISCUSS] ViewCatalog interface for DSv2

Posted by John Zhuge <jz...@apache.org>.

Thanks for the feedback. I am preparing a doc and a PoC and will post them
soon.

-- 
John Zhuge