Posted to dev@iceberg.apache.org by Ryan Blue <bl...@tabular.io> on 2021/08/22 17:02:11 UTC

Re: Proposal: Support for views in Iceberg

Thanks for working on this, Anjali! It’s great to see the thorough
discussion here with everyone.

For the discussion about SQL dialect, I think that the right first step is
to capture the SQL or query dialect. That will give us the most
flexibility. Engines that can use Coral for translation can attempt to
convert, and engines that don’t can check whether the SQL is valid and can be used as-is.

I think that the idea to create a minimal IR is an interesting one, but can
be added later. We will always need to record the SQL and dialect, even if
we translate to IR because users are going to configure views using SQL.
Use cases like showing view history or debugging need to show the original SQL,
plus relevant information like where it was created and the SQL dialect. We
should be able to add this later by adding additional metadata to the view
definition. I don’t think that it would introduce breaking changes to add a
common representation that can be optionally consumed.
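As a purely illustrative sketch (all field names and values here are hypothetical, not the actual spec), view metadata that records the SQL verbatim together with its dialect might look like the JSON this small Python snippet emits:

```python
import json

# Hypothetical view metadata: the SQL is stored verbatim along with the
# dialect that produced it, so an engine can decide whether it can reuse
# or translate the definition. Field names are illustrative only.
view_metadata = {
    "view-version": 1,
    "representations": [
        {
            "type": "sql",
            "sql": "SELECT id, name FROM db.events WHERE ts > DATE '2021-01-01'",
            "dialect": "spark",
        }
    ],
    "schema": {
        "fields": [
            {"id": 1, "name": "id", "type": "long"},
            {"id": 2, "name": "name", "type": "string"},
        ]
    },
}

print(json.dumps(view_metadata, indent=2))
```

Because the metadata is JSON, adding a later common representation would just mean appending another entry to the representations list, which is why it need not be a breaking change.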

Let’s continue talking about a minimal IR separately. View translation is
a hard problem. Right now, to get views across engines we have to
compromise confidence. IR is a way to have strong confidence, but with
limited expressibility. I think that’s a good trade in a lot of cases and
is worth pursuing, even if it will take a long time.
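To make the limited-expressibility trade concrete: a minimal IR could be as small as a JSON-serializable tree covering only scans, filters, and projections. Everything below (operator names, node structure) is a hypothetical sketch, not a proposal:

```python
import json

# A toy IR: each node is a plain dict, so it serializes to JSON with no
# engine-specific machinery. Only scan/filter/project are modeled,
# mirroring the "strong confidence, limited expressibility" trade-off.
def scan(table):
    return {"op": "scan", "table": table}

def filter_(child, column, op, literal):
    predicate = {"column": column, "op": op, "literal": literal}
    return {"op": "filter", "predicate": predicate, "input": child}

def project(child, columns):
    return {"op": "project", "columns": columns, "input": child}

# SELECT id, name FROM db.events WHERE ts > '2021-01-01'
plan = project(filter_(scan("db.events"), "ts", ">", "2021-01-01"),
               ["id", "name"])
print(json.dumps(plan))
```

An engine that cannot express a view in such an IR would simply fall back to recording SQL plus dialect, which is why the two approaches can coexist.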

Jacques makes a great point about types, but I think that the right option
here is to continue using Iceberg types. We’ve already had discussions
about whether Iceberg should support annotating types with engine-specific
ones, so we have a reasonable way to improve this while also providing
compatibility across engines: char(n) is not necessarily supported
everywhere and mapping it to string will make sense in most places. The
schema is primarily used to validate that the data produced by the query
hasn’t changed and that is more about the number of columns in structs and
the names of fields rather than exact types. We can fix up types when
substituting without losing too much: if the SQL produces a varchar(10)
field that the view metadata says is a string, then it’s okay that it is
varchar(10). There is some loss in that we don’t know if it was originally
varchar(5), but I think that this is not going to cause too many issues.
Not all engines will even validate that the schema has not changed, since
it could be valid to use select * from x where ... and allow new fields to
appear.
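A sketch of the lenient validation described above, with hypothetical helper names: parameterized character types are normalized to string before comparing the stored view schema against what the SQL actually produced, so varchar(10) vs. string is not treated as a change:

```python
import re

def normalize(type_name):
    # Treat char(n)/varchar(n) as plain string, per the fix-up described above.
    if re.fullmatch(r"(var)?char\(\d+\)", type_name):
        return "string"
    return type_name

def schemas_compatible(stored, produced):
    # Validation cares mostly about field count and names, not exact types.
    if len(stored) != len(produced):
        return False
    return all(
        s_name == p_name and normalize(s_type) == normalize(p_type)
        for (s_name, s_type), (p_name, p_type) in zip(stored, produced)
    )

stored = [("id", "long"), ("name", "string")]
produced = [("id", "long"), ("name", "varchar(10)")]
print(schemas_compatible(stored, produced))  # True: varchar(10) fixes up to string
```

As noted, some engines may skip this check entirely, so it would be an optional safeguard rather than required behavior.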

Right now, I think we should move forward with the proposal as it is, and
pursue type annotations and possibly IR in parallel. Does that sound
reasonable to everyone?

Ryan

On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi Anjali,
>
> That's a nice summary.
>
> re dialect field. It shouldn't be much trouble to have it (or any other
> way to identify the application that created the view), and it might be useful.
> Why not make it required from the start?
>
> re "expanded/resolved SQL" -- I don't understand yet what we would put
> there, so I cannot comment.
>
> I agree that it's nice to get something out the door, and I see how
> the current proposal fits some needs already.
> However, i am concerned about the proliferation of non-cross-engine
> compatible views, if we do that.
>
> Also, if we later agree on any compatible approach (portable subset of
> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
> it would be a breaking change.
> Unless we make the compatible approach as expressive as full power of SQL,
> some views that are possible to create in v1 will not be possible to create
> in v2.
> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we
> may not be able to roll it out.
>
> > the convention of common SQL has been working for a majority of users.
> SQL features commonly used are column projections, simple filter
> application, joins, grouping and common aggregate and scalar function. A
> few users occasionally would like to use Trino or Spark specific functions
> but are sometimes able to find a way to use a function that is common to
> both the engines.
>
>
> it's an awesome summary of what constructs are necessary to be able to
> define useful views while also keeping them portable.
>
> To be able to express column projections, simple filter application,
> joins, grouping and common aggregate and scalar function in a structured
> IR, how much effort do you think would be required?
> We didn't really talk about downsides of a structured approach, other than
> it looks complex.
> If we indeed estimate it as a multi-year effort, I wouldn't argue for
> that. Maybe I was overly optimistic, though.
>
>
> As Jack mentioned, for an engine-specific approach that's not supposed to be
> consumed by multiple engines, we may be better served with an approach that's
> outside of the Iceberg spec, like https://github.com/trinodb/trino/pull/8540.
>
>
> Best,
> PF
>
>
>
>
>
> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
> <an...@netflix.com.invalid> wrote:
>
>> Hi,
>>
>> Thank you for all the comments. I will try to address them all here
>> together.
>>
>>
>>    - @all Cross engine compatibility of view definition: Multiple
>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>    We can all agree that all of these options are non-trivial to
>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>    and merit further discussion. I would like to suggest that we continue this
>>    discussion but target this work for the future (v2?). In v1, we can add an
>>    optional dialect field and an optional expanded/resolved SQL field that can
>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>    where the views are either accessed by a single engine or multi-engine use
>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>    desirable features such as versioning of views and a common format of
>>    storing view metadata while allowing extensibility in the future. *Does
>>    anyone feel strongly otherwise?*
>>    - @Piotr  As for common views at Netflix, the restrictions on SQL are
>>    not enforced, but are advised as best practices. The convention of common
>>    SQL has been working for a majority of users. SQL features commonly used
>>    are column projections, simple filter application, joins, grouping and
>>    common aggregate and scalar function. A few users occasionally would like
>>    to use Trino or Spark specific functions but are sometimes able to find a
>>    way to use a function that is common to both the engines.
>>    - @Jacques and @Jack Iceberg data types are engine agnostic and hence
>>    were picked for storing view schema. Thinking further, the schema field
>>    should be made 'optional', since not all engines require it. (e.g. Spark
>>    does not need it and Trino uses it only for validation).
>>    - @Jacques Table references in the views can be arbitrary objects
>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>    clarify it in the spec.
>>
>> I will work on incorporating all the comments in the spec and make the
>> next revision available for review soon.
>>
>> Regards,
>> Anjali.
>>
>>
>>
>>
>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks Jack and Jacques for sharing your thoughts.
>>>
>>> I agree that tracking dialect/origin is better than nothing.
>>> I think having a Map {dialect: sql} is not going to buy us much.
>>> I.e. it would be useful if there was some external app (or a human
>>> being) that would write those alternative SQLs for each dialect.
>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or
>>> Trino writing SQL for Trino and Spark.
>>>
>>> Thanks Jacques for a good summary of SQL supporting options.
>>> While I like the idea of starting with the Trino SQL ANTLR grammar file
>>> (it's really well written and resembles spec quite well), you made a good
>>> point that grammar is only part of the problem. Coercions, function
>>> resolution, dereference resolution, table resolution are part of query
>>> analysis that goes beyond just grammar.
>>> In fact, column scoping rules -- while clearly defined by the spec --
>>> may easily differ between engines (pretty usual).
>>> That's why I would rather lean towards some intermediate representation
>>> that is *not *SQL, doesn't require parsing (is already structural), nor
>>> analysis (no scopes! no implicit coercions!).
>>> Before we embark on such a journey, it would be interesting to hear @Martin
>>> Traverso <ma...@starburstdata.com> 's thoughts on feasibility though.
>>>
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>>
>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <ja...@gmail.com>
>>> wrote:
>>>
>>>> Some thoughts...
>>>>
>>>>    - In general, many engines want (or may require) a resolved sql
>>>>    field. This--at minimum--typically includes star expansion since
>>>>    traditional view behavior is that stars are expanded at view creation time
>>>>    (since this is the only way to guarantee that the view returns the same
>>>>    logical definition even if the underlying table changes). This may also
>>>>    include a replacement of relative object names to absolute object names
>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>    both of these things.
>>>>    - It isn't clear in the spec whether the table references used in
>>>>    views are restricted to other Iceberg objects or can be arbitrary objects
>>>>    in the context of a particular engine. Maybe I missed this? For example,
>>>>    can I have a Trino engine view that references an Elasticsearch table
>>>>    stored in an Iceberg view?
>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>    unintended consequences. I appreciate the attraction to it but I think it
>>>>    would either create artificial barriers around the types of SQL that are
>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>    change the behavior of the query which I believe violates most typical
>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>    issue where the view schema will be required to be different from a CTE
>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>    - It is important to remember that Calcite is a set of libraries
>>>>    and not a specification. There are things that can be specified in Calcite
>>>>    but in general it doesn't have formal specification as a first principle.
>>>>    It is more implementation as a first principle. This is in contrast to
>>>>    projects like Arrow and Iceberg, which start with well-formed
>>>>    specifications. I've been working with Calcite since before it was an
>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>    reference implementation standard for Iceberg view consumption,
>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>    must still be interpreted with regards to type promotion, function
>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>    fully resolved/comprehended.
>>>>
>>>> Given the above, I suggest having a field which describes the
>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>> establishing a formal SQL specification to be a similarly long process to
>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>> (Realistically, there is far more to specify here than there is in either
>>>> of those two domains.)
>>>>
>>>> Some other notes:
>>>>
>>>>    - Calcite does provide a nice reference document [3] but it is not
>>>>    sufficient to implement what is necessary for parsing/validating/resolving
>>>>    a SQL string correctly/consistently.
>>>>    - Projects like Coral [4] are interesting here but even Coral is
>>>>    based roughly on "HiveQL" which also doesn't have a formal specification
>>>>    process outside of the Hive version you're running. See this thread in
>>>>    Coral slack [5]
>>>>    - ZetaSQL [6] also seems interesting in this space. It feels closer
>>>>    to specification based [7] than Calcite but is much less popular in the big
>>>>    data domain. I also haven't reviewed its SQL completeness closely, a
>>>>    strength of Calcite.
>>>>    - One of the other problems with building against an implementation
>>>>    as opposed to a specification (e.g. Calcite) is it can make it difficult or
>>>>    near impossible to implement the same algorithms again without a bunch of
>>>>    reverse engineering. If interested in an example of this, see the
>>>>    discussion behind LZ4 deprecation on the Parquet spec [8] for how painful
>>>>    this kind of mistake can become.
>>>>    - I'd love to use the SQL specification itself but nobody actually
>>>>    implements that in its entirety and it has far too many places where things
>>>>    are "implementation-defined" [9].
>>>>
>>>> [1]
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>> [2]
>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>> [3] https://calcite.apache.org/docs/reference.html
>>>> [4] https://github.com/linkedin/coral
>>>> [5] https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>> [6] https://github.com/google/zetasql
>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>> [8]
>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>
>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> I did not notice that we are also discussing cross-engine
>>>>> interoperability here, so I will add my response from the design doc here.
>>>>>
>>>>> I would personally prefer cross-engine interoperability as a goal and
>>>>> get the spec in the right structure in the initial release, because:
>>>>>
>>>>> 1. I believe that cross-engine compatibility is a critical feature of
>>>>> Iceberg. If I am a user of an existing data lake that already supports
>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>> just use a table property to indicate the table is a view and store the
>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>> support all the view features. One of the most valuable and unique features
>>>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>>>> and read by another. Not supporting cross-engine compatibility feels like
>>>>> losing a lot of value to me.
>>>>>
>>>>> 2. In the view definition, it feels inconsistent to me that we have
>>>>> "schema" as an Iceberg native schema, but "sql" field as the view SQL that
>>>>> can come from any query engine. If the engine already needs to convert the
>>>>> view schema to an Iceberg schema, it should just do the same for the view SQL.
>>>>>
>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>> implementation of some intermediate representation. I don't have a very
>>>>> strong opinion, but my thoughts are the following:
>>>>>
>>>>> 1. Calcite is supposed to be the go-to software to deal with this kind
>>>>> of issue, but my personal concern is that the integration is definitely
>>>>> going to be much more involved, and it will become another barrier for
>>>>> newer engines to onboard because it not only needs to implement Iceberg
>>>>> APIs but also needs Calcite support. It will also start to become a
>>>>> constant discussion around what we maintain and what we should push to
>>>>> Calcite, similar to our situation today with Spark.
>>>>>
>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>> have a native lightweight logical query structure representation of the
>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>> extend on that to support this use case. Different engines can build this
>>>>> logical structure when traversing their own AST during a create view query.
>>>>>
>>>>> 3. With these considerations, I think the "sql" field can potentially
>>>>> be a map (maybe called "engine-sqls"?), where key is the engine type and
>>>>> version like "Spark 3.1", and value is the view SQL string. In this way,
>>>>> the engine that creates the view can still read the SQL directly which
>>>>> might lead to better engine-native integration and avoid redundant parsing.
>>>>> But in this approach there is always a default intermediate representation
>>>>> it can fallback to when the engine's key is not found in the map. If we
>>>>> want to make incremental progress and delay the design for the intermediate
>>>>> representation, I think we should at least use this map instead of just a
>>>>> single string.
>>>>>
>>>>> Thanks,
>>>>> Jack Ye
>>>>>
>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>> piotr@starburstdata.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> First of all thank you for this discussion and all the view-related
>>>>>> work!
>>>>>>
>>>>>> I agree that solving the cross-engine compatibility problem may not be a
>>>>>> primary feature today, but I am concerned that not thinking about this from
>>>>>> the start may "tunnel" us into a wrong direction.
>>>>>> Cross-engine compatible views would be such a cool feature that it is
>>>>>> hard to just let it pass.
>>>>>>
>>>>>> My thinking about a smaller IR may be a side-product of me not being
>>>>>> familiar enough with Calcite.
>>>>>> However, a new IR focused on a compatible representation, and not
>>>>>> tied to anything existing, would actually be a good thing.
>>>>>> For example, we need to focus on JSON representation, but we don't
>>>>>> need to deal with tree traversal or anything, so the code for this could be
>>>>>> pretty simple.
>>>>>>
>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>> these 'common' views across Spark and Trino)
>>>>>>
>>>>>> it's interesting. Anjali, do you have the means to enforce that, or is
>>>>>> this just a convention?
>>>>>>
>>>>>> What are the common building blocks (relational operations,
>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>> views?
>>>>>> Being able to enumerate them could help validate various approaches
>>>>>> considered here, including feasibility of dedicated representation.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> PF
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Anjali,
>>>>>>>
>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>> list once the spec has been approved by the community. My hope is that the
>>>>>>> final version of the view spec will make it easy to re-use existing
>>>>>>> rollback/time travel/metadata etc functionalities.
>>>>>>>
>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>> desirable, while storing Calcite makes it hard for e.g. Python to use views. A
>>>>>>> project to make a cross-language and cross-engine IR for SQL views and the
>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>> original spirit of Iceberg tables, and I wonder how the broader community
>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>
>>>>>>> Best,
>>>>>>> Ryan
>>>>>>>
>>>>>>> PS if anyone is interested in collaborating on engine agnostic views
>>>>>>> please reach out. I am keen on exploring this topic.
>>>>>>>
>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have
>>>>>>>> and will continue to address them in the doc.
>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>
>>>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>>>> spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>> implementation available quickly as a PR.
>>>>>>>>
>>>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>>>> community to help are:
>>>>>>>> 1. Time travel on views: Be able to access the view as of a version
>>>>>>>> or time
>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>>>> version
>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>
>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to provide
>>>>>>>> a storage format for view metadata and APIs to access that metadata. The
>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside the
>>>>>>>> scope of this spec.
>>>>>>>>
>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>> 2. Add a field to the view metadata to identify the dialect of the
>>>>>>>> SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>> operability
>>>>>>>> 3. Store AST produced by Calcite in the view metadata and translate
>>>>>>>> back and forth between engine-supported SQL and AST
>>>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>>>> functionality does it provide over Calcite?)
>>>>>>>>
>>>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>>>> incorporate any new fields needed to make the SQL truly compatible across
>>>>>>>> engines.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Anjali
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>
>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>> between query engines.
>>>>>>>>> How do you plan to address that?
>>>>>>>>>
>>>>>>>>> For example, I am familiar with Coral, and I appreciate its powers
>>>>>>>>> for dealing with legacy stuff like views defined by Hive.
>>>>>>>>> I treat it as a great technology supporting transitioning from a
>>>>>>>>> query engine to a better one.
>>>>>>>>> However, I would not base a design of some new system for storing
>>>>>>>>> cross-engine compatible views on that.
>>>>>>>>>
>>>>>>>>> Is there something else we can use?
>>>>>>>>> Maybe the view definition should use some intermediate structured
>>>>>>>>> language that's not SQL?
>>>>>>>>> For example, it could represent the logical structure of operations in
>>>>>>>>> a semantic manner.
>>>>>>>>> This would eliminate need for cross-engine compatible parsing and
>>>>>>>>> analysis.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <ry...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>
>>>>>>>>>> I have left some comments on the document. I unfortunately have
>>>>>>>>>> to miss the community meetup tomorrow but would love to chat more/help w/
>>>>>>>>>> implementation.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been implemented [1] and
>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>
>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anjali.
>>>>>>>>>>>
>>>>>>>>>>

-- 
Ryan Blue
Tabular

Re: Proposal: Support for views in Iceberg

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
Hi All,

Please see the spec in markdown format at the PR here
<https://github.com/apache/iceberg/pull/3188> to facilitate
adding/responding to comments. Please review.

thanks,
Anjali

On Tue, Sep 7, 2021 at 9:31 PM Jack Ye <ye...@gmail.com> wrote:

> Hi everyone,
>
> I have been thinking about the view support during the weekend, and I
> realize there is a conflict that Trino today already claims to support
> Iceberg view through Hive metastore.
>
> I believe we need to figure out a path forward around this issue before
> voting to pass the current proposal to avoid confusion for end users. I
> have summarized the issue here with a few different potential solutions:
>
>
> https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing
>
> Please let me know what you think.
>
> Best,
> Jack Ye
>
> On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cp...@gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <ja...@gmail.com>
>> wrote:
>>
>>>
>>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Would a physical plan be portable for the purpose of an engine-agnostic
>>>> view?
>>>>
>>>
>>> My goal is it would be. There may be optional "hints" that a particular
>>> engine could leverage and others wouldn't but I think the goal should be
>>> that the IR is entirely engine-agnostic. Even in the Arrow project proper,
>>> there are really two independent heavy-weight engines that have their own
>>> capabilities and trajectories (c++ vs rust).
>>>
>>>
>>>> Physical plan details seem specific to an engine to me, but maybe I'm
>>>> thinking too much about how Spark is implemented. My inclination would be
>>>> to accept only logical IR, which could just mean accepting a subset of the
>>>> standard.
>>>>
>>>
>>> I think it is very likely that different consumers will only support a
>>> subset of plans. That being said, I'm not sure what you're specifically
>>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>>> breadth of IR within Iceberg. If it is well specified, an engine can either
>>> choose to execute or not (same as the proposal wrt SQL syntax or if a
>>> function is missing on an engine). The engine may even have internal
>>> rewrites if it likes doing things a different way than what is requested.
>>>
>>
>> I also believe that consumers will not be expected to support all plans.
>> It will depend on the consumer, but many of the instantiations of Read/Write
>> relations won't be executable for many consumers, for example.
>>
>>
>>>
>>>
>>>> The document that Micah linked to is interesting, but I'm not sure that
>>>> our goals are aligned.
>>>>
>>>
>>> I think there is much commonality here and I'd argue it would be best to
>>> really try to see if a unified set of goals works well. I think Arrow IR is
>>> young enough that it can still be shaped/adapted. It may be that there
>>> should be some give or take on each side. It's possible that the goals are
>>> too far apart to unify but my gut is that they are close enough that we
>>> should try since it would be a great force multiplier.
>>>
>>>
>>>> For one thing, it seems to make assumptions about the IR being used for
>>>> Arrow data (at least in Wes' proposal), when I think that it may be easier
>>>> to be agnostic to vectorization.
>>>>
>>>
>>> Other than using the Arrow schema/types, I'm not at all convinced that
>>> the IR should be Arrow centric. I've actually argued to some that Arrow IR
>>> should be independent of Arrow to be its best self. Let's try to review it
>>> and see if/where we can avoid a tight coupling between plans and arrow
>>> specific concepts.
>>>
>>
>> Just to echo Jacques's comments here, the only thing that is Arrow
>> specific right now is the use of its type system. Literals, for example,
>> are encoded entirely in flatbuffers.
>>
>> Would love feedback on the current PR [1]. I'm looking to merge the first
>> iteration soonish, so please review at your earliest convenience.
>>
>>
>>>
>>>
>>>> It also delegates forward/backward compatibility to flatbuffers, when I
>>>> think compatibility should be part of the semantics and not delegated to
>>>> serialization. For example, if I have Join("inner", a.id, b.id) and I
>>>> evolve that to allow additional predicates Join("inner", a.id, b.id,
>>>> a.x < b.y) then just because I can deserialize it doesn't mean it is
>>>> compatible.
>>>>
>>>
>>> I don't think that flatbuffers alone can solve all compatibility
>>> problems. It can solve some and I'd expect that implementation libraries
>>> will have to solve others. Would love to hear if others disagree (and think
>>> flatbuffers can solve everything wrt compatibility).
>>>
>>
>> I agree, I think you need both to achieve sane versioning. The version
>> needs to be shipped along with the IR, and libraries need to be able to deal
>> with the different versions. I could be wrong, but I think it probably
>> makes more sense to start versioning the IR once the dust has settled a bit.
>>
>>
>>>
>>> J
>>>
>>
>> [1]: https://github.com/apache/arrow/pull/10934
>>
>

Re: Proposal: Support for views in Iceberg

Posted by Jack Ye <ye...@gmail.com>.
Hi everyone,

I have been thinking about the view support over the weekend, and I
realize there is a conflict: Trino today already claims to support
Iceberg views through the Hive metastore.

I believe we need to figure out a path forward around this issue before
voting to pass the current proposal, to avoid confusion for end users. I
have summarized the issue here with a few different potential solutions:

https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing

Please let me know what you think.

Best,
Jack Ye

On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cp...@gmail.com> wrote:

> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <ja...@gmail.com>
> wrote:
>
>>
>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Would a physical plan be portable for the purpose of an engine-agnostic
>>> view?
>>>
>>
>> My goal is that it would be. There may be optional "hints" that a particular
>> engine could leverage and others wouldn't but I think the goal should be
>> that the IR is entirely engine-agnostic. Even in the Arrow project proper,
>> there are really two independent heavy-weight engines that have their own
>> capabilities and trajectories (c++ vs rust).
>>
>>
>>> Physical plan details seem specific to an engine to me, but maybe I'm
>>> thinking too much about how Spark is implemented. My inclination would be
>>> to accept only logical IR, which could just mean accepting a subset of the
>>> standard.
>>>
>>
>> I think it is very likely that different consumers will only support a
>> subset of plans. That being said, I'm not sure what you're specifically
>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>> breadth of IR within Iceberg. If it is well specified, an engine can either
>> choose to execute or not (same as the proposal wrt SQL syntax or if a
>> function is missing on an engine). The engine may even have internal
>> rewrites if it likes doing things a different way than what is requested.
>>
>
> I also believe that consumers will not be expected to support all plans.
> It will depend on the consumer, but many of the instantiations of Read/Write
> relations won't be executable for many consumers, for example.
>
>
>>
>>
>>> The document that Micah linked to is interesting, but I'm not sure that
>>> our goals are aligned.
>>>
>>
>> I think there is much commonality here and I'd argue it would be best to
>> really try to see if a unified set of goals works well. I think Arrow IR is
>> young enough that it can still be shaped/adapted. It may be that there
>> should be some give or take on each side. It's possible that the goals are
>> too far apart to unify but my gut is that they are close enough that we
>> should try since it would be a great force multiplier.
>>
>>
>>> For one thing, it seems to make assumptions about the IR being used for
>>> Arrow data (at least in Wes' proposal), when I think that it may be easier
>>> to be agnostic to vectorization.
>>>
>>
>> Other than using the Arrow schema/types, I'm not at all convinced that
>> the IR should be Arrow centric. I've actually argued to some that Arrow IR
>> should be independent of Arrow to be its best self. Let's try to review it
>> and see if/where we can avoid a tight coupling between plans and arrow
>> specific concepts.
>>
>
> Just to echo Jacques's comments here, the only thing that is Arrow
> specific right now is the use of its type system. Literals, for example,
> are encoded entirely in flatbuffers.
>
> Would love feedback on the current PR [1]. I'm looking to merge the first
> iteration soonish, so please review at your earliest convenience.
>
>
>>
>>
>>> It also delegates forward/backward compatibility to flatbuffers, when I
>>> think compatibility should be part of the semantics and not delegated to
>>> serialization. For example, if I have Join("inner", a.id, b.id) and I
>>> evolve that to allow additional predicates Join("inner", a.id, b.id,
>>> a.x < b.y) then just because I can deserialize it doesn't mean it is
>>> compatible.
>>>
>>
>> I don't think that flatbuffers alone can solve all compatibility
>> problems. It can solve some and I'd expect that implementation libraries
>> will have to solve others. Would love to hear if others disagree (and think
>> flatbuffers can solve everything wrt compatibility).
>>
>
> I agree, I think you need both to achieve sane versioning. The version
> needs to be shipped along with the IR, and libraries need to be able to deal
> with the different versions. I could be wrong, but I think it probably
> makes more sense to start versioning the IR once the dust has settled a bit.
>
>
>>
>> J
>>
>
> [1]: https://github.com/apache/arrow/pull/10934
>

Re: Proposal: Support for views in Iceberg

Posted by Phillip Cloud <cp...@gmail.com>.
On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <ja...@gmail.com>
wrote:

>
> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Would a physical plan be portable for the purpose of an engine-agnostic
>> view?
>>
>
> My goal is that it would be. There may be optional "hints" that a particular
> engine could leverage and others wouldn't but I think the goal should be
> that the IR is entirely engine-agnostic. Even in the Arrow project proper,
> there are really two independent heavy-weight engines that have their own
> capabilities and trajectories (c++ vs rust).
>
>
>> Physical plan details seem specific to an engine to me, but maybe I'm
>> thinking too much about how Spark is implemented. My inclination would be
>> to accept only logical IR, which could just mean accepting a subset of the
>> standard.
>>
>
> I think it is very likely that different consumers will only support a
> subset of plans. That being said, I'm not sure what you're specifically
> trying to mitigate or avoid. I'd be inclined to simply allow the full
> breadth of IR within Iceberg. If it is well specified, an engine can either
> choose to execute or not (same as the proposal wrt SQL syntax or if a
> function is missing on an engine). The engine may even have internal
> rewrites if it likes doing things a different way than what is requested.
>

I also believe that consumers will not be expected to support all plans. It
will depend on the consumer, but many of the instantiations of Read/Write
relations won't be executable for many consumers, for example.
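Phillip's point above, that each consumer will only be able to execute a subset of plans, can be pictured as a capability check that runs before execution. Everything in this sketch (class names, plan shape) is illustrative; it is not the actual Arrow IR.

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    kind: str                        # e.g. "read", "filter", "join", "write"
    inputs: list = field(default_factory=list)

def unsupported_kinds(plan: Relation, supported: set) -> set:
    """Collect relation kinds in the plan that this consumer cannot execute."""
    missing, stack = set(), [plan]
    while stack:
        node = stack.pop()
        if node.kind not in supported:
            missing.add(node.kind)
        stack.extend(node.inputs)
    return missing

# An engine that can read and filter, but has no join or write operator:
plan = Relation("write", [Relation("join", [Relation("read"), Relation("read")])])
print(sorted(unsupported_kinds(plan, {"read", "filter"})))  # ['join', 'write']
```

A consumer built this way can reject a plan up front instead of failing partway through execution.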


>
>
>> The document that Micah linked to is interesting, but I'm not sure that
>> our goals are aligned.
>>
>
> I think there is much commonality here and I'd argue it would be best to
> really try to see if a unified set of goals works well. I think Arrow IR is
> young enough that it can still be shaped/adapted. It may be that there
> should be some give or take on each side. It's possible that the goals are
> too far apart to unify but my gut is that they are close enough that we
> should try since it would be a great force multiplier.
>
>
>> For one thing, it seems to make assumptions about the IR being used for
>> Arrow data (at least in Wes' proposal), when I think that it may be easier
>> to be agnostic to vectorization.
>>
>
> Other than using the Arrow schema/types, I'm not at all convinced that the
> IR should be Arrow centric. I've actually argued to some that Arrow IR
> should be independent of Arrow to be its best self. Let's try to review it
> and see if/where we can avoid a tight coupling between plans and arrow
> specific concepts.
>

Just to echo Jacques's comments here, the only thing that is Arrow specific
right now is the use of its type system. Literals, for example, are encoded
entirely in flatbuffers.

Would love feedback on the current PR [1]. I'm looking to merge the first
iteration soonish, so please review at your earliest convenience.


>
>
>> It also delegates forward/backward compatibility to flatbuffers, when I
>> think compatibility should be part of the semantics and not delegated to
>> serialization. For example, if I have Join("inner", a.id, b.id) and I
>> evolve that to allow additional predicates Join("inner", a.id, b.id, a.x
>> < b.y) then just because I can deserialize it doesn't mean it is compatible.
>>
>
> I don't think that flatbuffers alone can solve all compatibility problems.
> It can solve some and I'd expect that implementation libraries will have to
> solve others. Would love to hear if others disagree (and think flatbuffers
> can solve everything wrt compatibility).
>

I agree, I think you need both to achieve sane versioning. The version
needs to be shipped along with the IR, and libraries need to be able to deal
with the different versions. I could be wrong, but I think it probably
makes more sense to start versioning the IR once the dust has settled a bit.
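One way to picture the versioning described above, as a sketch only (the JSON envelope and field names are assumptions, not the Arrow IR wire format):

```python
import json

SUPPORTED_VERSIONS = {1}  # versions this library knows how to interpret

def serialize(plan: dict, version: int = 1) -> bytes:
    # Ship the IR version alongside the serialized plan.
    return json.dumps({"ir_version": version, "plan": plan}).encode()

def deserialize(payload: bytes) -> dict:
    envelope = json.loads(payload)
    version = envelope["ir_version"]
    if version not in SUPPORTED_VERSIONS:
        # Refuse rather than misinterpret a plan from a newer producer.
        raise ValueError(f"unsupported IR version: {version}")
    return envelope["plan"]

plan = deserialize(serialize({"op": "read", "table": "t"}))
print(plan["op"])  # read
```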


>
> J
>

[1]: https://github.com/apache/arrow/pull/10934

Re: Proposal: Support for views in Iceberg

Posted by Jacques Nadeau <ja...@gmail.com>.
On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <bl...@tabular.io> wrote:

> Would a physical plan be portable for the purpose of an engine-agnostic
> view?
>

My goal is that it would be. There may be optional "hints" that a particular
engine could leverage and others wouldn't but I think the goal should be
that the IR is entirely engine-agnostic. Even in the Arrow project proper,
there are really two independent heavy-weight engines that have their own
capabilities and trajectories (c++ vs rust).


> Physical plan details seem specific to an engine to me, but maybe I'm
> thinking too much about how Spark is implemented. My inclination would be
> to accept only logical IR, which could just mean accepting a subset of the
> standard.
>

I think it is very likely that different consumers will only support a
subset of plans. That being said, I'm not sure what you're specifically
trying to mitigate or avoid. I'd be inclined to simply allow the full
breadth of IR within Iceberg. If it is well specified, an engine can either
choose to execute or not (same as the proposal wrt SQL syntax or if a
function is missing on an engine). The engine may even have internal
rewrites if it likes doing things a different way than what is requested.


> The document that Micah linked to is interesting, but I'm not sure that
> our goals are aligned.
>

I think there is much commonality here and I'd argue it would be best to
really try to see if a unified set of goals works well. I think Arrow IR is
young enough that it can still be shaped/adapted. It may be that there
should be some give or take on each side. It's possible that the goals are
too far apart to unify but my gut is that they are close enough that we
should try since it would be a great force multiplier.


> For one thing, it seems to make assumptions about the IR being used for
> Arrow data (at least in Wes' proposal), when I think that it may be easier
> to be agnostic to vectorization.
>

Other than using the Arrow schema/types, I'm not at all convinced that the
IR should be Arrow centric. I've actually argued to some that Arrow IR
should be independent of Arrow to be its best self. Let's try to review it
and see if/where we can avoid a tight coupling between plans and arrow
specific concepts.


> It also delegates forward/backward compatibility to flatbuffers, when I
> think compatibility should be part of the semantics and not delegated to
> serialization. For example, if I have Join("inner", a.id, b.id) and I
> evolve that to allow additional predicates Join("inner", a.id, b.id, a.x
> < b.y) then just because I can deserialize it doesn't mean it is compatible.
>

I don't think that flatbuffers alone can solve all compatibility problems.
It can solve some and I'd expect that implementation libraries will have to
solve others. Would love to hear if others disagree (and think flatbuffers
can solve everything wrt compatibility).

J

>

Re: Proposal: Support for views in Iceberg

Posted by Ryan Blue <bl...@tabular.io>.
Would a physical plan be portable for the purpose of an engine-agnostic
view?

Physical plan details seem specific to an engine to me, but maybe I'm
thinking too much about how Spark is implemented. My inclination would be
to accept only logical IR, which could just mean accepting a subset of the
standard. The document that Micah linked to is interesting, but I'm not
sure that our goals are aligned. For one thing, it seems to make
assumptions about the IR being used for Arrow data (at least in Wes'
proposal), when I think that it may be easier to be agnostic to
vectorization.

It also delegates forward/backward compatibility to flatbuffers, when I
think compatibility should be part of the semantics and not delegated to
serialization. For example, if I have Join("inner", a.id, b.id) and I
evolve that to allow additional predicates Join("inner", a.id, b.id, a.x <
b.y) then just because I can deserialize it doesn't mean it is compatible.
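The Join example can be made concrete with a small sketch. The dict-based encoding below is purely illustrative, not any real IR; the point is that successful deserialization does not imply the consumer understands the plan's semantics:

```python
def v1_interpret_join(join: dict) -> str:
    # A naive consumer that only knows the v1 fields of Join.
    return f"{join['type']} JOIN ON {join['left_key']} = {join['right_key']}"

v2_join = {
    "type": "inner",
    "left_key": "a.id",
    "right_key": "b.id",
    "extra_predicate": "a.x < b.y",   # field added in v2
}

# Deserialization and field access succeed...
print(v1_interpret_join(v2_join))  # inner JOIN ON a.id = b.id

# ...but the result silently drops the extra predicate. A safe consumer
# must detect fields it does not understand and reject the plan:
known = {"type", "left_key", "right_key"}
unknown = set(v2_join) - known
print(unknown)  # {'extra_predicate'}
```

That rejection rule is a semantic compatibility policy; the serialization layer alone cannot enforce it.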

Ryan

On Thu, Aug 26, 2021 at 2:36 PM Jacques Nadeau <ja...@gmail.com>
wrote:

> The original proposal was more focused on executable plans (as opposed to
> the logical plans). However, I'm arguing that it should include both
> logical and physical since the two are fairly heavily overlapping. (See how
> much logic is in the base classes for most Calcite operators as opposed to
> the concrete logical or physical variations.) If other people agree with my
> proposal that Arrow IR should be able to express both logical and physical,
> then my hope was it could potentially serve the purposes discussed here for
> an engine-agnostic version of a view.
>
> +Phillip Cloud <cp...@gmail.com> here as he is actively working on the
> Arrow IR stuff (i've also suggested he subscribe to iceberg dev).
>
>
>
> On Thu, Aug 26, 2021 at 1:51 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> I'll post a link to this thread on dev@arrow so people not subscribed
>> to both can weigh in. Should we maybe start a new thread about IR?
>>
>>
>> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit
>> is the document that Wes wrote up that started the conversation.
>>
>> On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Micah or someone more familiar (Jacques?), could you summarize some of
>>> the details of the Arrow IR proposal? What are the intended use cases and
>>> goals? We'd want to see whether the goals align with our own.
>>>
>>> On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> Small tangent:
>>>>
>>>> Regarding an IR, Apache Arrow is looking at this now [1] with several
>>>> RFC proposals.  It would be nice to coordinate, at least to make sure
>>>> translation from one IR to another isn't too onerous (in an ideal world we
>>>> could maybe share the same one).  Either way external feedback would be
>>>> useful.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>>
>>>> [1]
>>>> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=EjD0QpL9LYGYnq2jaGXo+LfeR5uw@mail.gmail.com%3e
>>>>
>>>> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> I think that the current proposal is looking good, but it is always a
>>>>> good idea to give people a few days to review it and bring up any issues or
>>>>> further discussion on the topics in this thread.
>>>>>
>>>>> I'll also add this to the next sync agenda so we can farm for dissent
>>>>> next week. Sometimes you can get a better read on what is still concerning
>>>>> in person.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>>>>> <an...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hello All,
>>>>>> I answered all the comments and made changes to the spec to reflect
>>>>>> them .. the doc now shows the new revision (link here:
>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>>>>> ).
>>>>>> Please diff with the version named 'Spec v1' in order to see the
>>>>>> deltas. Please let me know if any of your comments are not
>>>>>> satisfactorily addressed.
>>>>>>
>>>>>> @Ryan, Thank you for the insightful feedback, especially around the
>>>>>> use of Iceberg data types for schema and possible evolution.
>>>>>> I agree with your comment: *"I think we should move forward with the
>>>>>> proposal as it is, and pursue type annotations and possibly IR in
>>>>>> parallel."*
>>>>>> How do we achieve consensus/farm for dissent and move forward with
>>>>>> the proposal?
>>>>>>
>>>>>> thanks,
>>>>>> Anjali.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>>>>> discussion here with everyone.
>>>>>>>>
>>>>>>>> For the discussion about SQL dialect, I think that the right first
>>>>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>>>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>>>>>>
>>>>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>>>>> but can be added later. We will always need to record the SQL and dialect,
>>>>>>>> even if we translate to IR because users are going to configure views using
>>>>>>>> SQL. Uses like showing view history or debugging need to show the original
>>>>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>>>>> dialect. We should be able to add this later by adding additional metadata
>>>>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>>>>> changes to add a common representation that can be optionally consumed.
>>>>>>>>
>>>>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>>>>> have to compromise confidence. IR is a way to have strong confidence, but
>>>>>>>> with limited expressibility. I think that’s a good trade in a lot of cases
>>>>>>>> and is worth pursuing, even if it will take a long time.
>>>>>>>>
>>>>>>>> Jacques makes a great point about types, but I think that the right
>>>>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>>>>> discussions about whether Iceberg should support annotating types with
>>>>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>>>>> also providing compatibility across engines: char(n) is not
>>>>>>>> necessarily supported everywhere and mapping it to string will
>>>>>>>> make sense in most places. The schema is primarily used to validate that
>>>>>>>> the data produced by the query hasn’t changed and that is more about the
>>>>>>>> number of columns in structs and the names of fields rather than exact
>>>>>>>> types. We can fix up types when substituting without losing too much: if
>>>>>>>> the SQL produces a varchar(10) field that the view metadata says
>>>>>>>> is a string, then it’s okay that it is varchar(10). There is some
>>>>>>>> loss in that we don’t know if it was originally varchar(5), but I
>>>>>>>> think that this is not going to cause too many issues. Not all engines will
>>>>>>>> even validate that the schema has not changed, since it could be valid to
>>>>>>>> use select * from x where ... and allow new fields to appear.
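The type fix-up described above might look something like this sketch; the mapping function and its rules are invented for illustration and are not Iceberg code:

```python
import re

def to_iceberg_type(engine_type: str) -> str:
    """Map an engine-specific type name to an Iceberg type name.

    char(n) and varchar(n) collapse to string, losing the length bound,
    which is the (acceptable) information loss discussed above.
    """
    t = engine_type.lower().strip()
    if re.fullmatch(r"(var)?char\(\d+\)", t):
        return "string"
    return t

print(to_iceberg_type("varchar(10)"))  # string
print(to_iceberg_type("char(3)"))      # string
print(to_iceberg_type("int"))          # int
```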
>>>>>>>>
>>>>>>>> Right now, I think we should move forward with the proposal as it
>>>>>>>> is, and pursue type annotations and possibly IR in parallel. Does that
>>>>>>>> sound reasonable to everyone?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Anjali,
>>>>>>>>>
>>>>>>>>> That's a nice summary.
>>>>>>>>>
>>>>>>>>> re dialect field. It shouldn't be much trouble to have it (or any
>>>>>>>>> other way to identify the application that created the view), and it
>>>>>>>>> might be useful.
>>>>>>>>> Why not make it required from the start?
>>>>>>>>>
>>>>>>>>> re "expanded/resolved SQL" -- I don't understand yet what we would
>>>>>>>>> put there, so I cannot comment.
>>>>>>>>>
>>>>>>>>> I agree that it's nice to get something out the door, and I
>>>>>>>>> see how the current proposal fits some needs already.
>>>>>>>>> However, I am concerned about the proliferation of
>>>>>>>>> non-cross-engine compatible views, if we do that.
>>>>>>>>>
>>>>>>>>> Also, if we later agree on any compatible approach (portable
>>>>>>>>> subset of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>>>>> engine, it would be a breaking change.
>>>>>>>>> Unless we make the compatible approach as expressive as full power
>>>>>>>>> of SQL, some views that are possible to create in v1 will not be possible
>>>>>>>>> to create in v2.
>>>>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely
>>>>>>>>> compatible", we may not be able to roll it out.
>>>>>>>>>
>>>>>>>>> > the convention of common SQL has been working for a majority of
>>>>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>>>>>> few users occasionally would like to use Trino or Spark specific functions
>>>>>>>>> but are sometimes able to find a way to use a function that is common to
>>>>>>>>> both the engines.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> it's an awesome summary of what constructs are necessary to be
>>>>>>>>> able to define useful views, while also keep them portable.
>>>>>>>>>
>>>>>>>>> To be able to express column projections, simple filter
>>>>>>>>> application, joins, grouping and common aggregate and scalar function in a
>>>>>>>>> structured IR, how much effort do you think would be required?
>>>>>>>>> We didn't really talk about downsides of a structured approach,
>>>>>>>>> other than it looks complex.
>>>>>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue
>>>>>>>>> for that. Maybe I was overly optimistic though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As Jack mentioned, for engine-specific approach that's not
>>>>>>>>> supposed to be consumed by multiple engines, we may be better served with
>>>>>>>>> approach that's outside of Iceberg spec, like
>>>>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Thank you for all the comments. I will try to address them all
>>>>>>>>>> here together.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - @all Cross engine compatibility of view definition:
>>>>>>>>>>    Multiple options such as engine agnostic SQL or IR of some form have been
>>>>>>>>>>    mentioned. We can all agree that all of these options are non-trivial to
>>>>>>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>>>>    SQL are not enforced, but are advised as best practices. The convention of
>>>>>>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic
>>>>>>>>>>    and hence were picked for storing view schema. Thinking further, the schema
>>>>>>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>>>>    objects such as tables from other catalogs or elasticsearch tables etc. I
>>>>>>>>>>    will clarify it in the spec.
>>>>>>>>>>
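A purely hypothetical sketch of what a view representation with the optional dialect and expanded/resolved SQL fields described above could look like (field names are invented for illustration; the spec defines the actual layout):

```python
view_representation = {
    "type": "sql",
    "sql": "SELECT id, name FROM db.users WHERE active",
    # Optional: which dialect/engine produced this SQL.
    "dialect": "spark",
    # Optional: stars expanded and names fully qualified at creation time,
    # so the view keeps the same logical definition if tables change.
    "expanded-sql": (
        "SELECT users.id, users.name "
        "FROM prod_catalog.db.users WHERE users.active"
    ),
}

print(view_representation["dialect"])  # spark
```

An engine that understands the dialect can use the original SQL directly; others may attempt the expanded form, or refuse the view.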
>>>>>>>>>> I will work on incorporating all the comments in the spec and
>>>>>>>>>> make the next revision available for review soon.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Anjali.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>>>>>
>>>>>>>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>>>>>> I.e. it would be useful if there was some external app (or a
>>>>>>>>>>> human being) that would write those alternative SQLs for each dialect.
>>>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and
>>>>>>>>>>> Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>>>
>>>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>>>> While I like the idea of starting with the Trino SQL ANTLR grammar
>>>>>>>>>>> file (it's really well written and resembles the spec quite well), you made a
>>>>>>>>>>> good point that grammar is only part of the problem. Coercions, function
>>>>>>>>>>> resolution, dereference resolution, table resolution are part of query
>>>>>>>>>>> analysis that goes beyond just grammar.
>>>>>>>>>>> In fact, column scoping rules -- while clearly defined by the
>>>>>>>>>>> spec -- may easily differ between engines (pretty usual).
>>>>>>>>>>> That's why I would rather lean towards some intermediate
>>>>>>>>>>> representation that is *not* SQL, doesn't require parsing (is
>>>>>>>>>>> already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>>>>>> Before we embark on such a journey, it would be interesting to
>>>>>>>>>>> hear @Martin Traverso <ma...@starburstdata.com> 's thoughts on
>>>>>>>>>>> feasibility though.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> PF
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Some thoughts...
>>>>>>>>>>>>
>>>>>>>>>>>>    - In general, many engines want (or may require) a resolved
>>>>>>>>>>>>    sql field. This--at minimum--typically includes star expansion since
>>>>>>>>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>>>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>>>>>>>    both of these things.
>>>>>>>>>>>>    - It isn't clear in the spec whether the table references
>>>>>>>>>>>>    used in views are restricted to other Iceberg objects or can be arbitrary
>>>>>>>>>>>>    objects in the context of a particular engine. Maybe I missed this? For
>>>>>>>>>>>>    example, can I have a Trino engine view that references an Elasticsearch
>>>>>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead
>>>>>>>>>>>>    to unintended consequences. I appreciate the attraction to it but I think
>>>>>>>>>>>>    it would either create artificial barriers around the types of SQL that are
>>>>>>>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>>>>>>>    engine behaviors. A good example of this is the simple SQL statement of
>>>>>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>>>>    libraries and not a specification. There are things that can be specified
>>>>>>>>>>>>    in Calcite but in general it doesn't have formal specification as a first
>>>>>>>>>>>>    principle. It is more implementation as a first principle. This is in
>>>>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with well-formed
>>>>>>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>>>>>>    fully resolved/comprehended.
>>>>>>>>>>>>
>>>>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>>>>>>> of those two domains.)
>>>>>>>>>>>>
>>>>>>>>>>>> Some other notes:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Calcite does provide a nice reference document [3] but it
>>>>>>>>>>>>    is not sufficient to implement what is necessary for
>>>>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>>>>    - Projects like Coral [4] are interesting here but even
>>>>>>>>>>>>    Coral is based roughly on "HiveQL" which also doesn't have a formal
>>>>>>>>>>>>    specification process outside of the Hive version you're running. See this
>>>>>>>>>>>>    thread in Coral slack [5]
>>>>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It
>>>>>>>>>>>>    feels closer to specification based [7] than Calcite but is much less
>>>>>>>>>>>>    popular in the big data domain. I also haven't reviewed its SQL
>>>>>>>>>>>>    completeness closely, a strength of Calcite.
>>>>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>>>>> [5]
>>>>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>>>>> [7]
>>>>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>>>>> [8]
>>>>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would personally prefer cross-engine interoperability as a
>>>>>>>>>>>>> goal and get the spec in the right structure in the initial release,
>>>>>>>>>>>>> because:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that already
>>>>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this view
>>>>>>>>>>>>> feature. I can do what is now done for Trino to use views with Iceberg. I
>>>>>>>>>>>>> can also just use a table property to indicate the table is a view and
>>>>>>>>>>>>> store the view SQL as a table property and do my own thing in any query
>>>>>>>>>>>>> engine to support all the view features. One of the most valuable and
>>>>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to be
>>>>>>>>>>>>> created in one engine and read by another. Not supporting cross-engine
>>>>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the view SQL
>>>>>>>>>>>>> that can come from any query engine. If the engine already needs to convert
>>>>>>>>>>>>> the view schema to an Iceberg schema, it should just do the same for the view
>>>>>>>>>>>>> SQL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>>>>> Apache Calcite (or some other third party alternative I don't know) or our
>>>>>>>>>>>>> own implementation of some intermediate representation. I don't have a very
>>>>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with
>>>>>>>>>>>>> this kind of issue, but my personal concern is that the integration is
>>>>>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested,
>>>>>>>>>>>>> is to have a native lightweight logical query structure representation of
>>>>>>>>>>>>> the view SQL and store that instead of the SQL string. We already deal with
>>>>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>>>>>> instead of just a single string.
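[Editor's note: the fallback behavior Jack describes for an "engine-sqls" map could be sketched as follows. All field names here are illustrative only; none of them are defined by the Iceberg view spec.]

```python
# Hypothetical view definition carrying per-engine SQL plus a default
# representation to fall back to. Field names are invented for this
# sketch; they are not part of any spec.
view_definition = {
    "default-sql": "SELECT id, name FROM db.events",
    "engine-sqls": {
        "spark-3.1": "SELECT id, name FROM db.events",
        "trino-359": "SELECT /* trino */ id, name FROM db.events",
    },
}

def sql_for_engine(view, engine_key):
    # Prefer the SQL recorded for this exact engine; otherwise fall
    # back to the default representation.
    return view["engine-sqls"].get(engine_key, view["default-sql"])
```

An engine key absent from the map (say, a hypothetical "flink-1.13") would simply receive the default representation.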
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>>>>> view-related work!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may
>>>>>>>>>>>>>> not be a primary feature today, but I am concerned that not thinking about
>>>>>>>>>>>>>> this from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature
>>>>>>>>>>>>>> that it is hard to just let it pass.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me
>>>>>>>>>>>>>> not being familiar enough with Calcite.
>>>>>>>>>>>>>> However, with new IR being focused on compatible
>>>>>>>>>>>>>> representation, and not being tied to anything are actually good things.
>>>>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for this
>>>>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> it's interesting. Anjali, do you have means to enforce that,
>>>>>>>>>>>>>> or is this just a convention?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>>>>>>> views?
>>>>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>>>>> representation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>>>>> first list once the spec has been approved by the community. My hope is
>>>>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>>>>> desirable, while storing Calcite ASTs makes it hard for e.g. Python to use views. A
>>>>>>>>>>>>>>> project to make a cross language and cross engine IR for sql views and the
>>>>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>>>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine
>>>>>>>>>>>>>>> agnostic views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments.
>>>>>>>>>>>>>>>> I have and will continue to address them in the doc.
>>>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks to everybody who has offered help with
>>>>>>>>>>>>>>>> implementation. The spec as it is proposed in the doc has been implemented
>>>>>>>>>>>>>>>> and is in use at Netflix (currently on Iceberg 0.9). Once we close the
>>>>>>>>>>>>>>>> spec, we will rebase our code to Iceberg-0.12 and incorporate changes to
>>>>>>>>>>>>>>>> format and other feedback from the community and should be able to make
>>>>>>>>>>>>>>>> this MVP implementation available quickly as a PR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A few areas that we have not yet worked on and would love
>>>>>>>>>>>>>>>> for the community to help are:
>>>>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>>>>> version or time
>>>>>>>>>>>>>>>> 2. History table: A system table implementation for
>>>>>>>>>>>>>>>> $versions similar to the $snapshots table in order to display the history
>>>>>>>>>>>>>>>> of a view
>>>>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>>>>> previous version
>>>>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One comment that is worth a broader discussion is the
>>>>>>>>>>>>>>>> dialect of the SQL stored in the view metadata. The purpose of the spec is
>>>>>>>>>>>>>>>> to provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question and is
>>>>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect
>>>>>>>>>>>>>>>> of the SQL. This allows for any desired dialect, but no improved
>>>>>>>>>>>>>>>> cross-engine operability
>>>>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given that the view metadata is json, it is easily
>>>>>>>>>>>>>>>> extendable to incorporate any new fields needed to make the SQL truly
>>>>>>>>>>>>>>>> compatible across engines.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>> Anjali
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>>>>> Of course, this is by no means usable by other query
>>>>>>>>>>>>>>>>> engines.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Anjali, your document does not talk much about
>>>>>>>>>>>>>>>>> compatibility between query engines.
>>>>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate
>>>>>>>>>>>>>>>>> its powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>>>>>>>>> This would eliminate the need for cross-engine compatible
>>>>>>>>>>>>>>>>> parsing and analysis.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>>>>> rymurr@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have left some comments on the document. I
>>>>>>>>>>>>>>>>>> unfortunately have to miss the community meetup tomorrow but would love to
>>>>>>>>>>>>>>>>>> chat more/help w/ implementation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following
>>>>>>>>>>>>>>>>>>> spec for storing view metadata in Iceberg. The proposal has been
>>>>>>>>>>>>>>>>>>> implemented [1] and has been in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to
>>>>>>>>>>>>>>>>>>> the doc.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular

Re: Proposal: Support for views in Iceberg

Posted by Jacques Nadeau <ja...@gmail.com>.
The original proposal was more focused on executable plans (as opposed to
the logical plans). However, I'm arguing that it should include both
logical and physical since the two are fairly heavily overlapping. (See how
much logic is in the base classes for most Calcite operators as opposed to
the concrete logical or physical variations.) If other people agree with my
proposal that Arrow IR should be able to express both logical and physical,
then my hope was it could potentially serve the purposes discussed here for
an engine-agnostic version of a view.

+Phillip Cloud <cp...@gmail.com> here as he is actively working on the
Arrow IR stuff (i've also suggested he subscribe to iceberg dev).



On Thu, Aug 26, 2021 at 1:51 PM Micah Kornfield <em...@gmail.com>
wrote:

> I'll post a link to this thread on dev@arrow so people not subscribed to
> both can weigh in.  Should we maybe start a new thread about IR?
>
>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit
> is the document that Wes wrote up that started the conversation.
>
> On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Micah or someone more familiar (Jacques?), could you summarize some of
>> the details of the Arrow IR proposal? What are the intended use cases and
>> goals? We'd want to see whether the goals align with our own.
>>
>> On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> Small tangent:
>>>
>>> Regarding an IR, Apache Arrow is looking at this now [1] with several
>>> RFC proposals.  It would be nice to coordinate, at least to make sure
>>> translation from one IR to another isn't too onerous (in an ideal world we
>>> could maybe share the same one).  Either way external feedback would be
>>> useful.
>>>
>>> Cheers,
>>> Micah
>>>
>>>
>>> [1]
>>> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=EjD0QpL9LYGYnq2jaGXo+LfeR5uw@mail.gmail.com%3e
>>>
>>> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> I think that the current proposal is looking good, but it is always a
>>>> good idea to give people a few days to review it and bring up any issues or
>>>> further discussion on the topics in this thread.
>>>>
>>>> I'll also add this to the next sync agenda so we can farm for dissent
>>>> next week. Sometimes you can get a better read on what is still concerning
>>>> in person.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>>>> <an...@netflix.com.invalid> wrote:
>>>>
>>>>> Hello All,
>>>>> I answered all the comments and made changes to the spec to reflect
>>>>> them. The doc now shows the new revision (link here:
>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>>>> ).
>>>>> Please diff with the version named 'Spec v1' in order to see the
>>>>> deltas. Please let me know if any of your comments are not
>>>>> satisfactorily addressed.
>>>>>
>>>>> @Ryan, Thank you for the insightful feedback, especially around the
>>>>> use of Iceberg data types for schema and possible evolution.
>>>>> I agree with your comment: *"I think we should move forward with the
>>>>> proposal as it is, and pursue type annotations and possibly IR in
>>>>> parallel."*
>>>>> How do we achieve consensus/farm for dissent and move forward with the
>>>>> proposal?
>>>>>
>>>>> thanks,
>>>>> Anjali.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>>>> discussion here with everyone.
>>>>>>>
>>>>>>> For the discussion about SQL dialect, I think that the right first
>>>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>>>>>
>>>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>>>> but can be added later. We will always need to record the SQL and dialect,
>>>>>>> even if we translate to IR because users are going to configure views using
>>>>>>> SQL. Uses like showing view history or debugging need to show the original
>>>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>>>> dialect. We should be able to add this later by adding additional metadata
>>>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>>>> changes to add a common representation that can be optionally consumed.
>>>>>>>
>>>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>>>> have to compromise confidence. IR is a way to have strong confidence, but
>>>>>>> with limited expressibility. I think that’s a good trade in a lot of cases
>>>>>>> and is worth pursuing, even if it will take a long time.
>>>>>>>
>>>>>>> Jacques makes a great point about types, but I think that the right
>>>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>>>> discussions about whether Iceberg should support annotating types with
>>>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>>>> also providing compatibility across engines: char(n) is not
>>>>>>> necessarily supported everywhere and mapping it to string will make
>>>>>>> sense in most places. The schema is primarily used to validate that the
>>>>>>> data produced by the query hasn’t changed and that is more about the number
>>>>>>> of columns in structs and the names of fields rather than exact types. We
>>>>>>> can fix up types when substituting without losing too much: if the SQL
>>>>>>> produces a varchar(10) field that the view metadata says is a string,
>>>>>>> then it’s okay that it is varchar(10). There is some loss in that
>>>>>>> we don’t know if it was originally varchar(5), but I think that
>>>>>>> this is not going to cause too many issues. Not all engines will even
>>>>>>> validate that the schema has not changed, since it could be valid to use select
>>>>>>> * from x where ... and allow new fields to appear.
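[Editor's note: the lossy-but-compatible type mapping Ryan describes could be sketched as follows. This is illustrative code under the assumption that engine character types collapse to Iceberg's string type; it is not Iceberg's actual type logic.]

```python
import re

def to_iceberg_type(engine_type: str) -> str:
    # char(n) and varchar(n) collapse to Iceberg's string type; the
    # length parameter is not preserved, which is the acceptable loss
    # described above (varchar(5) vs varchar(10) is forgotten).
    if re.fullmatch(r"(var)?char\(\d+\)", engine_type.strip().lower()):
        return "string"
    return engine_type.strip().lower()

def schema_compatible(view_type: str, produced_type: str) -> bool:
    # Validation compares the type the query actually produces, after
    # mapping, against the type stored in the view metadata.
    return view_type == to_iceberg_type(produced_type)
```

Under this mapping, a view column recorded as `string` accepts a query that now produces `varchar(10)`, matching the substitution behavior described in the message above.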
>>>>>>>
>>>>>>> Right now, I think we should move forward with the proposal as it
>>>>>>> is, and pursue type annotations and possibly IR in parallel. Does that
>>>>>>> sound reasonable to everyone?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi Anjali,
>>>>>>>>
>>>>>>>> That's a nice summary.
>>>>>>>>
>>>>>>>> re dialect field. It shouldn't be much trouble to have it (or any
>>>>>>>> other way to identify application that created the view), and it might be
>>>>>>>> useful.
>>>>>>>> Why not make it required from the start?
>>>>>>>>
>>>>>>>> re "expanded/resolved SQL" -- i don't understand yet what we would
>>>>>>>> put there, so cannot comment.
>>>>>>>>
>>>>>>>> I agree that it's nice to get something out of the door, and I see
>>>>>>>> how the current proposal fits some needs already.
>>>>>>>> However, i am concerned about the proliferation of non-cross-engine
>>>>>>>> compatible views, if we do that.
>>>>>>>>
>>>>>>>> Also, if we later agree on any compatible approach (portable subset
>>>>>>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>>>> engine, it would be a breaking change.
>>>>>>>> Unless we make the compatible approach as expressive as full power
>>>>>>>> of SQL, some views that are possible to create in v1 will not be possible
>>>>>>>> to create in v2.
>>>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely
>>>>>>>> compatible", we may not be able to roll it out.
>>>>>>>>
>>>>>>>> > the convention of common SQL has been working for a majority of
>>>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>>>>> few users occasionally would like to use Trino or Spark specific functions
>>>>>>>> but are sometimes able to find a way to use a function that is common to
>>>>>>>> both the engines.
>>>>>>>>
>>>>>>>>
>>>>>>>> it's an awesome summary of what constructs are necessary to be able
>>>>>>>> to define useful views, while also keep them portable.
>>>>>>>>
>>>>>>>> To be able to express column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function in a
>>>>>>>> structured IR, how much effort do you think would be required?
>>>>>>>> We didn't really talk about downsides of a structured approach,
>>>>>>>> other than it looks complex.
>>>>>>>> if we indeed estimate it as a multi-year effort, I wouldn't argue
>>>>>>>> for that. Maybe I was overly optimistic though.
>>>>>>>>
>>>>>>>>
>>>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>>>> to be consumed by multiple engines, we may be better served with approach
>>>>>>>> that's outside of Iceberg spec, like
>>>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thank you for all the comments. I will try to address them all
>>>>>>>>> here together.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>>>    SQL are not enforced, but are advised as best practices. The convention of
>>>>>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic
>>>>>>>>>    and hence were picked for storing view schema. Thinking further, the schema
>>>>>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>>>    objects such as tables from other catalogs or elasticsearch tables etc. I
>>>>>>>>>    will clarify it in the spec.
>>>>>>>>>
>>>>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>>>>> the next revision available for review soon.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Anjali.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>>>>
>>>>>>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>>>>> I.e. it would be useful if there was some external app (or a
>>>>>>>>>> human being) that would write those alternative SQLs for each dialect.
>>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and
>>>>>>>>>> Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>>
>>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>>> While i like the idea of starting with Trino SQL ANTLR grammar
>>>>>>>>>> file (it's really well written and resembles spec quite well), you made a
>>>>>>>>>> good point that grammar is only part of the problem. Coercions, function
>>>>>>>>>> resolution, dereference resolution, table resolution are part of query
>>>>>>>>>> analysis that goes beyond just grammar.
>>>>>>>>>> In fact, column scoping rules -- while clearly defined by the
>>>>>>>>>> spec -- may easily differ between engines (pretty usual).
>>>>>>>>>> That's why i would rather lean towards some intermediate
>>>>>>>>>> representation that is *not* SQL, doesn't require parsing (is
>>>>>>>>>> already structural), nor analysis (no scopes! no implicit coercions!).
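[Editor's note: to make the "already structural, no scopes, no implicit coercions" idea concrete, a minimal IR might look like the sketch below. The node shapes are invented for illustration and are not a proposal.]

```python
from dataclasses import dataclass

@dataclass
class TableScan:
    table: str        # fully qualified, already resolved at write time

@dataclass
class Filter:
    child: object
    predicate: dict   # explicit, e.g. {"op": "eq", "col": ..., "lit": ...}

@dataclass
class Project:
    child: object
    columns: list     # explicit column list: no '*' left to expand

# "SELECT id, amount FROM prod.db.sales WHERE region = 'EU'" stored as
# structure rather than text: nothing to parse, no scopes to resolve,
# no implicit coercions to re-derive when the view is read.
plan = Project(
    child=Filter(
        child=TableScan(table="prod.db.sales"),
        predicate={"op": "eq", "col": "region", "lit": "EU"},
    ),
    columns=["id", "amount"],
)
```

An engine consuming such a plan walks the structure directly, which is why cross-engine compatible parsing and analysis would no longer be needed.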
>>>>>>>>>> Before we embark on such a journey, it would be interesting to
>>>>>>>>>> hear @Martin Traverso <ma...@starburstdata.com> 's thoughts on
>>>>>>>>>> feasibility though.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Some thoughts...
>>>>>>>>>>>
>>>>>>>>>>>    - In general, many engines want (or may require) a resolved
>>>>>>>>>>>    sql field. This--at minimum--typically includes star expansion since
>>>>>>>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>>>>>>    both of these things.
>>>>>>>>>>>    - It isn't clear in the spec whether the table references
>>>>>>>>>>>    used in views are restricted to other Iceberg objects or can be arbitrary
>>>>>>>>>>>    objects in the context of a particular engine. Maybe I missed this? For
>>>>>>>>>>>    example, can I have a Trino engine view that references an Elasticsearch
>>>>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead
>>>>>>>>>>>    to unintended consequences. I appreciate the attraction to it but I think
>>>>>>>>>>>    it would either create artificial barriers around the types of SQL that are
>>>>>>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>>>    libraries and not a specification. There are things that can be specified
>>>>>>>>>>>    in Calcite but in general it doesn't have formal specification as a first
>>>>>>>>>>>    principle. It is more implementation as a first principle. This is in
>>>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with well-formed
>>>>>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>>>>>    fully resolved/comprehended.
>>>>>>>>>>>
>>>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>>>>>> of those two domains.)
>>>>>>>>>>>
>>>>>>>>>>> Some other notes:
>>>>>>>>>>>
>>>>>>>>>>>    - Calcite does provide a nice reference document [3] but it
>>>>>>>>>>>    is not sufficient to implement what is necessary for
>>>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>>>    - Projects like Coral [4] are interesting here but even
>>>>>>>>>>>    Coral is based roughly on "HiveQL" which also doesn't have a formal
>>>>>>>>>>>    specification process outside of the Hive version you're running. See this
>>>>>>>>>>>    thread in Coral slack [5]
>>>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>>>>    closer to specification-based [7] than Calcite but is much less popular in
>>>>>>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>>>>>>    a strength of Calcite.
>>>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> [2]
>>>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>>>> [5]
>>>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>>>> [7]
>>>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>>>> [8]
>>>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
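Jacques's CHAR(3)/VARCHAR(X) point above can be made concrete with a small sketch. This is purely illustrative, assuming a mapping function that collapses engine character types to Iceberg's string; none of these names come from the spec:

```python
# Illustrative only: how engine types like CHAR(3)/VARCHAR(10) might be
# normalized to Iceberg types when a view schema is stored. The mapping
# and function name are assumptions, not part of the Iceberg spec.
import re

def to_iceberg_type(engine_type: str) -> str:
    t = engine_type.strip().lower()
    # char(n) and varchar(n) have no Iceberg equivalent; both collapse to
    # string, which loses the length information Jacques points out.
    if re.match(r"^(var)?char\s*\(\s*\d+\s*\)$", t):
        return "string"
    passthrough = {"boolean", "int", "long", "float", "double",
                   "date", "timestamp", "string", "uuid", "binary"}
    if t in passthrough:
        return t
    raise ValueError(f"no mapping for engine type: {engine_type}")

# 'foo' as c2 resolves to CHAR(3) in many engines, but is stored as string:
assert to_iceberg_type("CHAR(3)") == "string"
assert to_iceberg_type("varchar(10)") == "string"
```

The loss of length information is exactly why a view's stored schema can disagree with what the engine would infer for the same query as a CTE.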
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>>>>
>>>>>>>>>>>> I would personally prefer cross-engine interoperability as a
>>>>>>>>>>>> goal and get the spec in the right structure in the initial release,
>>>>>>>>>>>> because:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that already
>>>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this view
>>>>>>>>>>>> feature. I can do what is now done for Trino to use views with Iceberg. I
>>>>>>>>>>>> can also just use a table property to indicate the table is a view and
>>>>>>>>>>>> store the view SQL as a table property and do my own thing in any query
>>>>>>>>>>>> engine to support all the view features. One of the most valuable and
>>>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to be
>>>>>>>>>>>> created in one engine and read by another. Not supporting cross-engine
>>>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the view SQL
>>>>>>>>>>>> that can come from any query engine. If the engine already needs to convert
>>>>>>>>>>>> the view schema to Iceberg schema, it should just do the same for the view
>>>>>>>>>>>> SQL.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>>>> Apache Calcite (or some other third party alternative I don't know) or our
>>>>>>>>>>>> own implementation of some intermediate representation. I don't have a very
>>>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with
>>>>>>>>>>>> this kind of issue, but my personal concern is that the integration is
>>>>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested,
>>>>>>>>>>>> is to have a native lightweight logical query structure representation of
>>>>>>>>>>>> the view SQL and store that instead of the SQL string. We already deal with
>>>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>>>>> instead of just a single string.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jack Ye
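Jack's third point, an "engine-sqls" map with a default fallback, might look roughly like this (field names and engine keys are hypothetical, not part of any proposed spec):

```python
# Sketch of the "engine-sqls" map idea: key is engine type + version, value
# is that engine's SQL, with a fallback entry used when the reading engine
# did not create the view. All names here are invented for illustration.
view_definition = {
    "schema": ["event_id", "event_count"],
    "engine-sqls": {
        "spark-3.1": "SELECT event_id, count(*) AS event_count FROM db.events GROUP BY event_id",
        "trino-360": "SELECT event_id, count(*) AS event_count FROM db.events GROUP BY event_id",
        "default":   "SELECT event_id, count(*) AS event_count FROM db.events GROUP BY event_id",
    },
}

def sql_for_engine(view: dict, engine_key: str) -> str:
    sqls = view["engine-sqls"]
    # Fall back to the default (intermediate) representation when the
    # engine's key is not found in the map.
    return sqls.get(engine_key, sqls["default"])

assert sql_for_engine(view_definition, "spark-3.1").startswith("SELECT")
```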
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>>>> view-related work!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may
>>>>>>>>>>>>> not be a primary feature today, but I am concerned that not thinking about
>>>>>>>>>>>>> this from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature
>>>>>>>>>>>>> that it is hard to just let it pass.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>>>>> However, a new IR being focused on a compatible
>>>>>>>>>>>>> representation, and not being tied to anything existing, are actually good things.
>>>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for this
>>>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>>>
>>>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's interesting. Anjali, do you have means to enforce that,
>>>>>>>>>>>>> or is this just a convention?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>>>>>> views?
>>>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>>>> representation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> PF
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>>>> first list once the spec has been approved by the community. My hope is
>>>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>>>> desirable while storing Calcite makes it hard for eg python to use views. A
>>>>>>>>>>>>>> project to make a cross language and cross engine IR for sql views and the
>>>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine
>>>>>>>>>>>>>> agnostic views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks to everybody who has offered help with
>>>>>>>>>>>>>>> implementation. The spec as it is proposed in the doc has been implemented
>>>>>>>>>>>>>>> and is in use at Netflix (currently on Iceberg 0.9). Once we close the
>>>>>>>>>>>>>>> spec, we will rebase our code to Iceberg-0.12 and incorporate changes to
>>>>>>>>>>>>>>> format and other feedback from the community and should be able to make
>>>>>>>>>>>>>>> this MVP implementation available quickly as a PR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A few areas that we have not yet worked on and would love
>>>>>>>>>>>>>>> for the community to help are:
>>>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>>>> version or time
>>>>>>>>>>>>>>> 2. History table: A system table implementation for
>>>>>>>>>>>>>>> $versions similar to the $snapshots table in order to display the history
>>>>>>>>>>>>>>> of a view
>>>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>>>> previous version
>>>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
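A toy sketch of how items 1-3 above (time travel, a $versions history table, rollback) could hang together, assuming a version list modeled loosely on table snapshots; the structure is invented for illustration:

```python
# Hypothetical view-version history, in the spirit of items 1-3 above.
# The record shapes mirror how table snapshots work but are NOT the spec.
versions = [
    {"version-id": 1, "timestamp-ms": 1626000000000, "sql": "SELECT a FROM t"},
    {"version-id": 2, "timestamp-ms": 1626100000000, "sql": "SELECT a, b FROM t"},
]

def rollback(versions, to_version_id):
    # Rolling back appends a new version whose definition copies an old
    # one, preserving history (like table rollback in Iceberg).
    old = next(v for v in versions if v["version-id"] == to_version_id)
    new = dict(old, **{"version-id": versions[-1]["version-id"] + 1})
    versions.append(new)
    return new["version-id"]

new_id = rollback(versions, 1)
assert new_id == 3
assert versions[-1]["sql"] == "SELECT a FROM t"
```

A $versions system table would essentially expose the `versions` list above as rows, much like $snapshots does for tables.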
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One comment that is worth a broader discussion is the
>>>>>>>>>>>>>>> dialect of the SQL stored in the view metadata. The purpose of the spec is
>>>>>>>>>>>>>>> to provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question and is
>>>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect
>>>>>>>>>>>>>>> of the SQL. This allows for any desired dialect, but no improved
>>>>>>>>>>>>>>> cross-engine operability
>>>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given that the view metadata is json, it is easily
>>>>>>>>>>>>>>> extendable to incorporate any new fields needed to make the SQL truly
>>>>>>>>>>>>>>> compatible across engines.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>> Anjali
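For reference, option 2 above (an optional dialect field) might look something like this in the JSON view metadata. All field names here are assumptions for illustration, not the actual spec:

```python
import json

# One way an optional "dialect" field could sit next to the SQL in the
# JSON view metadata. Field names are illustrative only.
view_metadata = {
    "format-version": 1,
    "current-version-id": 1,
    "versions": [{
        "version-id": 1,
        "representations": [{
            "type": "sql",
            "sql": "SELECT id, name FROM db.users WHERE active",
            "dialect": "spark",   # optional: dialect the view was created in
        }],
    }],
}

# Being JSON, the metadata is easy to extend later with more fields
# (e.g. an IR representation) without breaking old readers.
round_tripped = json.loads(json.dumps(view_metadata))
assert round_tripped["versions"][0]["representations"][0]["dialect"] == "spark"
```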
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>>>> Of course, this is by no means usable by other query
>>>>>>>>>>>>>>>> engines.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anjali, your document does not talk much about
>>>>>>>>>>>>>>>> compatibility between query engines.
>>>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>>>> For example, it could represent logical structure of
>>>>>>>>>>>>>>>> operations in semantics manner.
>>>>>>>>>>>>>>>> This would eliminate need for cross-engine compatible
>>>>>>>>>>>>>>>> parsing and analysis.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>> PF
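Piotr's idea of a structured, non-SQL intermediate representation can be illustrated with a toy logical-plan tree; the node shapes are invented:

```python
# A toy illustration of a view stored as a structured logical plan
# (already parsed: no grammar, no scoping rules, no implicit coercions)
# rather than as a SQL string. Node shapes are invented for illustration.
view_ir = {
    "op": "project",
    "columns": [{"name": "id"}, {"name": "total"}],
    "input": {
        "op": "filter",
        "predicate": {"op": "gt", "ref": "total", "literal": 100},
        "input": {"op": "scan", "table": "db.orders"},
    },
}

def referenced_tables(node):
    # The structured form makes analysis trivial: walk the tree, no parser
    # or cross-engine-compatible analyzer needed.
    if node["op"] == "scan":
        return [node["table"]]
    return referenced_tables(node["input"])

assert referenced_tables(view_ir) == ["db.orders"]
```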
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>>>> rymurr@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to chat more/help
>>>>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been implemented [1]
>>>>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to
>>>>>>>>>>>>>>>>>> the doc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Proposal: Support for views in Iceberg

Posted by Micah Kornfield <em...@gmail.com>.
I'll point a link to this thread on dev@arrow so people not subscribed to
both can weigh in. Should we maybe start a new thread about IR?

https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit
is the document that Wes wrote up that started the conversation.

On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <bl...@tabular.io> wrote:

> Micah or someone more familiar (Jacques?), could you summarize some of the
> details of the Arrow IR proposal? What are the intended use cases and
> goals? We'd want to see whether the goals align with our own.
>
> On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Small tangent:
>>
>> Regarding an IR, Apache Arrow is looking at this now [1] with several RFC
>> proposals.  It would be nice to coordinate, at least to make sure
>> translation from one IR to another isn't too onerous (in an ideal world we
>> could maybe share the same one).  Either way external feedback would be
>> useful.
>>
>> Cheers,
>> Micah
>>
>>
>> [1]
>> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=EjD0QpL9LYGYnq2jaGXo+LfeR5uw@mail.gmail.com%3e
>>
>> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I think that the current proposal is looking good, but it is always a
>>> good idea to give people a few days to review it and bring up any issues or
>>> further discussion on the topics in this thread.
>>>
>>> I'll also add this to the next sync agenda so we can farm for dissent
>>> next week. Sometimes you can get a better read on what is still concerning
>>> in person.
>>>
>>> Ryan
>>>
>>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>>> <an...@netflix.com.invalid> wrote:
>>>
>>>> Hello All,
>>>> I answered all the comments and made changes to the spec to reflect
>>>> them .. the doc now shows the new revision (link here:
>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>>> ).
>>>> Please diff with the version named 'Spec v1' in order to see the
>>>> deltas. Please let me know if any of your comments are not
>>>> satisfactorily addressed.
>>>>
>>>> @Ryan, Thank you for the insightful feedback, especially around the use
>>>> of Iceberg data types for schema and possible evolution.
>>>> I agree with your comment: *"I think we should move forward with the
>>>> proposal as it is, and pursue type annotations and possibly IR in
>>>> parallel."*
>>>> How do we achieve consensus/farm for dissent and move forward with the
>>>> proposal?
>>>>
>>>> thanks,
>>>> Anjali.
>>>>
>>>>
>>>>
>>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>>> discussion here with everyone.
>>>>>>
>>>>>> For the discussion about SQL dialect, I think that the right first
>>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>>>>
>>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>>> but can be added later. We will always need to record the SQL and dialect,
>>>>>> even if we translate to IR because users are going to configure views using
>>>>>> SQL. Uses like showing view history or debugging need to show the original
>>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>>> dialect. We should be able to add this later by adding additional metadata
>>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>>> changes to add a common representation that can be optionally consumed.
>>>>>>
>>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>>> have to compromise confidence. IR is a way to have strong confidence, but
>>>>>> with limited expressibility. I think that’s a good trade in a lot of cases
>>>>>> and is worth pursuing, even if it will take a long time.
>>>>>>
>>>>>> Jacques makes a great point about types, but I think that the right
>>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>>> discussions about whether Iceberg should support annotating types with
>>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>>> also providing compatibility across engines: char(n) is not
>>>>>> necessarily supported everywhere and mapping it to string will make
>>>>>> sense in most places. The schema is primarily used to validate that the
>>>>>> data produced by the query hasn’t changed and that is more about the number
>>>>>> of columns in structs and the names of fields rather than exact types. We
>>>>>> can fix up types when substituting without losing too much: if the SQL
>>>>>> produces a varchar(10) field that the view metadata says is a string,
>>>>>> then it’s okay that it is varchar(10). There is some loss in that we
>>>>>> don’t know if it was originally varchar(5), but I think that this is
>>>>>> not going to cause too many issues. Not all engines will even validate that
>>>>>> the schema has not changed, since it could be valid to use select *
>>>>>> from x where ... and allow new fields to appear.
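The lenient schema check described above (comparing field names and count rather than exact types) can be sketched as follows; this is illustrative only, not how any engine actually validates:

```python
# Sketch of lenient view-schema validation: compare field names and count,
# not exact types, so varchar(10) vs string does not invalidate a view.
def schema_unchanged(stored_fields, produced_fields):
    # stored_fields / produced_fields: lists of (name, type) tuples.
    if len(stored_fields) != len(produced_fields):
        return False
    return all(s[0] == p[0] for s, p in zip(stored_fields, produced_fields))

stored = [("id", "long"), ("name", "string")]
produced = [("id", "long"), ("name", "varchar(10)")]  # engine-specific type
assert schema_unchanged(stored, produced)
assert not schema_unchanged(stored, [("id", "long")])
```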
>>>>>>
>>>>>> Right now, I think we should move forward with the proposal as it is,
>>>>>> and pursue type annotations and possibly IR in parallel. Does that sound
>>>>>> reasonable to everyone?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>>> piotr@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi Anjali,
>>>>>>>
>>>>>>> That's a nice summary.
>>>>>>>
>>>>>>> re dialect field. It shouldn't be much trouble to have it (or any
>>>>>>> other way to identify application that created the view), and it might be
>>>>>>> useful.
>>>>>>> Why not make it required from the start?
>>>>>>>
>>>>>>> re "expanded/resolved SQL" -- I don't understand yet what we would
>>>>>>> put there, so cannot comment.
>>>>>>>
>>>>>>> I agree it's nice to get something out of the door, and I see
>>>>>>> how the current proposal fits some needs already.
>>>>>>> However, I am concerned about the proliferation of non-cross-engine
>>>>>>> compatible views, if we do that.
>>>>>>>
>>>>>>> Also, if we later agree on any compatible approach (portable subset
>>>>>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>>> engine, it would be a breaking change.
>>>>>>> Unless we make the compatible approach as expressive as full power
>>>>>>> of SQL, some views that are possible to create in v1 will not be possible
>>>>>>> to create in v2.
>>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely
>>>>>>> compatible", we may not be able to roll it out.
>>>>>>>
>>>>>>> > the convention of common SQL has been working for a majority of
>>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>>>> few users occasionally would like to use Trino or Spark specific functions
>>>>>>> but are sometimes able to find a way to use a function that is common to
>>>>>>> both the engines.
>>>>>>>
>>>>>>>
>>>>>>> it's an awesome summary of what constructs are necessary to be able
>>>>>>> to define useful views, while also keep them portable.
>>>>>>>
>>>>>>> To be able to express column projections, simple filter application,
>>>>>>> joins, grouping and common aggregate and scalar function in a structured
>>>>>>> IR, how much effort do you think would be required?
>>>>>>> We didn't really talk about downsides of a structured approach,
>>>>>>> other than it looks complex.
>>>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue
>>>>>>> for that. Maybe I was overly optimistic though.
>>>>>>>
>>>>>>>
>>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>>> to be consumed by multiple engines, we may be better served with approach
>>>>>>> that's outside of Iceberg spec, like
>>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thank you for all the comments. I will try to address them all here
>>>>>>>> together.
>>>>>>>>
>>>>>>>>
>>>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>>    SQL are not enforced, but are advised as best practices. The convention of
>>>>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>>    objects such as tables from other catalogs or elasticsearch tables etc. I
>>>>>>>>    will clarify it in the spec.
>>>>>>>>
>>>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>>>> the next revision available for review soon.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Anjali.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>>>
>>>>>>>>> I agree that tracking dialect/origin is better than nothing.
>>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>>>> I.e. it would be useful if there was some external app (or a human
>>>>>>>>> being) that would write those alternative SQLs for each dialect.
>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and
>>>>>>>>> Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>
>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>> While i like the idea of starting with Trino SQL ANTLR grammar
>>>>>>>>> file (it's really well written and resembles spec quite well), you made a
>>>>>>>>> good point that grammar is only part of the problem. Coercions, function
>>>>>>>>> resolution, dereference resolution, table resolution are part of query
>>>>>>>>> analysis that goes beyond just grammar.
>>>>>>>>> In fact, column scoping rules -- while clearly defined by the spec
>>>>>>>>> -- may easily differ between engines (pretty usual).
>>>>>>>>> That's why I would rather lean towards some intermediate
>>>>>>>>> representation that is *not* SQL, doesn't require parsing (is
>>>>>>>>> already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>>>> Before we embark on such a journey, it would be interesting to
>>>>>>>>> hear @Martin Traverso <ma...@starburstdata.com> 's thoughts on
>>>>>>>>> feasibility though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Some thoughts...
>>>>>>>>>>
>>>>>>>>>>    - In general, many engines want (or may require) a resolved
>>>>>>>>>>    sql field. This--at minimum--typically includes star expansion, since
>>>>>>>>>>    traditional view behavior is that stars are expanded at view creation time
>>>>>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>>>>>    both of these things.
>>>>>>>>>>    - It isn't clear in the spec whether the table references
>>>>>>>>>>    used in views are restricted to other Iceberg objects or can be arbitrary
>>>>>>>>>>    objects in the context of a particular engine. Maybe I missed this? For
>>>>>>>>>>    example, can I have a Trino engine view that references an Elasticsearch
>>>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead
>>>>>>>>>>    to unintended consequences. I appreciate the attraction to it but I think
>>>>>>>>>>    it would either create artificial barriers around the types of SQL that are
>>>>>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>>    libraries and not a specification. There are things that can be specified
>>>>>>>>>>    in Calcite, but in general it doesn't have a formal specification as a first
>>>>>>>>>>    principle; it is implementation as a first principle. This is in
>>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with well-formed
>>>>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>>>>    fully resolved/comprehended.
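The CHAR(3) concern above can be made concrete with a small sketch (purely illustrative; the type names and the mapping rule here are assumptions for the example, not part of any Iceberg spec). Collapsing engine types like char(n) and varchar(n) to Iceberg's string is lossy, so a view's stored schema can differ from what the engine would resolve for the same query written as a CTE:

```python
import re

# Hypothetical mapping from engine-specific SQL types to Iceberg types.
# char(n)/varchar(n) both collapse to "string", so the length information
# an engine (e.g. Calcite inferring CHAR(3) for the literal 'foo') cannot
# be round-tripped back out of the stored view schema.
def to_iceberg_type(engine_type: str) -> str:
    t = engine_type.strip().lower()
    if re.fullmatch(r"(var)?char\(\d+\)", t):
        return "string"  # lossy: length (and char vs varchar) is dropped
    return t

# "SELECT c1, 'foo' as c2 FROM table1" -- an engine may type c2 as char(3).
engine_resolved = {"c1": "bigint", "c2": "char(3)"}
view_schema = {c: to_iceberg_type(t) for c, t in engine_resolved.items()}
print(view_schema)  # {'c1': 'bigint', 'c2': 'string'}

# A CTE evaluated inline keeps char(3), so substituting the view for the
# CTE can change the query's resolved types.
print(view_schema == engine_resolved)  # False
```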
>>>>>>>>>>
>>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>>>>> of those two domains.)
>>>>>>>>>>
>>>>>>>>>> Some other notes:
>>>>>>>>>>
>>>>>>>>>>    - Calcite does provide a nice reference document [3] but it
>>>>>>>>>>    is not sufficient to implement what is necessary for
>>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>>    - Projects like Coral [4] are interesting here but even Coral
>>>>>>>>>>    is based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>>>>>    Coral slack [5]
>>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>>>    closer to specification based [7] than Calcite but is much less popular in
>>>>>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>>>>>    which is a strength of Calcite.
>>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>>> [5]
>>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>>> [7]
>>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>>> [8]
>>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>>>
>>>>>>>>>>> I would personally prefer cross-engine interoperability as a
>>>>>>>>>>> goal and get the spec in the right structure in the initial release,
>>>>>>>>>>> because:
>>>>>>>>>>>
>>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that already
>>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this view
>>>>>>>>>>> feature. I can do what is now done for Trino to use views with Iceberg. I
>>>>>>>>>>> can also just use a table property to indicate the table is a view and
>>>>>>>>>>> store the view SQL as a table property and do my own thing in any query
>>>>>>>>>>> engine to support all the view features. One of the most valuable and
>>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to be
>>>>>>>>>>> created in one engine and read by another. Not supporting cross-engine
>>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>>
>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the view SQL
>>>>>>>>>>> that can come from any query engine. If the engine already needs to convert
>>>>>>>>>>> the view schema to an Iceberg schema, it should just do the same for the view
>>>>>>>>>>> SQL.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>>> Apache Calcite (or some other third party alternative I don't know) or our
>>>>>>>>>>> own implementation of some intermediate representation. I don't have a very
>>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>>
>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with
>>>>>>>>>>> this kind of issue, but my personal concern is that the integration is
>>>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>>
>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is
>>>>>>>>>>> to have a native lightweight logical query structure representation of the
>>>>>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>>>>>
>>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>>>> instead of just a single string.
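As a rough illustration of this third option (the field names "engine-sqls", "default-sql", and the engine keys below are invented for the sketch and are not part of the proposed spec), the view definition could carry per-engine SQL with a fallback lookup:

```python
import json

# Hypothetical view-version metadata with an "engine-sqls" map keyed by
# engine type and version, as suggested above. An engine looks up its own
# key and falls back to a default representation when it is absent.
view_definition = {
    "schema": {"type": "struct", "fields": []},  # Iceberg schema, elided
    "engine-sqls": {
        "spark-3.1": "SELECT c1, c2 FROM db.table1",
        "trino-360": "SELECT c1, c2 FROM iceberg.db.table1",
    },
    "default-sql": "SELECT c1, c2 FROM db.table1",
}

def sql_for(engine_key: str, view: dict) -> str:
    # Fall back to the default when the engine's key is not in the map.
    return view["engine-sqls"].get(engine_key, view["default-sql"])

print(sql_for("trino-360", view_definition))
print(sql_for("presto-0.248", view_definition))  # unknown key: falls back
print(json.dumps(view_definition, indent=2))     # metadata stays plain JSON
```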
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>>> view-related work!
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may not
>>>>>>>>>>>> be a primary feature today, but I am concerned that not thinking about this
>>>>>>>>>>>> from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature that
>>>>>>>>>>>> it is hard to just let it pass.
>>>>>>>>>>>>
>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>>>> However, a new IR being focused on a compatible
>>>>>>>>>>>> representation, and not being tied to any existing engine, are actually good things.
>>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for this
>>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>>
>>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>
>>>>>>>>>>>> it's interesting. Anjali, do you have the means to enforce that,
>>>>>>>>>>>> or is this just a convention?
>>>>>>>>>>>>
>>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>>>>> views?
>>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>>> representation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> PF
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>>> first list once the spec has been approved by the community. My hope is
>>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>>> desirable, while storing a Calcite AST makes it hard for e.g. Python to use views. A
>>>>>>>>>>>>> project to make a cross language and cross engine IR for sql views and the
>>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>>>>>> engines that views are not (yet) cross-compatible. This is at odds with the
>>>>>>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A few areas that we have not yet worked on and would love for
>>>>>>>>>>>>>> the community to help are:
>>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>>> version or time
>>>>>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>>> previous version
>>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect
>>>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec is to
>>>>>>>>>>>>>> provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question and is
>>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect
>>>>>>>>>>>>>> of the SQL. This allows for any desired dialect, but no improved
>>>>>>>>>>>>>> cross-engine operability
>>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable
>>>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly compatible
>>>>>>>>>>>>>> across engines.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>> Anjali
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>>>>>> between query engines.
>>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>>>>>>> This would eliminate the need for cross-engine compatible
>>>>>>>>>>>>>>> parsing and analysis.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>>> rymurr@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to chat more/help
>>>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been implemented [1]
>>>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the
>>>>>>>>>>>>>>>>> doc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: Proposal: Support for views in Iceberg

Posted by Ryan Blue <bl...@tabular.io>.
Micah or someone more familiar (Jacques?), could you summarize some of the
details of the Arrow IR proposal? What are the intended use cases and
goals? We'd want to see whether the goals align with our own.

On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <em...@gmail.com>
wrote:

> Small tangent:
>
> Regarding an IR, Apache Arrow is looking at this now [1] with several RFC
> proposals.  It would be nice to coordinate, at least to make sure
> translation from one IR to another isn't too onerous (in an ideal world we
> could maybe share the same one).  Either way external feedback would be
> useful.
>
> Cheers,
> Micah
>
>
> [1]
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=EjD0QpL9LYGYnq2jaGXo+LfeR5uw@mail.gmail.com%3e
>
> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> I think that the current proposal is looking good, but it is always a
>> good idea to give people a few days to review it and bring up any issues or
>> further discussion on the topics in this thread.
>>
>> I'll also add this to the next sync agenda so we can farm for dissent
>> next week. Sometimes you can get a better read on what is still concerning
>> in person.
>>
>> Ryan
>>
>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>> <an...@netflix.com.invalid> wrote:
>>
>>> Hello All,
>>> I answered all the comments and made changes to the spec to reflect
>>> them. The doc now shows the new revision (link here:
>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>> ).
>>> Please diff with the version named 'Spec v1' in order to see the deltas.
>>> Please let me know if any of your comments are not
>>> satisfactorily addressed.
>>>
>>> @Ryan, Thank you for the insightful feedback, especially around the use
>>> of Iceberg data types for schema and possible evolution.
>>> I agree with your comment: *"I think we should move forward with the
>>> proposal as it is, and pursue type annotations and possibly IR in
>>> parallel."*
>>> How do we achieve consensus/farm for dissent and move forward with the
>>> proposal?
>>>
>>> thanks,
>>> Anjali.
>>>
>>>
>>>
>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:
>>>
>>>> +1
>>>>
>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>> discussion here with everyone.
>>>>>
>>>>> For the discussion about SQL dialect, I think that the right first
>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>>>
>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>> but can be added later. We will always need to record the SQL and dialect,
>>>>> even if we translate to IR because users are going to configure views using
>>>>> SQL. Uses like showing view history or debugging need to show the original
>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>> dialect. We should be able to add this later by adding additional metadata
>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>> changes to add a common representation that can be optionally consumed.
>>>>>
>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>> have to compromise confidence. IR is a way to have strong confidence, but
>>>>> with limited expressibility. I think that’s a good trade in a lot of cases
>>>>> and is worth pursuing, even if it will take a long time.
>>>>>
>>>>> Jacques makes a great point about types, but I think that the right
>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>> discussions about whether Iceberg should support annotating types with
>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>> also providing compatibility across engines: char(n) is not
>>>>> necessarily supported everywhere and mapping it to string will make
>>>>> sense in most places. The schema is primarily used to validate that the
>>>>> data produced by the query hasn’t changed and that is more about the number
>>>>> of columns in structs and the names of fields rather than exact types. We
>>>>> can fix up types when substituting without losing too much: if the SQL
>>>>> produces a varchar(10) field that the view metadata says is a string,
>>>>> then it’s okay that it is varchar(10). There is some loss in that we
>>>>> don’t know if it was originally varchar(5), but I think that this is
>>>>> not going to cause too many issues. Not all engines will even validate that
>>>>> the schema has not changed, since it could be valid to use select *
>>>>> from x where ... and allow new fields to appear.
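The structural check described here can be sketched in a few lines (a hedged illustration under assumed type names; real engines and the spec may validate differently): compare the query's schema to the stored view schema by field names and order, tolerating known lossy type mappings like varchar(n) against string.

```python
import re

def types_compatible(query_type: str, view_type: str) -> bool:
    # Identical types always match; a varchar(n)/char(n) produced by the
    # SQL is also fine against a stored Iceberg "string".
    if query_type == view_type:
        return True
    return (view_type == "string"
            and re.fullmatch(r"(var)?char\(\d+\)", query_type) is not None)

def schemas_match(query_schema: dict, view_schema: dict) -> bool:
    # Structure first: same field names, in the same order.
    if list(query_schema) != list(view_schema):
        return False
    return all(types_compatible(query_schema[n], view_schema[n])
               for n in view_schema)

stored = {"id": "long", "name": "string"}
print(schemas_match({"id": "long", "name": "varchar(10)"}, stored))  # True
print(schemas_match({"id": "long", "renamed": "string"}, stored))    # False
```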
>>>>>
>>>>> Right now, I think we should move forward with the proposal as it is,
>>>>> and pursue type annotations and possibly IR in parallel. Does that sound
>>>>> reasonable to everyone?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>> piotr@starburstdata.com> wrote:
>>>>>
>>>>>> Hi Anjali,
>>>>>>
>>>>>> That's a nice summary.
>>>>>>
>>>>>> re dialect field: it shouldn't be much trouble to have it (or any
>>>>>> other way to identify application that created the view), and it might be
>>>>>> useful.
>>>>>> Why not make it required from the start?
>>>>>>
>>>>>> re "expanded/resolved SQL" -- i don't understand yet what we would
>>>>>> put there, so cannot comment.
>>>>>>
>>>>>> I agree there it's nice to get something out of the door, and I see
>>>>>> how the current proposal fits some needs already.
>>>>>> However, I am concerned about the proliferation of non-cross-engine
>>>>>> compatible views, if we do that.
>>>>>>
>>>>>> Also, if we later agree on any compatible approach (portable subset
>>>>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>> engine, it would be a breaking change.
>>>>>> Unless we make the compatible approach as expressive as full power of
>>>>>> SQL, some views that are possible to create in v1 will not be possible to
>>>>>> create in v2.
>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely
>>>>>> compatible", we may not be able to roll it out.
>>>>>>
>>>>>> > the convention of common SQL has been working for a majority of
>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>>> few users occasionally would like to use Trino or Spark specific functions
>>>>>> but are sometimes able to find a way to use a function that is common to
>>>>>> both the engines.
>>>>>>
>>>>>>
>>>>>> It's an awesome summary of what constructs are necessary to be able
>>>>>> to define useful views, while also keep them portable.
>>>>>>
>>>>>> To be able to express column projections, simple filter application,
>>>>>> joins, grouping and common aggregate and scalar function in a structured
>>>>>> IR, how much effort do you think would be required?
>>>>>> We didn't really talk about downsides of a structured approach, other
>>>>>> than it looks complex.
>>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>>>>>> that. Maybe I was being overly optimistic though.
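For a sense of scale (a toy sketch, not a proposal; every node and field name here is invented), a structured IR covering just the constructs listed above -- scans, projections, filters, and joins -- can be small when expressed as plain JSON-serializable trees, with no engine parser or analyzer involved:

```python
import json

# Toy engine-agnostic IR: each node is a plain dict, so the whole plan
# serializes to JSON and needs no parsing or scope analysis to consume.
def scan(table):            return {"op": "scan", "table": table}
def filter_(child, pred):   return {"op": "filter", "input": child, "predicate": pred}
def project(child, cols):   return {"op": "project", "input": child, "columns": cols}
def join(left, right, on):  return {"op": "join", "left": left, "right": right, "on": on}

# "SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id
#  WHERE o.total > 100" expressed structurally:
plan = project(
    filter_(
        join(scan("orders"), scan("customers"),
             on={"op": "eq", "left": "orders.cid", "right": "customers.id"}),
        pred={"op": "gt", "left": "orders.total", "right": 100}),
    cols=["orders.id", "customers.name"])

serialized = json.dumps(plan)
print(json.loads(serialized) == plan)  # True: round-trips losslessly
```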
>>>>>>
>>>>>>
>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>> to be consumed by multiple engines, we may be better served with approach
>>>>>> that's outside of Iceberg spec, like
>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> PF
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thank you for all the comments. I will try to address them all here
>>>>>>> together.
>>>>>>>
>>>>>>>
>>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>    SQL are not enforced, but are advised as best practices. The convention of
>>>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>    objects such as tables from other catalogs or elasticsearch tables etc. I
>>>>>>>    will clarify it in the spec.
>>>>>>>
>>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>>> the next revision available for review soon.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Anjali.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>    libraries and not a specification. There are things that can be specified
>>>>>>>>>    in Calcite but in general it doesn't have formal specification as a first
>>>>>>>>>    principle. It is more implementation as a first principle. This is in
>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with well-formed
>>>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>>>    fully resolved/comprehended.
>>>>>>>>>
>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>> dialect (origin?) of the view and then each engine can decide how they want
>>>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>>>> of those two domains.)
>>>>>>>>>
>>>>>>>>> Some other notes:
>>>>>>>>>
>>>>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>>>>    not sufficient to implement what is necessary for
>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>    - Projects like Coral [4] are interesting here but even Coral
>>>>>>>>>    is based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>>>>    Coral slack [5]
>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>>    closer to specification-based [7] than Calcite but is much less popular in
>>>>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>>>>    a strength of Calcite.
>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> [2]
>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>> [5]
>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>> [7]
>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>> [8]
>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
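The CHAR(3)/VARCHAR(X) concern above can be made concrete with a sketch of a lossy type mapping; the function and the mapping table are invented for illustration and are not actual Iceberg code:

```python
# Hypothetical, lossy mapping from engine SQL types to Iceberg types.
def to_iceberg_type(engine_type: str) -> str:
    base = engine_type.split("(")[0].strip().upper()
    lossy = {"CHAR": "string", "VARCHAR": "string"}
    exact = {"INT": "int", "BIGINT": "long", "DOUBLE": "double"}
    if base in lossy:
        # The length parameter is dropped, so the mapping cannot be reversed.
        return lossy[base]
    return exact.get(base, engine_type.lower())

# 'foo' resolved by an engine as CHAR(3) becomes plain string:
assert to_iceberg_type("CHAR(3)") == "string"
# ...and is then indistinguishable from VARCHAR(10), which is why a view
# schema can end up differing from the schema of the equivalent CTE:
assert to_iceberg_type("VARCHAR(10)") == "string"
assert to_iceberg_type("BIGINT") == "long"
```

The one-way direction of the mapping is the whole problem: once the length parameter is gone, no engine can recover it when substituting the view.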
>>>>>>>>>
>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>>
>>>>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>>>>> and get the spec in the right structure in the initial release, because:
>>>>>>>>>>
>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that already
>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this view
>>>>>>>>>> feature. I can do what is now done for Trino to use views with Iceberg. I
>>>>>>>>>> can also just use a table property to indicate the table is a view and
>>>>>>>>>> store the view SQL as a table property and do my own thing in any query
>>>>>>>>>> engine to support all the view features. One of the most valuable and
>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to be
>>>>>>>>>> created in one engine and read by another. Not supporting cross-engine
>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>
>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the view SQL
>>>>>>>>>> that can come from any query engine. If the engine already needs to convert
>>>>>>>>>> the view schema to an Iceberg schema, it should just do the same for the view
>>>>>>>>>> SQL.
>>>>>>>>>>
>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>> Apache Calcite (or some other third party alternative I don't know) or our
>>>>>>>>>> own implementation of some intermediate representation. I don't have a very
>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>
>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>
>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is
>>>>>>>>>> to have a native lightweight logical query structure representation of the
>>>>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>>>> logical structure when traversing their own AST during a create view query.
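To make the structural-representation idea above concrete, here is a toy sketch in Python; every node name and field here is invented for illustration and is not part of any Iceberg spec:

```python
import json

# Invented node constructors; each node is a plain dict so the tree is
# structural from the start (no parsing, no name scopes, no coercions).
def scan(table):
    return {"op": "scan", "table": table}

def filter_(child, predicate):
    return {"op": "filter", "child": child, "predicate": predicate}

def project(child, columns):
    return {"op": "project", "child": child, "columns": columns}

# "SELECT c1, c2 FROM db.t WHERE c3 > 10" expressed as a structural tree:
view_ir = project(
    filter_(scan("db.t"), {"op": "gt", "ref": "c3", "literal": 10}),
    ["c1", "c2"],
)

# The tree round-trips through JSON, which is all the metadata file needs.
assert json.loads(json.dumps(view_ir)) == view_ir
```

An engine would build such a tree while walking its own AST at CREATE VIEW time, and rebuild its native plan from the tree when the view is read.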
>>>>>>>>>>
>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>>> instead of just a single string.
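A minimal sketch of the map-plus-fallback idea; the key format ("engine-version") and the field names are made up for this sketch, not the proposed spec:

```python
# Hypothetical view representation carrying a per-engine SQL map.
view_definition = {
    "engine-sqls": {
        "spark-3.1": "SELECT c1, c2 FROM db.t WHERE c3 > 10",
        "trino-360": "SELECT c1, c2 FROM db.t WHERE c3 > 10",
    },
}

def sql_for(view, engine_key):
    # The creating engine reads its own SQL back directly; any other
    # engine gets None and would fall back to the shared representation.
    return view["engine-sqls"].get(engine_key)

assert sql_for(view_definition, "spark-3.1").startswith("SELECT")
assert sql_for(view_definition, "flink-1.13") is None
```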
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>> view-related work!
>>>>>>>>>>>
>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may not
>>>>>>>>>>> be a primary feature today, but I am concerned that not thinking about this
>>>>>>>>>>> from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>> Cross-engine compatible views would be such a cool feature that
>>>>>>>>>>> it is hard to just let it pass.
>>>>>>>>>>>
>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>>> However, a new IR being focused on a compatible representation,
>>>>>>>>>>> and not being tied to anything, are actually good things.
>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for this
>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>
>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>
>>>>>>>>>>> It's interesting. Anjali, do you have the means to enforce that, or
>>>>>>>>>>> is this just a convention?
>>>>>>>>>>>
>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>>>> views?
>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>> representation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> PF
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>
>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>> first list once the spec has been approved by the community. My hope is
>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding SQL dialects, my personal opinion is: enforcing
>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>> desirable, while storing Calcite ASTs makes it hard for e.g. Python to use views. A
>>>>>>>>>>>> project to make a cross language and cross engine IR for sql views and the
>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>>>>> project and slow down adoption of the view spec in Iceberg. So I think the
>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>>>>>> feels about it. I would hope that we can make the view spec engine-free
>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A few areas that we have not yet worked on and would love for
>>>>>>>>>>>>> the community to help are:
>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>> version or time
>>>>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>> previous version
>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect
>>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec is to
>>>>>>>>>>>>> provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question and is
>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>>>>>>> operability
>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable
>>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly compatible
>>>>>>>>>>>>> across engines.
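As a rough illustration of that JSON extensibility, here is a hypothetical versioned view-metadata document; all keys are invented for this sketch and are not the proposed spec:

```python
import json

# Invented keys, loosely echoing Iceberg table metadata.
view_metadata = {
    "format-version": 1,
    "current-version-id": 2,
    "versions": [
        {"version-id": 1, "sql": "SELECT c1 FROM db.t", "dialect": "spark"},
        {"version-id": 2, "sql": "SELECT c1 FROM db.t", "dialect": "trino"},
    ],
}

# Adding an optional field later (say, an IR payload) is just another key,
# so older readers that ignore unknown fields keep working.
restored = json.loads(json.dumps(view_metadata))
current = next(v for v in restored["versions"]
               if v["version-id"] == restored["current-version-id"])
assert current["dialect"] == "trino"
```

Keeping the full versions list around is also what makes view history, time travel, and rollback cheap to layer on top.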
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Anjali
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>>>>> between query engines.
>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>> For example, it could represent logical structure of
>>>>>>>>>>>>>> operations in semantics manner.
>>>>>>>>>>>>>> This would eliminate need for cross-engine compatible parsing
>>>>>>>>>>>>>> and analysis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>> rymurr@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to chat more/help
>>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been implemented [1]
>>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the
>>>>>>>>>>>>>>>> doc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Re: Proposal: Support for views in Iceberg

Posted by Micah Kornfield <em...@gmail.com>.
Small tangent:

Regarding an IR, Apache Arrow is looking at this now [1] with several RFC
proposals.  It would be nice to coordinate, at least to make sure
translation from one IR to another isn't too onerous (in an ideal world we
could maybe share the same one).  Either way external feedback would be
useful.

Cheers,
Micah


[1]
https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=EjD0QpL9LYGYnq2jaGXo+LfeR5uw@mail.gmail.com%3e

On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <bl...@tabular.io> wrote:

> I think that the current proposal is looking good, but it is always a good
> idea to give people a few days to review it and bring up any issues or
> further discussion on the topics in this thread.
>
> I'll also add this to the next sync agenda so we can farm for dissent next
> week. Sometimes you can get a better read on what is still concerning in
> person.
>
> Ryan
>
> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
> <an...@netflix.com.invalid> wrote:
>
>> Hello All,
>> I answered all the comments and made changes to the spec to reflect
>> them; the doc now shows the new revision (link here:
>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>> ).
>> Please diff with the version named 'Spec v1' in order to see the deltas.
>> Please let me know if any of your comments are not
>> satisfactorily addressed.
>>
>> @Ryan, Thank you for the insightful feedback, especially around the use
>> of Iceberg data types for schema and possible evolution.
>> I agree with your comment: *"I think we should move forward with the
>> proposal as it is, and pursue type annotations and possibly IR in
>> parallel."*
>> How do we achieve consensus/farm for dissent and move forward with the
>> proposal?
>>
>> thanks,
>> Anjali.
>>
>>
>>
>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:
>>
>>> +1
>>>
>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>> discussion here with everyone.
>>>>
>>>> For the discussion about SQL dialect, I think that the right first step
>>>> is to capture the SQL or query dialect. That will give us the most
>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>>
>>>> I think that the idea to create a minimal IR is an interesting one, but
>>>> can be added later. We will always need to record the SQL and dialect, even
>>>> if we translate to IR because users are going to configure views using SQL.
>>>> Uses like showing view history or debugging need to show the original SQL,
>>>> plus relevant information like where it was created and the SQL dialect. We
>>>> should be able to add this later by adding additional metadata to the view
>>>> definition. I don’t think that it would introduce breaking changes to add a
>>>> common representation that can be optionally consumed.
>>>>
>>>> Let’s continue talking about a minimal IR, separately. View translation
>>>> is a hard problem. Right now, to get views across engines we have to
>>>> compromise confidence. IR is a way to have strong confidence, but with
>>>> limited expressibility. I think that’s a good trade in a lot of cases and
>>>> is worth pursuing, even if it will take a long time.
>>>>
>>>> Jacques makes a great point about types, but I think that the right
>>>> option here is to continue using Iceberg types. We’ve already had
>>>> discussions about whether Iceberg should support annotating types with
>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>> also providing compatibility across engines: char(n) is not
>>>> necessarily supported everywhere and mapping it to string will make
>>>> sense in most places. The schema is primarily used to validate that the
>>>> data produced by the query hasn’t changed and that is more about the number
>>>> of columns in structs and the names of fields rather than exact types. We
>>>> can fix up types when substituting without losing too much: if the SQL
>>>> produces a varchar(10) field that the view metadata says is a string,
>>>> then it’s okay that it is varchar(10). There is some loss in that we
>>>> don’t know if it was originally varchar(5), but I think that this is
>>>> not going to cause too many issues. Not all engines will even validate that
>>>> the schema has not changed, since it could be valid to use select *
>>>> from x where ... and allow new fields to appear.
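A sketch of the name-based validation described here, assuming schemas are simple (name, type) lists; this is not actual Iceberg code:

```python
def schema_compatible(view_schema, query_schema):
    """The query must still produce the same number of columns with the
    same names; parameterized type details like varchar(10) are ignored."""
    if len(view_schema) != len(query_schema):
        return False
    return all(v[0] == q[0] for v, q in zip(view_schema, query_schema))

view_schema = [("id", "long"), ("name", "string")]

# varchar(10) where the metadata says string: still the same view shape.
assert schema_compatible(view_schema, [("id", "long"), ("name", "varchar(10)")])
# A renamed column is a real change and should fail validation.
assert not schema_compatible(view_schema, [("id", "long"), ("label", "string")])
```

An engine that allows `select *` views would skip even this check, since new columns appearing is valid there.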
>>>>
>>>> Right now, I think we should move forward with the proposal as it is,
>>>> and pursue type annotations and possibly IR in parallel. Does that sound
>>>> reasonable to everyone?
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>> piotr@starburstdata.com> wrote:
>>>>
>>>>> Hi Anjali,
>>>>>
>>>>> That's a nice summary.
>>>>>
>>>>> re dialect field. It shouldn't be much trouble to have it (or any
>>>>> other way to identify the application that created the view), and it might
>>>>> be useful.
>>>>> Why not make it required from the start?
>>>>>
>>>>> re "expanded/resolved SQL" -- I don't understand yet what we would put
>>>>> there, so cannot comment.
>>>>>
>>>>> I agree that it's nice to get something out the door, and I see
>>>>> how the current proposal fits some needs already.
>>>>> However, I am concerned about the proliferation of non-cross-engine
>>>>> compatible views, if we do that.
>>>>>
>>>>> Also, if we later agree on any compatible approach (portable subset of
>>>>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>>>>> it would be a breaking change.
>>>>> Unless we make the compatible approach as expressive as full power of
>>>>> SQL, some views that are possible to create in v1 will not be possible to
>>>>> create in v2.
>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible",
>>>>> we may not be able to roll it out.
>>>>>
>>>>> > the convention of common SQL has been working for a majority of
>>>>> users. SQL features commonly used are column projections, simple filter
>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>> few users occasionally would like to use Trino or Spark specific functions
>>>>> but are sometimes able to find a way to use a function that is common to
>>>>> both the engines.
>>>>>
>>>>>
>>>>> it's an awesome summary of what constructs are necessary to be able to
>>>>> define useful views, while also keep them portable.
>>>>>
>>>>> To be able to express column projections, simple filter application,
>>>>> joins, grouping and common aggregate and scalar function in a structured
>>>>> IR, how much effort do you think would be required?
>>>>> We didn't really talk about the downsides of a structured approach, other
>>>>> than that it looks complex.
>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>>>>> that. Maybe I was being overly optimistic though.
>>>>>
>>>>>
>>>>> As Jack mentioned, for engine-specific approach that's not supposed to
>>>>> be consumed by multiple engines, we may be better served with approach
>>>>> that's outside of Iceberg spec, like
>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>> <an...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thank you for all the comments. I will try to address them all here
>>>>>> together.
>>>>>>
>>>>>>
>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>>    anyone feel strongly otherwise?*
>>>>>>    - @Piotr As for common views at Netflix, the restrictions on SQL
>>>>>>    are not enforced, but are advised as best practices. The convention of
>>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>>    way to use a function that is common to both the engines.
>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>    - @Jacques Table references in the views can be arbitrary objects
>>>>>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>>>>>    clarify it in the spec.
>>>>>>
>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>> the next revision available for review soon.
>>>>>>
>>>>>> Regards,
>>>>>> Anjali.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>> piotr@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>
>>>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>> I.e. it would be useful if there was some external app (or a human
>>>>>>> being) that would write those alternative SQLs for each dialect.
>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino,
>>>>>>> or Trino writing SQL for Trino and Spark.
>>>>>>>
>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>> While i like the idea of starting with Trino SQL ANTLR grammar file
>>>>>>> (it's really well written and resembles spec quite well), you made a good
>>>>>>> point that grammar is only part of the problem. Coercions, function
>>>>>>> resolution, dereference resolution, table resolution are part of query
>>>>>>> analysis that goes beyond just grammar.
>>>>>>> In fact, column scoping rules -- while clearly defined by the spec
>>>>>>> -- may easily differ between engines (pretty usual).
>>>>>>> That's why i would rather lean towards some intermediate
>>>>>>> representation that is *not *SQL, doesn't require parsing (is
>>>>>>> already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>> Before we embark on such a journey, it would be interesting to hear @Martin
>>>>>>> Traverso <ma...@starburstdata.com> 's thoughts on feasibility
>>>>>>> though.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>
>>>>>>>> Some thoughts...
>>>>>>>>
>>>>>>>>    - In general, many engines want (or may require) a resolved sql
>>>>>>>>    field. This--at minimum--typically includes star expansion since
>>>>>>>>    traditional view behavior is that stars are expanded at view creation time
>>>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>>>    both of these things.
>>>>>>>>    - It isn't clear in the spec whether the table references used
>>>>>>>>    in views are restricted to other Iceberg objects or can be arbitrary
>>>>>>>>    objects in the context of a particular engine. Maybe I missed this? For
>>>>>>>>    example, can I have a Trino engine view that references an Elasticsearch
>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>>>>    unintended consequences. I appreciate the attraction to it but I think it
>>>>>>>>    would either create artificial barriers around the types of SQL that are
>>>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>    libraries and not a specification. There are things that can be specified
>>>>>>>>    in Calcite but in general it doesn't have formal specification as a first
>>>>>>>>    principle. It is more implementation as a first principle. This is in
>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with well-formed
>>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>>    fully resolved/comprehended.
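As a concrete illustration of the resolved-SQL point in the first bullet above, here is a minimal sketch of freezing a `SELECT *` at view-creation time and qualifying the relative table name against the session catalog and namespace. The helper name and behavior are invented for illustration, not taken from any engine:

```python
def resolve_view_sql(columns, table, catalog, namespace):
    """Expand `SELECT *` into explicit columns and qualify the table name.

    `columns` is the table's column list captured at view-creation time;
    freezing it here is what keeps the view's logical definition stable
    even if the underlying table later gains columns.
    """
    qualified = f"{catalog}.{namespace}.{table}"
    return f"SELECT {', '.join(columns)} FROM {qualified}"

# "SELECT * FROM t" created with session catalog "prod" and namespace "db"
resolved = resolve_view_sql(["c1", "c2"], "t", "prod", "db")
# resolved == "SELECT c1, c2 FROM prod.db.t"
```

This is the kind of "resolved sql" field an engine might store alongside the original text, so later reads do not depend on the creating session's context.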
>>>>>>>>
>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>>> of those two domains.)
>>>>>>>>
>>>>>>>> Some other notes:
>>>>>>>>
>>>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>>>    not sufficient to implement what is necessary for
>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>    - Projects like Coral [4] are interesting here but even Coral
>>>>>>>>    is based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>>>    Coral slack [5]
>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>    closer to specification based [7] than Calcite but is much less popular in
>>>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>>>    a strength of Calcite.
>>>>>>>>    - One of the other problems with building against an
>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> [2]
>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>> [5]
>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>> [8]
>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>>
>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>
>>>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>>>> and get the spec in the right structure in the initial release, because:
>>>>>>>>>
>>>>>>>>> 1. I believe that cross-engine compatibility is a critical feature
>>>>>>>>> of Iceberg. If I am a user of an existing data lake that already supports
>>>>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>>>>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>>>>>> just use a table property to indicate the table is a view and store the
>>>>>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>>>>>> support all the view features. One of the most valuable and unique features
>>>>>>>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>>>>>>>> and read by another. Not supporting cross-engine compatibility feels like
>>>>>>>>> losing a lot of value to me.
>>>>>>>>>
>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the view SQL
>>>>>>>>> that can come from any query engine. If the engine already needs to convert
>>>>>>>>> the view schema to Iceberg schema, it should just do the same for the view
>>>>>>>>> SQL.
>>>>>>>>>
>>>>>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>>>>>> implementation of some intermediate representation. I don't have a very
>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>
>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>
>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is
>>>>>>>>> to have a native lightweight logical query structure representation of the
>>>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>>>
>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>> instead of just a single string.
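The "engine-sqls" map described above could look roughly like the following sketch. All field names here ("sql", "dialect", "engine-sqls") and the engine-key format are hypothetical, mirroring the discussion rather than any finalized spec:

```python
# Hypothetical view definition: per-engine SQL keyed by engine + version,
# with a default representation to fall back to when the reading engine's
# key is absent from the map.
view_definition = {
    "sql": "SELECT c1 FROM db.t",  # default / fallback representation
    "dialect": "ansi",
    "engine-sqls": {
        "spark-3.1": "SELECT c1 FROM spark_catalog.db.t",
        "trino-360": "SELECT c1 FROM iceberg.db.t",
    },
}

def sql_for_engine(view, engine_key):
    # Use the engine-specific SQL when present, else fall back to the default.
    return view["engine-sqls"].get(engine_key, view["sql"])

sql_for_engine(view_definition, "trino-360")   # engine-native SQL
sql_for_engine(view_definition, "flink-1.13")  # falls back to the default
```

The creating engine reads its own entry back without re-parsing, while an unknown engine still has a representation to try.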
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>> view-related work!
>>>>>>>>>>
>>>>>>>>>> I agree that solving the cross-engine compatibility problem may not
>>>>>>>>>> be a primary feature today, but I am concerned that not thinking about this
>>>>>>>>>> from the start may "tunnel" us into a wrong direction.
>>>>>>>>>> Cross-engine compatible views would be such a cool feature that
>>>>>>>>>> it is hard to just let it pass.
>>>>>>>>>>
>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>> However, a new IR being focused on a compatible representation,
>>>>>>>>>> and not being tied to anything, are actually good things.
>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for this
>>>>>>>>>> could be pretty simple.
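A toy sketch of what such a JSON-focused, already-structural IR could look like (every node and field name below is invented for illustration; no parsing, no scopes, no implicit coercions):

```python
# A view as a tree of relational operations, directly serializable as JSON.
view_ir = {
    "op": "project",
    "columns": [{"ref": "c1"}, {"ref": "c2"}],
    "input": {
        "op": "filter",
        "predicate": {"fn": "gt", "args": [{"ref": "c1"}, {"lit": 10}]},
        "input": {"op": "scan", "table": "db.t"},
    },
}

def to_sql(node):
    # Trivial "transpiler" back to one SQL dialect, to show the round trip
    # an engine would do when reading the view.
    if node["op"] == "scan":
        return node["table"]
    if node["op"] == "filter":
        left, right = node["predicate"]["args"]
        return f"(SELECT * FROM {to_sql(node['input'])} WHERE {left['ref']} > {right['lit']})"
    if node["op"] == "project":
        cols = ", ".join(c["ref"] for c in node["columns"])
        return f"SELECT {cols} FROM {to_sql(node['input'])}"
    raise ValueError(f"unknown op: {node['op']}")
```

Each engine would build this tree from its own AST at CREATE VIEW time and translate it back into its own plan or SQL when reading.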
>>>>>>>>>>
>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>
>>>>>>>>>> It's interesting. Anjali, do you have means to enforce that, or
>>>>>>>>>> is this just a convention?
>>>>>>>>>>
>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>>> views?
>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>> representation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>
>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>> first list once the spec has been approved by the community. My hope is
>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>
>>>>>>>>>>> Regarding SQL dialects: my personal opinion is that enforcing
>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>> desirable, while storing Calcite ASTs makes it hard for e.g. Python to use
>>>>>>>>>>> views. A project to make a cross-language and cross-engine IR for SQL views
>>>>>>>>>>> and the relevant transpilers is IMHO outside the scope of this spec and
>>>>>>>>>>> probably deserving of an Apache project of its own. A smaller IR like Piotr
>>>>>>>>>>> suggested is possible, but I feel it would likely quickly snowball into a
>>>>>>>>>>> larger project and slow down adoption of the view spec in Iceberg. So I
>>>>>>>>>>> think the most reasonable way forward is to add a dialect field and a
>>>>>>>>>>> warning to engines that views are not (yet) cross-compatible. This is at
>>>>>>>>>>> odds with the original spirit of Iceberg tables and I wonder how the
>>>>>>>>>>> broader community feels about it? I would hope that we can make the view
>>>>>>>>>>> spec engine-free over time and eventually deprecate the dialect field.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>>>
>>>>>>>>>>>> A few areas that we have not yet worked on and would love for
>>>>>>>>>>>> the community to help are:
>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>> version or time
>>>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>> previous version
>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
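Items 1-3 above all fall out of keeping a list of immutable view versions plus a current-version pointer in the metadata. A minimal sketch of that shape (field names are illustrative, not from the spec):

```python
# Versioned view metadata: each CREATE OR REPLACE appends a version;
# time travel and rollback only ever read or move the pointer.
view_metadata = {
    "current-version-id": 3,
    "versions": [
        {"version-id": 1, "timestamp-ms": 1626000000000, "sql": "SELECT a FROM db.t"},
        {"version-id": 2, "timestamp-ms": 1626100000000, "sql": "SELECT a, b FROM db.t"},
        {"version-id": 3, "timestamp-ms": 1626200000000, "sql": "SELECT a, b, c FROM db.t"},
    ],
}

def version_as_of(meta, ts_ms):
    """Time travel: latest version created at or before ts_ms."""
    eligible = [v for v in meta["versions"] if v["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda v: v["version-id"]) if eligible else None

def rollback(meta, version_id):
    """Rollback: move the current-version pointer; history stays intact."""
    assert any(v["version-id"] == version_id for v in meta["versions"])
    meta["current-version-id"] = version_id
```

The `$versions` system table in item 2 would simply expose the `versions` list.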
>>>>>>>>>>>>
>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect
>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec is to
>>>>>>>>>>>> provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question and is
>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>
>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>>>>>> operability
>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>
>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable
>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly compatible
>>>>>>>>>>>> across engines.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> regards,
>>>>>>>>>>>> Anjali
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>>>> between query engines.
>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>> I treat it as a great technology supporting transitioning from
>>>>>>>>>>>>> a query engine to a better one.
>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing
>>>>>>>>>>>>> and analysis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> PF
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to chat more/help
>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been implemented [1]
>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the
>>>>>>>>>>>>>>> doc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: Proposal: Support for views in Iceberg

Posted by Ryan Blue <bl...@tabular.io>.
I think that the current proposal is looking good, but it is always a good
idea to give people a few days to review it and bring up any issues or
further discussion on the topics in this thread.

I'll also add this to the next sync agenda so we can farm for dissent next
week. Sometimes you can get a better read on what is still concerning in
person.

Ryan

On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood <an...@netflix.com.invalid>
wrote:

> Hello All,
> I answered all the comments and made changes to the spec to reflect them
> .. the doc now shows the new revision (link here:
> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
> ).
> Please diff with the version named 'Spec v1' in order to see the deltas.
> Please let me know if any of your comments are not
> satisfactorily addressed.
>
> @Ryan, Thank you for the insightful feedback, especially around the use of
> Iceberg data types for schema and possible evolution.
> I agree with your comment: *"I think we should move forward with the
> proposal as it is, and pursue type annotations and possibly IR in
> parallel."*
> How do we achieve consensus/farm for dissent and move forward with the
> proposal?
>
> thanks,
> Anjali.
>
>
>
> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:
>
>> +1
>>
>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>> discussion here with everyone.
>>>
>>> For the discussion about SQL dialect, I think that the right first step
>>> is to capture the SQL or query dialect. That will give us the most
>>> flexibility. Engines that can use Coral for translation can attempt to
>>> convert and engines that don’t can see if the SQL is valid and can be used.
>>>
>>> I think that the idea to create a minimal IR is an interesting one, but
>>> can be added later. We will always need to record the SQL and dialect, even
>>> if we translate to IR because users are going to configure views using SQL.
>>> Uses like showing view history or debugging need to show the original SQL,
>>> plus relevant information like where it was created and the SQL dialect. We
>>> should be able to add this later by adding additional metadata to the view
>>> definition. I don’t think that it would introduce breaking changes to add a
>>> common representation that can be optionally consumed.
>>>
>>> Let’s continue talking about a minimal IR, separately. View translation
>>> is a hard problem. Right now, to get views across engines we have to
>>> compromise confidence. IR is a way to have strong confidence, but with
>>> limited expressibility. I think that’s a good trade in a lot of cases and
>>> is worth pursuing, even if it will take a long time.
>>>
>>> Jacques makes a great point about types, but I think that the right
>>> option here is to continue using Iceberg types. We’ve already had
>>> discussions about whether Iceberg should support annotating types with
>>> engine-specific ones, so we have a reasonable way to improve this while
>>> also providing compatibility across engines: char(n) is not necessarily
>>> supported everywhere and mapping it to string will make sense in most
>>> places. The schema is primarily used to validate that the data produced by
>>> the query hasn’t changed and that is more about the number of columns in
>>> structs and the names of fields rather than exact types. We can fix up
>>> types when substituting without losing too much: if the SQL produces a
>>> varchar(10) field that the view metadata says is a string, then it’s
>>> okay that it is varchar(10). There is some loss in that we don’t know
>>> if it was originally varchar(5), but I think that this is not going to
>>> cause too many issues. Not all engines will even validate that the schema
>>> has not changed, since it could be valid to use select * from x where
>>> ... and allow new fields to appear.
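The lenient validation described here can be sketched as follows: compare the stored view schema to the query's output by field count and name, normalizing parameterized string types instead of requiring exact type equality. The functions and type spellings are illustrative, not engine APIs:

```python
import re

def normalize(type_str):
    # varchar(10), char(3), etc. all map to plain "string"
    return "string" if re.fullmatch(r"(var)?char\(\d+\)|string", type_str) else type_str

def schemas_compatible(stored, produced):
    """stored/produced are lists of (name, type) pairs."""
    if len(stored) != len(produced):
        return False
    return all(
        s_name == p_name and normalize(s_type) == normalize(p_type)
        for (s_name, s_type), (p_name, p_type) in zip(stored, produced)
    )

stored = [("c1", "long"), ("c2", "string")]
produced = [("c1", "long"), ("c2", "varchar(10)")]
schemas_compatible(stored, produced)  # varchar(10) is fixed up to string
```

Under this check, a query that starts producing `varchar(10)` where the metadata says `string` still substitutes cleanly, while a renamed or dropped column fails validation.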
>>>
>>> Right now, I think we should move forward with the proposal as it is,
>>> and pursue type annotations and possibly IR in parallel. Does that sound
>>> reasonable to everyone?
>>>
>>> Ryan
>>>
>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi Anjali,
>>>>
>>>> That's a nice summary.
>>>>
>>>> re dialect field: it shouldn't be much trouble to have it (or any
>>>> other way to identify the application that created the view), and it might be
>>>> useful.
>>>> Why not make it required from the start?
>>>>
>>>> re "expanded/resolved SQL" -- I don't understand yet what we would put
>>>> there, so cannot comment.
>>>>
>>>> I agree that it's nice to get something out of the door, and I see how
>>>> the current proposal fits some needs already.
>>>> However, I am concerned about the proliferation of non-cross-engine
>>>> compatible views if we do that.
>>>>
>>>> Also, if we later agree on any compatible approach (portable subset of
>>>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>>>> it would be a breaking change.
>>>> Unless we make the compatible approach as expressive as full power of
>>>> SQL, some views that are possible to create in v1 will not be possible to
>>>> create in v2.
>>>> Thus, if v1  is "some SQL" and v2 is "something awesomely compatible",
>>>> we may not be able to roll it out.
>>>>
>>>> > the convention of common SQL has been working for a majority of
>>>> users. SQL features commonly used are column projections, simple filter
>>>> application, joins, grouping and common aggregate and scalar function. A
>>>> few users occasionally would like to use Trino or Spark specific functions
>>>> but are sometimes able to find a way to use a function that is common to
>>>> both the engines.
>>>>
>>>>
>>>> It's an awesome summary of what constructs are necessary to be able to
>>>> define useful views, while also keeping them portable.
>>>>
>>>> To be able to express column projections, simple filter application,
>>>> joins, grouping and common aggregate and scalar function in a structured
>>>> IR, how much effort do you think would be required?
>>>> We didn't really talk about downsides of a structured approach, other
>>>> than it looks complex.
>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>>>> that. Maybe I was overly optimistic though.
>>>>
>>>>
>>>> As Jack mentioned, for engine-specific approach that's not supposed to
>>>> be consumed by multiple engines, we may be better served with approach
>>>> that's outside of Iceberg spec, like
>>>> https://github.com/trinodb/trino/pull/8540.
>>>>
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>> <an...@netflix.com.invalid> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thank you for all the comments. I will try to address them all here
>>>>> together.
>>>>>
>>>>>
>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>>    We can all agree that all of these options are non-trivial to
>>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>>    desirable features such as versioning of views and a common format of
>>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>>    anyone feel strongly otherwise?*
>>>>>    - @Piotr  As for common views at Netflix, the restrictions on SQL
>>>>>    are not enforced, but are advised as best practices. The convention of
>>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>>    used are column projections, simple filter application, joins, grouping and
>>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>>    way to use a function that is common to both the engines.
>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>    - @Jacques Table references in the views can be arbitrary objects
>>>>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>>>>    clarify it in the spec.
>>>>>
>>>>> I will work on incorporating all the comments in the spec and make the
>>>>> next revision available for review soon.
>>>>>
>>>>> Regards,
>>>>> Anjali.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>
>>>>>>> Some thoughts...
>>>>>>>
>>>>>>>    - In general, many engines want (or may require) a resolved sql
>>>>>>>    field. This--at minimum--typically includes star expansion since
>>>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>>    both of these things.
>>>>>>>    - It isn't clear in the spec whether the table references used
>>>>>>>    in views are restricted to other Iceberg objects or can be arbitrary
>>>>>>>    objects in the context of a particular engine. Maybe I missed this? For
>>>>>>>    example, can I have a Trino engine view that references an Elasticsearch
>>>>>>>    table stored in an Iceberg view?
>>>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>>>    unintended consequences. I appreciate the attraction to it but I think it
>>>>>>>    would either create artificial barriers around the types of SQL that are
>>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>    - It is important to remember that Calcite is a set of libraries
>>>>>>>    and not a specification. There are things that can be specified in Calcite
>>>>>>>    but in general it doesn't have formal specification as a first principle.
>>>>>>>    It is more implementation as a first principle. This is in contrast to
>>>>>>>    projects like Arrow and Iceberg, which start with well-formed
>>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>>    fully resolved/comprehended.
>>>>>>>
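Jacques' first bullet (star expansion and relative-to-absolute name resolution at view creation time) can be illustrated with a minimal sketch; the helper below and all of its names are hypothetical, not any engine's API:

```python
# Hypothetical sketch of producing a "resolved SQL" field at view creation
# time: expand `*` into the concrete column list and qualify the relative
# table name against the session catalog and namespace, so the stored text
# stays stable even if the underlying table later gains columns.

def resolve_view_sql(select_list, table_name, table_columns, catalog, namespace):
    # Star expansion: freeze the column list as it exists right now.
    columns = table_columns if select_list == ["*"] else select_list
    # Name resolution: relative table name -> absolute object name.
    qualified = f"{catalog}.{namespace}.{table_name}"
    return f"SELECT {', '.join(columns)} FROM {qualified}"
```

For example, `resolve_view_sql(["*"], "t1", ["c1", "c2"], "prod", "db")` returns `SELECT c1, c2 FROM prod.db.t1`; adding a column to `t1` later does not change the stored definition, which is the guarantee the bullet above describes.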
>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>>> of those two domains.)
>>>>>>>
>>>>>>> Some other notes:
>>>>>>>
>>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>>    not sufficient to implement what is necessary for
>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>    - Projects like Coral [4] are interesting here but even Coral is
>>>>>>>    based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>>    Coral slack [5]
>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>    closer to specification based [7] than Calcite but is much less popular in
>>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>>    a strength of Calcite.
>>>>>>>    - One of the other problems with building against an
>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>>    how painful this kind of mistake can become.
>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>> [2]
>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>> [5]
>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>> [6] https://github.com/google/zetasql
>>>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>> [8]
>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>
>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>> interoperability here; I will add my response in the design doc here.
>>>>>>>>
>>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>>> and get the spec in the right structure in the initial release, because:
>>>>>>>>
>>>>>>>> 1. I believe that cross-engine compatibility is a critical feature
>>>>>>>> of Iceberg. If I am a user of an existing data lake that already supports
>>>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>>>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>>>>> just use a table property to indicate the table is a view and store the
>>>>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>>>>> support all the view features. One of the most valuable and unique features
>>>>>>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>>>>>>> and read by another. Not supporting cross-engine compatibility feels like
>>>>>>>> losing a lot of value to me.
>>>>>>>>
>>>>>>>> 2. In the view definition, it feels inconsistent to me that we have
>>>>>>>> "schema" as an Iceberg native schema, but "sql" field as the view SQL that
>>>>>>>> can come from any query engine. If the engine already needs to convert the
>>>>>>>> view schema to Iceberg schema, it should just do the same for the view SQL.
>>>>>>>>
>>>>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>>>>> implementation of some intermediate representation. I don't have a very
>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>
>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>
>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>>>>> have a native lightweight logical query structure representation of the
>>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>>
>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>> instead of just a single string.
>>>>>>>>
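Jack's point 3 amounts to a small lookup with a fallback. A sketch of what such metadata and its resolution might look like follows; every field name here ("engine-sqls", "default", the engine keys) is hypothetical, not part of any spec:

```python
# Hypothetical view metadata with an "engine-sqls" map keyed by engine
# type and version, plus a fallback entry for engines not in the map.
view_definition = {
    "engine-sqls": {
        "spark-3.1": "SELECT id, ts FROM db.events",
        "trino-360": "SELECT id, ts FROM iceberg.db.events",
        "default": "SELECT id, ts FROM db.events",
    },
}

def sql_for_engine(view, engine_key):
    sqls = view["engine-sqls"]
    # Prefer the engine's own SQL (avoids redundant parsing); otherwise
    # fall back to the default/intermediate representation.
    return sqls.get(engine_key, sqls["default"])
```

An engine that created the view reads its own entry back verbatim; any other engine falls through to the default entry.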
>>>>>>>> Thanks,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>> view-related work!
>>>>>>>>>
>>>>>>>>> I agree that solving the cross-engine compatibility problem may not
>>>>>>>>> be a primary feature today, but I am concerned that not thinking about
>>>>>>>>> this from the start may "tunnel" us into a wrong direction.
>>>>>>>>> Cross-engine compatible views would be such a cool feature that it
>>>>>>>>> is hard to just let it pass.
>>>>>>>>>
>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>> However, with new IR being focused on compatible representation,
>>>>>>>>> and not being tied to anything are actually good things.
>>>>>>>>> For example, we need to focus on JSON representation, but we don't
>>>>>>>>> need to deal with tree traversal or anything, so the code for this could be
>>>>>>>>> pretty simple.
>>>>>>>>>
>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>>>
>>>>>>>>> It's interesting. Anjali, do you have means to enforce that, or
>>>>>>>>> is this just a convention?
>>>>>>>>>
>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>>> views?
>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>> representation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Anjali,
>>>>>>>>>>
>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>>>>> list once the spec has been approved by the community. My hope is that the
>>>>>>>>>> final version of the view spec will make it easy to re-use existing
>>>>>>>>>> rollback/time travel/metadata etc functionalities.
>>>>>>>>>>
>>>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>> desirable, while storing a Calcite AST makes it hard for e.g. Python to
>>>>>>>>>> use views. A project to make a cross-language and cross-engine IR for SQL
>>>>>>>>>> views and the relevant transpilers is IMHO outside the scope of this spec
>>>>>>>>>> and probably deserving of an Apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>>>>> original spirit of iceberg tables and I wonder how the broader community
>>>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>
>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>>
>>>>>>>>>>> A few areas that we have not yet worked on and would love for
>>>>>>>>>>> the community to help are:
>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>> version or time
>>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>>>>>>> version
>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>
>>>>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to provide
>>>>>>>>>>> a storage format for view metadata and APIs to access that metadata. The
>>>>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside the
>>>>>>>>>>> scope of this spec.
>>>>>>>>>>>
>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>>>>> operability
>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>>>>>>> functionality does it provide over Calcite?)
>>>>>>>>>>>
>>>>>>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible across
>>>>>>>>>>> engines.
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Anjali
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>>
>>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>>> between query engines.
>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>
>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>> I treat it as a great technology supporting transitioning from
>>>>>>>>>>>> a query engine to a better one.
>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>> For example, it could represent logical structure of
>>>>>>>>>>>> operations in semantics manner.
>>>>>>>>>>>> This would eliminate need for cross-engine compatible parsing
>>>>>>>>>>>> and analysis.
>>>>>>>>>>>>
>>>>>>>>>>>> Best
>>>>>>>>>>>> PF
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to chat more/help
>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been implemented [1] and
>>>>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the
>>>>>>>>>>>>>> doc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>> --
>> John Zhuge
>>
>

-- 
Ryan Blue
Tabular

Re: Proposal: Support for views in Iceberg

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
Hello All,
I answered all the comments and made changes to the spec to reflect them;
the doc now shows the new revision (link here:
https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
).
Please diff with the version named 'Spec v1' in order to see the deltas.
Please let me know if any of your comments are not
satisfactorily addressed.

@Ryan, Thank you for the insightful feedback, especially around the use of
Iceberg data types for schema and possible evolution.
I agree with your comment: *"I think we should move forward with the
proposal as it is, and pursue type annotations and possibly IR in
parallel."*
How do we achieve consensus/farm for dissent and move forward with the
proposal?

thanks,
Anjali.



On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jz...@apache.org> wrote:

> +1
>
> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Thanks for working on this, Anjali! It’s great to see the thorough
>> discussion here with everyone.
>>
>> For the discussion about SQL dialect, I think that the right first step
>> is to capture the SQL or query dialect. That will give us the most
>> flexibility. Engines that can use Coral for translation can attempt to
>> convert and engines that don’t can see if the SQL is valid and can be used.
>>
>> I think that the idea to create a minimal IR is an interesting one, but
>> can be added later. We will always need to record the SQL and dialect, even
>> if we translate to IR because users are going to configure views using SQL.
>> Uses like showing view history or debugging need to show the original SQL,
>> plus relevant information like where it was created and the SQL dialect. We
>> should be able to add this later by adding additional metadata to the view
>> definition. I don’t think that it would introduce breaking changes to add a
>> common representation that can be optionally consumed.
>>
>> Let’s continue talking about a minimal IR, separately. View translation
>> is a hard problem. Right now, to get views across engines we have to
>> compromise confidence. IR is a way to have strong confidence, but with
>> limited expressibility. I think that’s a good trade in a lot of cases and
>> is worth pursuing, even if it will take a long time.
>>
>> Jacques makes a great point about types, but I think that the right
>> option here is to continue using Iceberg types. We’ve already had
>> discussions about whether Iceberg should support annotating types with
>> engine-specific ones, so we have a reasonable way to improve this while
>> also providing compatibility across engines: char(n) is not necessarily
>> supported everywhere and mapping it to string will make sense in most
>> places. The schema is primarily used to validate that the data produced by
>> the query hasn’t changed and that is more about the number of columns in
>> structs and the names of fields rather than exact types. We can fix up
>> types when substituting without losing too much: if the SQL produces a
>> varchar(10) field that the view metadata says is a string, then it’s
>> okay that it is varchar(10). There is some loss in that we don’t know if
>> it was originally varchar(5), but I think that this is not going to
>> cause too many issues. Not all engines will even validate that the schema
>> has not changed, since it could be valid to use select * from x where ...
>> and allow new fields to appear.
>>
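The relaxed validation Ryan describes might look like the following sketch, which compares schemas by structure (field count and names) and maps engine character types onto string rather than requiring exact type equality. This is illustrative only, not Iceberg's implementation:

```python
# Sketch: validate a view's stored schema against the schema the query
# now produces, caring about field count and names but tolerating
# char(n)/varchar(n) vs. string differences. Illustrative only.

def normalize(type_name):
    # Map engine-specific character types onto Iceberg's string type.
    return "string" if type_name.startswith(("char", "varchar")) else type_name

def schemas_compatible(stored, produced):
    """Each schema is a list of (field_name, type_name) pairs."""
    if len(stored) != len(produced):
        return False
    return all(sn == pn and normalize(st) == normalize(pt)
               for (sn, st), (pn, pt) in zip(stored, produced))
```

Under this check a query producing `varchar(10)` still matches a stored `string` field, while a renamed or dropped column is rejected.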
>> Right now, I think we should move forward with the proposal as it is, and
>> pursue type annotations and possibly IR in parallel. Does that sound
>> reasonable to everyone?
>>
>> Ryan
>>
>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi Anjali,
>>>
>>> That's a nice summary.
>>>
>>> Re the dialect field: it shouldn't be much trouble to have it (or any other
>>> way to identify the application that created the view), and it might be useful.
>>> Why not make it required from the start?
>>>
>>> Re "expanded/resolved SQL" -- I don't understand yet what we would put
>>> there, so I cannot comment.
>>>
>>> I agree that it's nice to get something out of the door, and I see how
>>> the current proposal fits some needs already.
>>> However, I am concerned about the proliferation of non-cross-engine
>>> compatible views, if we do that.
>>>
>>> Also, if we later agree on any compatible approach (portable subset of
>>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>>> it would be a breaking change.
>>> Unless we make the compatible approach as expressive as full power of
>>> SQL, some views that are possible to create in v1 will not be possible to
>>> create in v2.
>>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible",
>>> we may not be able to roll it out.
>>>
>>> > the convention of common SQL has been working for a majority of users.
>>> SQL features commonly used are column projections, simple filter
>>> application, joins, grouping and common aggregate and scalar function. A
>>> few users occasionally would like to use Trino or Spark specific functions
>>> but are sometimes able to find a way to use a function that is common to
>>> both the engines.
>>>
>>>
>>> It's an awesome summary of what constructs are necessary to be able to
>>> define useful views, while also keep them portable.
>>>
>>> To be able to express column projections, simple filter application,
>>> joins, grouping and common aggregate and scalar function in a structured
>>> IR, how much effort do you think would be required?
>>> We didn't really talk about downsides of a structured approach, other
>>> than that it looks complex.
>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>>> that. Maybe I was overly optimistic though.
>>>
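As a rough gauge of the effort question above: the constructs Anjali enumerates (column projections, simple filters, joins, grouping, common aggregates) fit a small set of JSON-serializable node types. The shapes below are a sketch only, not a proposed spec:

```python
import json

# Sketch of a minimal, already-structural IR: no parsing, no scopes,
# no implicit coercions. Node shapes are illustrative only.

def scan(table):
    return {"op": "scan", "table": table}

def filter_(child, predicate):
    return {"op": "filter", "predicate": predicate, "input": child}

def aggregate(child, keys, aggs):
    return {"op": "aggregate", "keys": keys, "aggs": aggs, "input": child}

# SELECT region, count(*) FROM db.orders WHERE status = 'open' GROUP BY region
view_ir = aggregate(
    filter_(scan("db.orders"), {"eq": ["status", "open"]}),
    keys=["region"],
    aggs=[{"count": "*"}],
)

serialized = json.dumps(view_ir)  # what an engine would store and exchange
```

An engine would build this tree while traversing its own AST at view-creation time, as Jack suggested earlier in the thread, and consume it directly at read time without any cross-engine parsing or analysis.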
>>>
>>> As Jack mentioned, for engine-specific approach that's not supposed to
>>> be consumed by multiple engines, we may be better served with approach
>>> that's outside of Iceberg spec, like
>>> https://github.com/trinodb/trino/pull/8540.
>>>
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>> <an...@netflix.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you for all the comments. I will try to address them all here
>>>> together.
>>>>
>>>>
>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>>    We can all agree that all of these options are non-trivial to
>>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>>    and merit further discussion. I would like to suggest that we continue this
>>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>    where the views are either accessed by a single engine or multi-engine use
>>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>>    desirable features such as versioning of views and a common format of
>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>    anyone feel strongly otherwise?*
>>>>    - @Piotr  As for common views at Netflix, the restrictions on SQL
>>>>    are not enforced, but are advised as best practices. The convention of
>>>>    common SQL has been working for a majority of users. SQL features commonly
>>>>    used are column projections, simple filter application, joins, grouping and
>>>>    common aggregate and scalar function. A few users occasionally would like
>>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>>    way to use a function that is common to both the engines.
>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>    Spark does not need it and Trino uses it only for validation).
>>>>    - @Jacques Table references in the views can be arbitrary objects
>>>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>>>    clarify it in the spec.
>>>>
>>>> I will work on incorporating all the comments in the spec and make the
>>>> next revision available for review soon.
>>>>
>>>> Regards,
>>>> Anjali.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>> piotr@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>
>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>> I.e. it would be useful if there were some external app (or a human
>>>>> being) that would write those alternative SQLs for each dialect.
>>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or
>>>>> Trino writing SQL for Trino and Spark.
>>>>>
>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>> While i like the idea of starting with Trino SQL ANTLR grammar file
>>>>> (it's really well written and resembles spec quite well), you made a good
>>>>> point that grammar is only part of the problem. Coercions, function
>>>>> resolution, dereference resolution, table resolution are part of query
>>>>> analysis that goes beyond just grammar.
>>>>> In fact, column scoping rules -- while clearly defined by the spec --
>>>>> may easily differ between engines (pretty usual).
>>>>> That's why i would rather lean towards some intermediate
>>>>> representation that is *not *SQL, doesn't require parsing (is already
>>>>> structural), nor analysis (no scopes! no implicit coercions!).
>>>>> Before we embark on such a journey, it would be interesting to hear @Martin
>>>>> Traverso <ma...@starburstdata.com> 's thoughts on feasibility though.
>>>>>
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> Some thoughts...
>>>>>>
>>>>>>    - In general, many engines want (or may require) a resolved sql
>>>>>>    field. This--at minimum--typically includes star expansion since
>>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>>    logical definition even if the underlying table changes). This may also
>>>>>>    include a replacement of relative object names to absolute object names
>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>>    both of these things.
>>>>>>    - It isn't clear in the spec whether the table references used in
>>>>>>    views are restricted to other Iceberg objects or can be arbitrary objects
>>>>>>    in the context of a particular engine. Maybe I missed this? For example,
>>>>>>    can I have a Trino engine view that references an Elasticsearch table
>>>>>>    stored in an Iceberg view?
>>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>>    unintended consequences. I appreciate the attraction to it but I think it
>>>>>>    would either create artificial barriers around the types of SQL that are
>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>    - It is important to remember that Calcite is a set of libraries
>>>>>>    and not a specification. There are things that can be specified in Calcite
>>>>>>    but in general it doesn't have formal specification as a first principle.
>>>>>>    It is more implementation as a first principle. This is in contrast to
>>>>>>    projects like Arrow and Iceberg, which start with well-formed
>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>>    fully resolved/comprehended.
>>>>>>
>>>>>> Given the above, I suggest having a field which describes the
>>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>> of those two domains.)
>>>>>>
>>>>>> Some other notes:
>>>>>>
>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>    not sufficient to implement what is necessary for
>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>    - Projects like Coral [4] are interesting here but even Coral is
>>>>>>    based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>    Coral slack [5]
>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>    closer to specification based [7] than Calcite but is much less popular in
>>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>>    a strength of Calcite.
>>>>>>    - One of the other problems with building against an
>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>>    how painful this kind of mistake can become.
>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>    actually implements that in its entirety and it has far too many places
>>>>>>    where things are "implementation-defined" [9].
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>> [2]
>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>> [4] https://github.com/linkedin/coral
>>>>>> [5]
>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>> [6] https://github.com/google/zetasql
>>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>> [8]
>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>
>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>
>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>> and get the spec in the right structure in the initial release, because:
>>>>>>>
>>>>>>> 1. I believe that cross-engine compatibility is a critical feature
>>>>>>> of Iceberg. If I am a user of an existing data lake that already supports
>>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>>>> just use a table property to indicate the table is a view and store the
>>>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>>>> support all the view features. One of the most valuable and unique features
>>>>>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>>>>>> and read by another. Not supporting cross-engine compatibility feels like
>>>>>>> losing a lot of value to me.
>>>>>>>
>>>>>>> 2. In the view definition, it feels inconsistent to me that we have
>>>>>>> "schema" as an Iceberg native schema, but "sql" field as the view SQL that
>>>>>>> can come from any query engine. If the engine already needs to convert the
>>>>>>> view schema to Iceberg schema, it should just do the same for the view SQL.
>>>>>>>
>>>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>>>> implementation of some intermediate representation. I don't have a very
>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>
>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>
>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>>>> have a native lightweight logical query structure representation of the
>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>> extend on that to support this use case. Different engines can build this
>>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>>
>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the engine
>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>> this way, the engine that creates the view can still read the SQL directly
>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>> representation it can fallback to when the engine's key is not found in the
>>>>>>> map. If we want to make incremental progress and delay the design for the
>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>> instead of just a single string.
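The map-with-fallback lookup in point 3 might look like this (field and key names such as "engine-sqls" and "default" are illustrative assumptions, not an agreed spec):

```python
# Illustrative sketch of the "engine-sqls" idea: per-engine SQL keyed by
# engine type and version, with a default representation to fall back to.
# Field and key names are assumptions, not part of any agreed spec.
view_definition = {
    "schema": "...",                      # Iceberg schema (elided)
    "engine-sqls": {
        "spark-3.1": "SELECT id, name FROM db.events WHERE level = 'ERROR'",
        "trino-360": "SELECT id, name FROM db.events WHERE level = 'ERROR'",
        "default":   "SELECT id, name FROM db.events WHERE level = 'ERROR'",
    },
}

def sql_for_engine(view, engine_key):
    """Return the engine's own SQL if present, else the default fallback."""
    sqls = view["engine-sqls"]
    return sqls.get(engine_key, sqls["default"])
```

An engine that created the view reads its own entry directly; any other engine falls back to the default representation.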
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> First of all thank you for this discussion and all the view-related
>>>>>>>> work!
>>>>>>>>
>>>>>>>> I agree that solving the cross-engine compatibility problem may not be a
>>>>>>>> primary feature today, but I am concerned that not thinking about this
>>>>>>>> from the start may "tunnel" us into a wrong direction.
>>>>>>>> Cross-engine compatible views would be such a cool feature that it
>>>>>>>> is hard to just let it pass.
>>>>>>>>
>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>> being familiar enough with Calcite.
>>>>>>>> However, a new IR being focused on a compatible representation, and
>>>>>>>> not being tied to anything existing, would actually be a good thing.
>>>>>>>> For example, we need to focus on JSON representation, but we don't
>>>>>>>> need to deal with tree traversal or anything, so the code for this could be
>>>>>>>> pretty simple.
>>>>>>>>
>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>>
>>>>>>>> It's interesting. Anjali, do you have means to enforce that, or is
>>>>>>>> this just a convention?
>>>>>>>>
>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>> views?
>>>>>>>> Being able to enumerate them could help validate various approaches
>>>>>>>> considered here, including feasibility of dedicated representation.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Anjali,
>>>>>>>>>
>>>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>>>> list once the spec has been approved by the community. My hope is that the
>>>>>>>>> final version of the view spec will make it easy to re-use existing
>>>>>>>>> rollback/time travel/metadata etc functionalities.
>>>>>>>>>
>>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing
>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>> desirable, while storing a Calcite AST makes it hard for e.g. Python
>>>>>>>>> to use views. A
>>>>>>>>> project to make a cross language and cross engine IR for sql views and the
>>>>>>>>> relevant transpilers is imho outside the scope of this spec and probably
>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr suggested
>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I think the
>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds with the
>>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have
>>>>>>>>>> and will continue to address them in the doc.
>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>
>>>>>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>>>>>> spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>
>>>>>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>>>>>> community to help are:
>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>> version or time
>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>>>>>> version
>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>
>>>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to provide
>>>>>>>>>> a storage format for view metadata and APIs to access that metadata. The
>>>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside the
>>>>>>>>>> scope of this spec.
>>>>>>>>>>
>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>> common across engines in the view definition (this is how currently Netflix
>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>>>> operability
>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>>>>>> functionality does it provide over Calcite?)
>>>>>>>>>>
>>>>>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible across
>>>>>>>>>> engines.
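As a rough illustration of that extensibility, a v1 view metadata entry with the optional dialect field might look like the following sketch (field names here are hypothetical; the authoritative layout is in the design doc):

```python
# Hypothetical shape of a v1 view metadata entry carrying the optional
# "dialect" field discussed above. All field names are illustrative
# assumptions; the actual spec is defined in the linked design doc.
import json

view_metadata = {
    "format-version": 1,
    "current-version-id": 2,
    "versions": [
        {
            "version-id": 2,
            "timestamp-ms": 1626800000000,
            "view-definition": {
                "sql": "SELECT id, name FROM db.events WHERE level = 'ERROR'",
                "dialect": "spark",          # optional in this sketch
                "schema": None,              # optional Iceberg schema
            },
        }
    ],
}

blob = json.dumps(view_metadata)
```

Because the metadata is plain JSON, a later format version can add fields (an IR, per-engine SQL, type annotations) without breaking readers of v1.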
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Anjali
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>
>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>> between query engines.
>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>
>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>> I treat it as a great technology supporting transitioning from a
>>>>>>>>>>> query engine to a better one.
>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>
>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing
>>>>>>>>>>> and analysis.
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> PF
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <ry...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>
>>>>>>>>>>>> I have left some comments on the document. I unfortunately have
>>>>>>>>>>>> to miss the community meetup tomorrow but would love to chat more/help w/
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been implemented [1] and
>>>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
> John Zhuge
>

Re: Proposal: Support for views in Iceberg

Posted by John Zhuge <jz...@apache.org>.
+1

On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <bl...@tabular.io> wrote:

> Thanks for working on this, Anjali! It’s great to see the thorough
> discussion here with everyone.
>
> For the discussion about SQL dialect, I think that the right first step is
> to capture the SQL or query dialect. That will give us the most
> flexibility. Engines that can use Coral for translation can attempt to
> convert and engines that don’t can see if the SQL is valid and can be used.
>
> I think that the idea to create a minimal IR is an interesting one, but
> can be added later. We will always need to record the SQL and dialect, even
> if we translate to IR because users are going to configure views using SQL.
> Uses like showing view history or debugging need to show the original SQL,
> plus relevant information like where it was created and the SQL dialect. We
> should be able to add this later by adding additional metadata to the view
> definition. I don’t think that it would introduce breaking changes to add a
> common representation that can be optionally consumed.
>
> Let’s continue talking about a minimal IR, separately. View translation is
> a hard problem. Right now, to get views across engines we have to
> compromise confidence. IR is a way to have strong confidence, but with
> limited expressibility. I think that’s a good trade in a lot of cases and
> is worth pursuing, even if it will take a long time.
>
> Jacques makes a great point about types, but I think that the right option
> here is to continue using Iceberg types. We’ve already had discussions
> about whether Iceberg should support annotating types with engine-specific
> ones, so we have a reasonable way to improve this while also providing
> compatibility across engines: char(n) is not necessarily supported
> everywhere and mapping it to string will make sense in most places. The
> schema is primarily used to validate that the data produced by the query
> hasn’t changed and that is more about the number of columns in structs and
> the names of fields rather than exact types. We can fix up types when
> substituting without losing too much: if the SQL produces a varchar(10)
> field that the view metadata says is a string, then it’s okay that it is
> varchar(10). There is some loss in that we don’t know if it was
> originally varchar(5), but I think that this is not going to cause too
> many issues. Not all engines will even validate that the schema has not
> changed, since it could be valid to use select * from x where ... and
> allow new fields to appear.
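The type fix-up described above could be sketched as below. This is a hedged illustration assuming a simple name-plus-type schema comparison; the mapping of char(n)/varchar(n) down to string follows the reasoning in this message, not any finalized spec:

```python
import re

# Illustrative type fix-up: map engine-specific character types down to
# Iceberg's string when validating a view's schema, so varchar(10) from
# the query still matches a stored "string" column. The mapping table and
# schema shape are assumptions for this sketch, not the Iceberg spec.
def to_iceberg_type(engine_type: str) -> str:
    t = engine_type.strip().lower()
    if re.fullmatch(r"(var)?char\s*\(\s*\d+\s*\)", t):
        return "string"                      # char(n)/varchar(n) -> string
    return t

def schemas_compatible(view_schema, query_schema):
    """Compare column names and fixed-up types, ignoring length parameters."""
    if [n for n, _ in view_schema] != [n for n, _ in query_schema]:
        return False
    return all(to_iceberg_type(a) == to_iceberg_type(b)
               for (_, a), (_, b) in zip(view_schema, query_schema))

# varchar(10) produced by the query still matches the stored "string" column
ok = schemas_compatible([("c1", "string")], [("c1", "varchar(10)")])
```

As noted above, some information (e.g. that a column was originally varchar(5)) is lost by this substitution, which is the trade-off being accepted.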
>
> Right now, I think we should move forward with the proposal as it is, and
> pursue type annotations and possibly IR in parallel. Does that sound
> reasonable to everyone?
>
> Ryan
>
> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi Anjali,
>>
>> That's a nice summary.
>>
>> re dialect field. It shouldn't be much trouble to have it (or any other
>> way to identify the application that created the view), and it might be useful.
>> Why not make it required from the start?
>>
>> re "expanded/resolved SQL" -- I don't understand yet what we would put
>> there, so cannot comment.
>>
>> I agree that it's nice to get something out of the door, and I see how
>> the current proposal fits some needs already.
>> However, I am concerned about the proliferation of non-cross-engine
>> compatible views if we do that.
>>
>> Also, if we later agree on any compatible approach (portable subset of
>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>> it would be a breaking change.
>> Unless we make the compatible approach as expressive as full power of
>> SQL, some views that are possible to create in v1 will not be possible to
>> create in v2.
>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we
>> may not be able to roll it out.
>>
>> > the convention of common SQL has been working for a majority of users.
>> SQL features commonly used are column projections, simple filter
>> application, joins, grouping and common aggregate and scalar function. A
>> few users occasionally would like to use Trino or Spark specific functions
>> but are sometimes able to find a way to use a function that is common to
>> both the engines.
>>
>>
>> it's an awesome summary of what constructs are necessary to be able to
>> define useful views, while also keep them portable.
>>
>> To be able to express column projections, simple filter application,
>> joins, grouping and common aggregate and scalar function in a structured
>> IR, how much effort do you think would be required?
>> We didn't really talk about downsides of a structured approach, other
>> than it looks complex.
>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>> that. Maybe I was overly optimistic though.
>>
>>
>> As Jack mentioned, for engine-specific approach that's not supposed to be
>> consumed by multiple engines, we may be better served with approach that's
>> outside of Iceberg spec, like https://github.com/trinodb/trino/pull/8540.
>>
>>
>> Best,
>> PF
>>
>>
>>
>>
>>
>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>> <an...@netflix.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> Thank you for all the comments. I will try to address them all here
>>> together.
>>>
>>>
>>>    - @all Cross engine compatibility of view definition: Multiple
>>>    options such as engine agnostic SQL or IR of some form have been mentioned.
>>>    We can all agree that all of these options are non-trivial to
>>>    design/implement (perhaps a multi-year effort based on the option chosen)
>>>    and merit further discussion. I would like to suggest that we continue this
>>>    discussion but target this work for the future (v2?). In v1, we can add an
>>>    optional dialect field and an optional expanded/resolved SQL field that can
>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>    where the views are either accessed by a single engine or multi-engine use
>>>    cases where a (common) subset of SQL is supported. This proposal allows for
>>>    desirable features such as versioning of views and a common format of
>>>    storing view metadata while allowing extensibility in the future. *Does
>>>    anyone feel strongly otherwise?*
>>>    - @Piotr  As for common views at Netflix, the restrictions on SQL
>>>    are not enforced, but are advised as best practices. The convention of
>>>    common SQL has been working for a majority of users. SQL features commonly
>>>    used are column projections, simple filter application, joins, grouping and
>>>    common aggregate and scalar function. A few users occasionally would like
>>>    to use Trino or Spark specific functions but are sometimes able to find a
>>>    way to use a function that is common to both the engines.
>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>    hence were picked for storing view schema. Thinking further, the schema
>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>    Spark does not need it and Trino uses it only for validation).
>>>    - @Jacques Table references in the views can be arbitrary objects
>>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>>    clarify it in the spec.
>>>
>>> I will work on incorporating all the comments in the spec and make the
>>> next revision available for review soon.
>>>
>>> Regards,
>>> Anjali.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>
>>>> I agree that tracking  dialect/origin is better than nothing.
>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>> I.e. it would be useful if there was some external app (or a human
>>>> being) that would write those alternative SQLs for each dialect.
>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or
>>>> Trino writing SQL for Trino and Spark.
>>>>
>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file
>>>> (it's really well written and resembles spec quite well), you made a good
>>>> point that grammar is only part of the problem. Coercions, function
>>>> resolution, dereference resolution, table resolution are part of query
>>>> analysis that goes beyond just grammar.
>>>> In fact, column scoping rules -- while clearly defined by the spec --
>>>> may easily differ between engines (pretty usual).
>>>> That's why I would rather lean towards some intermediate representation
>>>> that is *not* SQL, doesn't require parsing (is already structural),
>>>> nor analysis (no scopes! no implicit coercions!).
>>>> Before we embark on such a journey, it would be interesting to hear @Martin
>>>> Traverso <ma...@starburstdata.com> 's thoughts on feasibility though.
>>>>
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Some thoughts...
>>>>>
>>>>>    - In general, many engines want (or may require) a resolved sql
>>>>>    field. This--at minimum--typically includes star expansion since
>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>    (since this is the only way to guarantee that the view returns the same
>>>>>    logical definition even if the underlying table changes). This may also
>>>>>    include a replacement of relative object names to absolute object names
>>>>>    based on the session catalog & namespace. If I recall correctly, Hive does
>>>>>    both of these things.
>>>>>    - It isn't clear in the spec whether the table references used in
>>>>>    views are restricted to other Iceberg objects or can be arbitrary objects
>>>>>    in the context of a particular engine. Maybe I missed this? For example,
>>>>>    can I have a Trino engine view that references an Elasticsearch table
>>>>>    stored in an Iceberg view?
>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>    unintended consequences. I appreciate the attraction to it but I think it
>>>>>    would either create artificial barriers around the types of SQL that are
>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>    change the behavior of the query which I believe violates most typical
>>>>>    engine behaviors. A good example of this is the simple sql statement of
>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>    context, is this view disallowed? If it isn't disallowed then you have an
>>>>>    issue where the view schema will be required to be different from a CTE
>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>    - It is important to remember that Calcite is a set of libraries
>>>>>    and not a specification. There are things that can be specified in Calcite
>>>>>    but in general it doesn't have formal specification as a first principle.
>>>>>    It is more implementation as a first principle. This is in contrast to
>>>>>    projects like Arrow and Iceberg, which start with well-formed
>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>    manipulation, etc.  If anything, I'd suggest we start with the adoption of
>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>    and/or Trino [2] use. Even that is not a complete specification as grammar
>>>>>    must still be interpreted with regards to type promotion, function
>>>>>    resolution, consistent unnamed expression naming, etc that aren't defined
>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC grammar
>>>>>    as it heavily embeds implementation details (in a good way) and relies on
>>>>>    some fairly complex logic in the validator and sql2rel components to be
>>>>>    fully resolved/comprehended.
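The star-expansion and name-qualification behaviors mentioned in the first bullet can be illustrated with a toy resolver. This is a naive string rewrite for illustration only — real engines perform these steps on the analyzed plan, not with regexes, and the catalog lookup here is a stand-in:

```python
# Toy illustration of "resolved SQL": expand SELECT * against the table's
# schema at view-creation time, and rewrite relative table names to
# absolute ones using the session namespace. Naive regex rewrite only;
# real engines resolve names and stars during query analysis.
import re

def resolve_view_sql(sql, session_namespace, catalog_schemas):
    # Qualify bare table names: "events" -> "prod.db.events"
    def qualify(m):
        name = m.group(1)
        return "FROM " + (name if "." in name
                          else f"{session_namespace}.{name}")
    sql = re.sub(r"FROM\s+([\w.]+)", qualify, sql, flags=re.IGNORECASE)
    # Expand SELECT * using the (now absolute) table's schema
    m = re.search(r"FROM\s+([\w.]+)", sql, flags=re.IGNORECASE)
    cols = catalog_schemas[m.group(1)]
    return re.sub(r"SELECT\s+\*", "SELECT " + ", ".join(cols), sql,
                  flags=re.IGNORECASE)

schemas = {"prod.db.events": ["id", "name", "level"]}
resolved = resolve_view_sql("SELECT * FROM events", "prod.db", schemas)
# resolved == "SELECT id, name, level FROM prod.db.events"
```

Freezing the expanded column list at creation time is what guarantees the view keeps the same logical definition even if columns are later added to the underlying table.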
>>>>>
>>>>> Given the above, I suggest having a field which describes the
>>>>> dialect(origin?) of the view and then each engine can decide how they want
>>>>> to consume/mutate that view (and whether they want to or not). It does risk
>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>> the couple of years it took to build the Arrow and Iceberg specifications.
>>>>> (Realistically, there is far more to specify here than there is in either
>>>>> of those two domains.)
>>>>>
>>>>> Some other notes:
>>>>>
>>>>>    - Calcite does provide a nice reference document [3] but it is not
>>>>>    sufficient to implement what is necessary for parsing/validating/resolving
>>>>>    a SQL string correctly/consistently.
>>>>>    - Projects like Coral [4] are interesting here but even Coral is
>>>>>    based roughly on "HiveQL" which also doesn't have a formal specification
>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>    Coral slack [5]
>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>    closer to specification based [7] than Calcite but is much less popular in
>>>>>    the big data domain. I also haven't reviewed its SQL completeness closely,
>>>>>    a strength of Calcite.
>>>>>    - One of the other problems with building against an
>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can make
>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec [8] for
>>>>>    how painful this kind of mistake can become.
>>>>>    - I'd love to use the SQL specification itself but nobody actually
>>>>>    implements that in its entirety and it has far too many places where things
>>>>>    are "implementation-defined" [9].
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> [2]
>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>> [4] https://github.com/linkedin/coral
>>>>> [5] https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>> [6] https://github.com/google/zetasql
>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>> [8]
>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>
>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> Did not notice that we are also discussing cross-engine
>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>
>>>>>> I would personally prefer cross-engine interoperability as a goal and
>>>>>> get the spec in the right structure in the initial release, because:
>>>>>>
>>>>>> 1. I believe that cross-engine compatibility is a critical feature of
>>>>>> Iceberg. If I am a user of an existing data lake that already supports
>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>>> just use a table property to indicate the table is a view and store the
>>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>>> support all the view features. One of the most valuable and unique features
>>>>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>>>>> and read by another. Not supporting cross-engine compatibility feels like
>>>>>> losing a lot of value to me.
>>>>>>
>>>>>> 2. In the view definition, it feels inconsistent to me that we have
>>>>>> "schema" as an Iceberg native schema, but "sql" field as the view SQL that
>>>>>> can come from any query engine. If the engine already needs to convert the
>>>>>> view schema to Iceberg schema, it should just do the same for the view SQL.
>>>>>>
>>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>>> implementation of some intermediate representation. I don't have a very
>>>>>> strong opinion, but my thoughts are the following:
>>>>>>
>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>> definitely going to be much more involved, and it will become another
>>>>>> barrier for newer engines to onboard because it not only needs to implement
>>>>>> Iceberg APIs but also needs Calcite support. It will also start to become a
>>>>>> constant discussion around what we maintain and what we should push to
>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>
>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>>> have a native lightweight logical query structure representation of the
>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>> extend on that to support this use case. Different engines can build this
>>>>>> logical structure when traversing their own AST during a create view query.
>>>>>>
>>>>>> 3. With these considerations, I think the "sql" field can potentially
>>>>>> be a map (maybe called "engine-sqls"?), where key is the engine type and
>>>>>> version like "Spark 3.1", and value is the view SQL string. In this way,
>>>>>> the engine that creates the view can still read the SQL directly which
>>>>>> might lead to better engine-native integration and avoid redundant parsing.
>>>>>> But in this approach there is always a default intermediate representation
>>>>>> it can fall back to when the engine's key is not found in the map. If we
>>>>>> want to make incremental progress and delay the design for the intermediate
>>>>>> representation, I think we should at least use this map instead of just a
>>>>>> single string.
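>>>>>> As a sketch of what that map could look like in the view metadata
>>>>>> JSON (all field names and engine keys here are illustrative, not a
>>>>>> settled spec):

```json
{
  "format-version": 1,
  "schema": {
    "type": "struct",
    "fields": [
      {"id": 1, "name": "id", "type": "long", "required": true},
      {"id": 2, "name": "name", "type": "string", "required": false}
    ]
  },
  "engine-sqls": {
    "spark-3.1": "SELECT id, name FROM db.events WHERE region = 'EU'",
    "trino-359": "SELECT id, name FROM db.events WHERE region = 'EU'"
  }
}
```

>>>>>> An engine would look up its own key first, and otherwise fall back
>>>>>> to whatever default intermediate representation we settle on.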
>>>>>>
>>>>>> Thanks,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>> piotr@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> First of all thank you for this discussion and all the view-related
>>>>>>> work!
>>>>>>>
>>>>>>> While I agree that solving the cross-engine compatibility problem may not
>>>>>>> be a primary feature today, I am concerned that not thinking about it from
>>>>>>> the start may "tunnel" us in a wrong direction.
>>>>>>> Cross-engine compatible views would be such a cool feature that it
>>>>>>> is hard to just let it pass.
>>>>>>>
>>>>>>> My thinking about a smaller IR may be a side-product of me not being
>>>>>>> familiar enough with Calcite.
>>>>>>> However, a new IR being focused on a compatible representation, and not
>>>>>>> being tied to any existing engine, are actually good things.
>>>>>>> For example, we would need to focus on the JSON representation, but we
>>>>>>> wouldn't need to deal with tree traversal or anything else, so the code
>>>>>>> for this could be pretty simple.
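>>>>>>> To illustrate how simple such code could be, here is a minimal Python
>>>>>>> sketch of the idea: IR nodes as plain dicts, so the JSON form falls
>>>>>>> directly out of the structure. All node and field names here are
>>>>>>> hypothetical, not part of any proposed spec.

```python
import json

# Hypothetical minimal IR node constructors; each node is a plain dict,
# so serialization to JSON requires no tree-traversal code at all.

def scan(table):
    return {"op": "scan", "table": table}

def filter_node(child, predicate):
    return {"op": "filter", "predicate": predicate, "child": child}

def project(child, columns):
    return {"op": "project", "columns": columns, "child": child}

# Roughly: SELECT id, name FROM db.events WHERE region = 'EU'
view_ir = project(
    filter_node(scan("db.events"),
                {"op": "eq", "ref": "region", "value": "EU"}),
    ["id", "name"],
)

# The stored form of the view is just the JSON serialization of the tree.
serialized = json.dumps(view_ir, indent=2)
print(serialized)
```

>>>>>>> An engine creating a view would build this tree while walking its own
>>>>>>> AST; a reading engine would deserialize it and map each node onto its
>>>>>>> native plan representation.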
>>>>>>>
>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>
>>>>>>> it's interesting. Anjali, do you have means to enforce that, or is
>>>>>>> this just a convention?
>>>>>>>
>>>>>>> What are the common building blocks (relational operations,
>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>> views?
>>>>>>> Being able to enumerate them could help validate various approaches
>>>>>>> considered here, including feasibility of dedicated representation.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <ry...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Anjali,
>>>>>>>>
>>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>>> list once the spec has been approved by the community. My hope is that the
>>>>>>>> final version of the view spec will make it easy to re-use existing
>>>>>>>> rollback/time travel/metadata etc functionalities.
>>>>>>>>
>>>>>>>> Regarding SQL dialects: my personal opinion is that enforcing
>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>> desirable, while storing Calcite makes it hard for e.g. Python to use
>>>>>>>> views. A project to make a cross-language and cross-engine IR for SQL
>>>>>>>> views and the relevant transpilers is IMHO outside the scope of this
>>>>>>>> spec and probably deserves an Apache project of its own. A smaller IR
>>>>>>>> like Piotr suggested is possible, but I feel it would likely snowball
>>>>>>>> into a larger project and slow down adoption of the view spec in
>>>>>>>> Iceberg. So I think the most reasonable way forward is to add a
>>>>>>>> dialect field and a warning to
>>>>>>>> engines that views are not (yet) cross-compatible. This is at odds with the
>>>>>>>> original spirit of Iceberg tables and I wonder how the broader community
>>>>>>>> feels about it? I would hope that we can make the view spec engine-free
>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>
>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have
>>>>>>>>> and will continue to address them in the doc.
>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>
>>>>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>>>>> spec as it is proposed in the doc has been implemented and is in use at
>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will rebase
>>>>>>>>> our code to Iceberg 0.12 and incorporate changes to the format and
>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>
>>>>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>>>>> community to help are:
>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>> version or time
>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>> similar to the $snapshots table in order to display the history of a view
>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>>>>> version
>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>
>>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to provide
>>>>>>>>> a storage format for view metadata and APIs to access that metadata. The
>>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside the
>>>>>>>>> scope of this spec.
>>>>>>>>>
>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of the
>>>>>>>>> SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>>>>> operability
>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>>>>> functionality does it provide over Calcite?)
>>>>>>>>>
>>>>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible across
>>>>>>>>> engines.
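>>>>>>>>> For instance, option 2 above could be as small as one extra field
>>>>>>>>> next to the SQL in the view metadata (field names illustrative, not
>>>>>>>>> part of the current proposal):

```json
{
  "sql": "SELECT id, name FROM db.events WHERE region = 'EU'",
  "dialect": "spark-3.1"
}
```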
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Anjali
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>
>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>> between query engines.
>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>
>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>> I treat it as a great technology supporting transitioning from a
>>>>>>>>>> query engine to a better one.
>>>>>>>>>> However, I would not base a design of some new system for storing
>>>>>>>>>> cross-engine compatible views on that.
>>>>>>>>>>
>>>>>>>>>> Is there something else we can use?
>>>>>>>>>> Maybe the view definition should use some intermediate structured
>>>>>>>>>> language that's not SQL?
>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing and
>>>>>>>>>> analysis.
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <ry...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>
>>>>>>>>>>> I have left some comments on the document. I unfortunately have
>>>>>>>>>>> to miss the community meetup tomorrow but would love to chat more/help w/
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>> <an...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been implemented [1] and
>>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>
>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>
>>>>>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
-- 
John Zhuge