Posted to dev@iceberg.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2020/04/16 21:30:17 UTC

Iceberg community sync notes - 15 April 2020

Here are my notes from yesterday’s sync. As usual, feel free to add to this
if I missed something.

There were a couple of questions raised during the sync that we’d like to
open up to anyone who wasn’t able to attend:

   - Should we wait for the parallel metadata rewrite action before cutting
   0.8.0 candidates?
   - Should we wait for ORC metrics before cutting 0.8.0 candidates?

In the sync, we thought that it would be good to wait and get these in.
Please reply to this if you agree or disagree.

Thanks!

*Attendees*:

   - Ryan Blue
   - Dan Weeks
   - Anjali Norwood
   - Jun Ma
   - Ratandeep Ratti
   - Pavan
   - Christine Mathiesen
   - Gautam Kowshik
   - Mass Dosage
   - Filip
   - Ryan Murray

*Topics*:

   - 0.8.0 release blockers: actions, ORC metrics
   - Row-level delete update
   - Parquet vectorized read update
   - InputFormats and Hive support
   - Netflix branch

*Discussion*:

   - 0.8.0 release
      - Ryan: we planned to get a candidate out this week, but I think we
      may want to wait on 2 things that are about ready
      - First: Anton is contributing an action to rewrite manifests in
      parallel that is close to ready. Anyone interested in reviewing?
      (Gautam is interested)
      - Second: ORC is passing correctness tests, but doesn’t have
      column-level metrics. Should we wait for this?
      - Ratandeep: ORC also lacks predicate push-down support
      - Ryan: I think metrics are more important than PPD because PPD is
      task-side, while metrics help reduce the number of tasks. If we wait
      on one, I’d prefer to wait on metrics
      - Ratandeep will look into whether he or Shardul can work on this
      - General consensus was to wait for these features before getting a
      candidate out
   - Row-level deletes
      - Good progress in several PRs on adding the parallel v2 write path,
      as Owen suggested last sync
      - Junjie contributed an update to the spec for file/position delete
      files
   - Parquet vectorized read
      - Dan: flat schema reads are primarily waiting on reviews
      - Dan: is anyone interested in complex type support?
      - Gautam needs struct and map support. Arrow 0.14.0 doesn’t support maps
      - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but
      not maps of structs
      - Ryan (Blue): Because we have a translation layer in Iceberg to pass
      off to Spark, we don’t actually need support in Arrow. We are currently
      stuck on 0.14.0 because of changes that prevent us from avoiding a null
      check (see this comment
      <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>
      )
   - InputFormat and Hive support
      - Ratandeep: Generic (mapreduce) InputFormat is in, with hooks for
      Pig and Hive; need to start working on the serde side and building a
      Hive StorageHandler; DDL support is still missing
      - Ryan: What DDL support?
      - Ratandeep: Statements like ADD PARTITION
      - Ryan: How would all of this work in Hive? It isn’t clear what
      components are needed right now: StorageHandler? RawStore? HiveMetaHook?
      - Ratandeep: Currently working on only the read path, which requires
      a StorageHandler. The write path would be more difficult.
      - Mass Dosage: Working on a (mapred) InputFormat for Hive in
      iceberg-mr, started working on a serde in iceberg-hive. Interested in
      writes, but not in the short or medium term
      - Mass Dosage: The main problem is dependency conflicts between Hive
      and Iceberg, mainly Guava
      - Ryan: Anyone know a good replacement for Guava collections?
      - Ryan: In Avro, we have a module that shades Guava
      <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
      and has a class with references
      <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
      Then shading can minimize the shaded classes. We could do that here.
      - Ryan: Is Jackson also a problem?
      - Mass Dosage: Yes, and Calcite
      - Ryan: Calcite probably isn’t referenced directly, so we can
      hopefully avoid the consistent-versions problem by excluding it
   - Netflix branch of Iceberg (with non-Iceberg additions)
      - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based
      branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4>
      for Spark 2.4 with DSv2 backported <https://github.com/Netflix/spark>
      - The purpose of this is to share non-Iceberg work that we use to
      complement Iceberg, like views, catalogs, and DSv2 tables
      - Views are SQL views
      <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
      that are stored and versioned like Iceberg metadata. This is how we are
      tracking views for Presto and Spark (coral integration would be nice!).
      We are contributing the Spark DSv2 ViewCatalog to upstream Spark
      - Metacat is an open metastore project from Netflix. The metacat
      package contains our metastore integration
      <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
      for it.
      - The batch package contains Spark and Hive table implementations for
      Spark’s DSv2
      <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
      which we use for multi-catalog support.
   - Gautam: how will migration to Iceberg’s v2 format work for those of us
   in production using v1?
      - Ryan: Tables are explicitly updated to v2 and both v1 and v2 will
      be supported in parallel. Using v1 until everything is updated with v2
      support takes care of forward-compatibility issues. This can be done on a
      per-table basis
      - Gautam: Does migration require rewriting metadata?
      - Ryan: No, the format is backward compatible with v1, so the update
      is metadata-only until the writers start using new metadata that v1 would
      ignore (deletes) and would incorrectly modify if it were to write to v2.
      - Ryan: Also, Iceberg already has a forward-compatibility check that
      will prevent v1 readers from loading a v2 table.

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg community sync notes - 15 April 2020

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> The views from the Netflix branch are a great feature; are there any plans
> to port them to Apache Iceberg?

I think we'd be willing to contribute the view support to the Apache
project if everyone thinks it is a good idea for Iceberg to handle views.

I don't want feature creep to cause the project to be difficult to
maintain, but I think it does make some sense to add views since we already
have so much code that could be shared for interacting with metastores.

>> On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <ma...@gmail.com> wrote:
>>
>>> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>>>
>>> 0.8.0 release - my general preference is to release early and release
>>> often. If features aren't ready why wait? Why not go with a 0.8.0 release
>>> now and then a 0.9.0 (or whatever) a couple of weeks later with the other
>>> features? I know with Apache projects this can sometimes be a challenge
>>> with all the ceremony around a release, getting votes etc. but I don't
>>> think that's such a problem in the incubating stage?
>>>
>>> A clarification on the InputFormats - I think the DDL Ratandeep was
>>> referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
>>> i.e. the "read" path but for statements other than "SELECT" etc. Also, to
>>> be clear the `mapreduce` InputFormat that was contributed - it sounds like
>>> that works for Pig but I don't think it will work for Hive 1 or 2 since
>>> they use the `mapred` API for InputFormats. This is what we have attempted
>>> to cover in our InputFormat. I raised a WIP PR for it yesterday at
>>> https://github.com/apache/incubator-iceberg/pull/933 and would
>>> appreciate feedback from anyone interested in it.
>>>
>>> Thanks for sharing the Avro hack for shading and relocating Guava.
>>> Should I create a ticket on GitHub to capture this work? We'll then have a
>>> go at implementing it.
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>>
>>> On Fri, 17 Apr 2020 at 04:07, OpenInx <op...@gmail.com> wrote:
>>>
>>>> Thanks for the write-up.
>>>> The views from the Netflix branch are a great feature; are there any
>>>> plans to port them to Apache Iceberg?
>>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg community sync notes - 15 April 2020

Posted by Mass Dosage <ma...@gmail.com>.
Cool. I've raised a draft PR for the approach we discussed on the call:

https://github.com/apache/incubator-iceberg/pull/935/files

It's incomplete, but I've put some notes in the PR explaining that. It would
be nice to know what others think of the above approach and whether anyone
has better ideas.

Another approach that we tried successfully was to shade and relocate Guava
in every Iceberg subproject that used it; that way you can depend on it
"normally", but the build file gets pretty messy with shadow-jar versions of
everything. I can raise a WIP PR for that approach for comparison if anyone
is interested.
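For reference, the per-subproject relocation described above would look roughly like this in a Gradle build file using the Shadow plugin (the plugin version, Guava version, and relocated package name here are illustrative, not taken from an actual PR):

```groovy
// build.gradle for one Iceberg subproject -- illustrative sketch only.
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

dependencies {
    implementation 'com.google.guava:guava:28.0-jre'
}

shadowJar {
    // Rewrite Guava's packages so they cannot clash with Hive's Guava.
    relocate 'com.google.common', 'org.apache.iceberg.shaded.com.google.common'
}
```

Repeating this in every subproject is what makes the build messy: consumers then depend on each module's `shadowJar` output instead of the plain jar.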

Thanks,

Adrian




Re: Iceberg community sync notes - 15 April 2020

Posted by RD <rd...@gmail.com>.
Thanks for the correction, Adrian. I've filed a ticket on GitHub here:
https://github.com/apache/incubator-iceberg/issues/934 . There are two
approaches mentioned there with pros/cons. It would be good to get the
community's feedback on how to proceed.

-best,
R.

On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <ma...@gmail.com> wrote:

> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>
> 0.8.0 release - my general preference is to release early and release
> often. If features aren't ready why wait? Why not go with a 0.8.0 release
> now and then a 0.9.0 (or whatever) a couple of weeks later with the other
> features? I know with Apache projects this can sometimes be a challenge
> with all the ceremony around a release, getting votes etc. but I don't
> think that's such a problem in the incubating stage?
>
> A clarification on the InputFomats - I think the DDL Ratandeep was
> referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
> i.e. the "read" path but for statements other than "SELECT" etc. Also, to
> be clear the `mapreduce` InputFormat that was contributed - it sounds like
> that works for Pig but I don't think it will work for Hive 1 or 2 since
> they use the `mapred` API for InputFormats. This is what we have attempted
> to cover in our InputFormat. I raised a WIP PR for it yesterday at
> https://github.com/apache/incubator-iceberg/pull/933 and would appreciate
> feedback from anyone interested in it.
>
> Thanks for sharing the Avro hack for shading and relocating Guava. Should
> I create a ticket on GitHub to capture this work? We'll then have a go at
> implementing it.
>
> Thanks,
>
> Adrian
>
>
> On Fri, 17 Apr 2020 at 04:07, OpenInx <op...@gmail.com> wrote:
>
>> Thanks for the writing.
>> The views from Netflix branch is a great feature, would have any plan to
>> port to Apache Iceberg ?
>>
>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>>> this if I missed something.
>>>
>>> There were a couple of questions raised during the sync that we’d like
>>> to open up to anyone who wasn’t able to attend:
>>>
>>>    - Should we wait for the parallel metadata rewrite action before
>>>    cutting 0.8.0 candidates?
>>>    - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>
>>> In the sync, we thought that it would be good to wait and get these in.
>>> Please reply to this if you agree or disagree.
>>>
>>> Thanks!
>>>
>>> *Attendees*:
>>>
>>>    - Ryan Blue
>>>    - Dan Weeks
>>>    - Anjali Norwood
>>>    - Jun Ma
>>>    - Ratandeep Ratti
>>>    - Pavan
>>>    - Christine Mathiesen
>>>    - Gautam Kowshik
>>>    - Mass Dosage
>>>    - Filip
>>>    - Ryan Murray
>>>
>>> *Topics*:
>>>
>>>    - 0.8.0 release blockers: actions, ORC metrics
>>>    - Row-level delete update
>>>    - Parquet vectorized read update
>>>    - InputFormats and Hive support
>>>    - Netflix branch
>>>
>>> *Discussion*:
>>>
>>>    - 0.8.0 release
>>>       - Ryan: we planned to get a candidate out this week, but I think
>>>       we may want to wait on 2 things that are about ready
>>>       - First: Anton is contributing an action to rewrite manifests in
>>>       parallel that is close. Anyone interested? (Gautam is interested)
>>>       - Second: ORC is passing correctness tests, but doesn’t have
>>>       column-level metrics. Should we wait for this?
>>>       - Ratandeep: ORC also lacks predicate push-down support
>>>       - Ryan: I think metrics are more important than PPD because PPD
>>>       is task side and metrics help reduce the number of tasks. If we wait on
>>>       one, I’d prefer to wait on metrics
>>>       - Ratandeep will look into whether he or Shardul can work on this
>>>       - General consensus was to wait for these features before getting
>>>       a candidate out
>>>    - Row-level deletes
>>>       - Good progress in several PRs on adding the parallel v2 write
>>>       path, as Owen suggested last sync
>>>       - Junjie contributed an update to the spec for file/position
>>>       delete files
>>>    - Parquet vectorized read
>>>       - Dan: flat schema reads are primarily waiting on reviews
>>>       - Dan: is anyone interested in complex type support?
>>>       - Gautam needs struct and map support. 0.14.0 doesn’t support maps
>>>       - Ryan (Murray): 0.17.0 will have lists, structs, and maps, but
>>>       not maps of structs
>>>       - Ryan (Blue): Because we have a translation layer in Iceberg to
>>>       pass off to Spark, we don’t actually need support in Arrow. We are
>>>       currently stuck on 0.14.0 because of changes that prevent us from avoiding
>>>       a null check (see this comment
>>>       <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>
>>>       )
>>>    -
>>>
>>>    InputFormat and Hive support
>>>    - Ratandeep: The generic (mapreduce) InputFormat is in, with hooks
>>>       for Pig and Hive; next steps are the serde side and a Hive
>>>       StorageHandler. DDL support is still missing
>>>       - Ryan: What DDL support?
>>>       - Ratandeep: Statements like ADD PARTITION
>>>       - Ryan: How would all of this work in Hive? It isn’t clear what
>>>       components are needed right now: StorageHandler? RawStore? HiveMetaHook?
>>>       - Ratandeep: Currently working on only the read path, which
>>>       requires a StorageHandler. The write path would be more difficult.
>>>       - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>>>       iceberg-mr, started working on a serde in iceberg-hive. Interested in
>>>       writes, but not in the short or medium term
>>>       - Mass Dosage: The main problem is dependency conflicts between
>>>       Hive and Iceberg, mainly Guava
>>>       - Ryan: Anyone know a good replacement for Guava collections?
>>>       - Ryan: In Avro, we have a module that shades Guava
>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>>>       and has a class with references
>>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>>>       Then shading can minimize the shaded classes. We could do that here.
>>>       - Ryan: Is Jackson also a problem?
>>>       - Mass Dosage: Yes, and calcite
>>>       - Ryan: Calcite probably isn’t referenced directly so we can
>>>       hopefully avoid the consistent versions problem by excluding it
>>>    - Netflix branch of Iceberg (with non-Iceberg additions)
>>>       - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based
>>>       branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4>
>>>       for Spark 2.4 with DSv2 backported
>>>       <https://github.com/Netflix/spark>
>>>       - The purpose of this is to share non-Iceberg work that we use to
>>>       complement Iceberg, like views, catalogs, and DSv2 tables
>>>       - Views are SQL views
>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>>>       that are stored and versioned like Iceberg metadata. This is how we are
>>>       tracking views for Presto and Spark (coral integration would be nice!). We
>>>       are contributing the Spark DSv2 ViewCatalog to upstream Spark
>>>       - Metacat is an open metastore project from Netflix. The metacat
>>>       package contains our metastore integration
>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>>>       for it.
>>>       - The batch package contains Spark and Hive table implementations
>>>       for Spark’s DSv2
>>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>>>       which we use for multi-catalog support.
>>>    - Gautam: how will migration to Iceberg’s v2 format work for those
>>>    of us in production using v1?
>>>       - Ryan: Tables are explicitly updated to v2 and both v1 and v2
>>>       will be supported in parallel. Using v1 until everything is updated with v2
>>>       support takes care of forward-compatibility issues. This can be done on a
>>>       per-table basis
>>>       - Gautam: Does migration require rewriting metadata?
>>>       - Ryan: No, the format is backward compatible with v1, so the
>>>       update is metadata-only until the writers start using new metadata that v1
>>>       would ignore (deletes) and would incorrectly modify if it were to write to
>>>       v2.
>>>       - Ryan: Also, Iceberg already has a forward-compatibility check
>>>       that will prevent v1 readers from loading a v2 table.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

Re: Iceberg community sync notes - 15 April 2020

Posted by Mass Dosage <ma...@gmail.com>.
Thanks for the detailed notes Ryan. My thoughts on a few of the topics...

0.8.0 release - my general preference is to release early and release
often. If features aren't ready, why wait? Why not go with a 0.8.0 release
now and then a 0.9.0 (or whatever) a couple of weeks later with the other
features? I know with Apache projects this can sometimes be a challenge
with all the ceremony around a release, getting votes etc. but I don't
think that's such a problem in the incubating stage?

A clarification on the InputFormats - I think the DDL Ratandeep was
referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
i.e. the "read" path but for statements other than "SELECT" etc. Also, to
be clear the `mapreduce` InputFormat that was contributed - it sounds like
that works for Pig but I don't think it will work for Hive 1 or 2 since
they use the `mapred` API for InputFormats. This is what we have attempted
to cover in our InputFormat. I raised a WIP PR for it yesterday at
https://github.com/apache/incubator-iceberg/pull/933 and would appreciate
feedback from anyone interested in it.

Thanks for sharing the Avro hack for shading and relocating Guava. Should I
create a ticket on GitHub to capture this work? We'll then have a go at
implementing it.
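For anyone picking this up, the Avro-style approach could look roughly like
the following with the Gradle Shadow plugin. This is a minimal, hypothetical
sketch: the plugin version, Guava version, and relocated package name are
assumptions for illustration, not the actual Iceberg build.

```groovy
// build.gradle — hypothetical sketch of Avro-style Guava shading.
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'com.google.guava:guava:28.2-jre'
}

shadowJar {
    // Relocate Guava so the bundled copy cannot clash with Hive's version
    relocate 'com.google.common', 'org.apache.iceberg.shaded.com.google.common'

    // Drop dependency classes that nothing references; a class that
    // statically mentions the Guava types in use (like Avro's GuavaClasses)
    // keeps them from being stripped
    minimize()
}
```

Running `gradle shadowJar` would then produce a jar whose Guava classes live
under the relocated package, leaving Hive's own Guava on the classpath
untouched.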

Thanks,

Adrian



Re: Iceberg community sync notes - 15 April 2020

Posted by OpenInx <op...@gmail.com>.
Thanks for writing this up.
The views from the Netflix branch are a great feature. Is there any plan to
port them to Apache Iceberg?
