Posted to dev@iceberg.apache.org by Sam Redai <sa...@tabular.io> on 2022/01/15 02:21:41 UTC

Meeting Minutes from 01/05 Iceberg Sync

Hey Everyone!

Here are the minutes and video recording from our Iceberg Sync that took
place on January 5th, 9am-10am PT. A quick reminder that since the
previous sync was pushed forward one week, we have a shorter window this
time and the next sync is this coming week on 01/19 at 9am PT. If you have
any highlights or agenda items, don't forget to include them in the live doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>.

As always, anyone can join the discussion so feel free to share the
Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
anyone who is seeking an invite. The notes and the agenda are posted in the
live doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
that's also attached to the meeting invitation.

Minutes:

Meeting Recording ⭕
<https://drive.google.com/file/d/1o-GQg0ER1Jco9RC1ayiXZi4tf4w1BUV_/view?ts=61d71e9c>

Top of the Meeting Highlights

   - Lock Manager support for HadoopCatalog: The lock manager functionality
     was recently added by Jack Ye, and Nan has added support for it in the
     HadoopCatalog!

   - Expiration for the Cache Manager: This functionality has been added by
     Kyle and is now configurable. It addresses situations where Spark would
     keep cached tables around longer than desired.

   - NOT_STARTS_WITH Operator: This is a significant addition by Kyle that
     allows Iceberg to handle instances where Spark negates a STARTS WITH
     predicate. (A sketch follows this list.)

   - Time Traveling w/ Schema Changes now Fixed: Wing Yew made an update so
     that time traveling also uses the table schema that was current at the
     time of the selected snapshot. (Spark 2.4 through Spark 3.2; sketch
     below.)

   - GCSFileIO: This was recently added by Dan. Also, thanks to Kyle for
     testing it out!

   - Spark vectorized reads with equality deletes: Yufei has this added and
     working!

   - DELETE, UPDATE, MERGE in Spark 3.2: The work is continuing on the
     copy-on-write plans for Spark 3.2, and the 0.13.0 release will soon be
     unblocked. Thanks Anton!

   - Rewrite data files stored procedure: Allows you to select portions of
     the table to consider for rewrites. This was recently added by Ajantha!
     (Sketch below.)
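
For anyone who wants to try these out, a few rough sketches follow. First,
the NOT_STARTS_WITH predicate in Iceberg's expression API. The column name
and prefix are made up, and this assumes the notStartsWith factory method
mirrors the existing startsWith one:

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class NotStartsWithSketch {
  public static void main(String[] args) {
    // Roughly the predicate Spark pushes down when a user writes
    // NOT (col LIKE 'prefix%'). Column name and prefix are hypothetical.
    Expression expr = Expressions.notStartsWith("uri", "https://");
    System.out.println(expr);
  }
}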

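Next, a minimal sketch of time traveling from Spark using the snapshot-id
and as-of-timestamp read options. The table name and snapshot values are
made up; with Wing Yew's fix, these reads should also use the schema that
was current for the selected snapshot:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-time-travel")
        .getOrCreate();

    // Time travel to a specific snapshot id (hypothetical value).
    Dataset<Row> bySnapshot = spark.read()
        .option("snapshot-id", "10963874102873")
        .table("db.events");

    // Time travel to the snapshot current as of a timestamp (ms since epoch).
    Dataset<Row> byTimestamp = spark.read()
        .option("as-of-timestamp", "1641369600000")
        .table("db.events");

    bySnapshot.show();
    byTimestamp.show();
  }
}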

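And a sketch of the rewrite data files procedure scoped to a portion of the
table via its where argument. The catalog, table, and predicate are
hypothetical, and the CALL syntax assumes Iceberg's SQL extensions are
enabled:

import org.apache.spark.sql.SparkSession;

public class RewriteDataFilesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("rewrite-data-files")
        // CALL procedures require Iceberg's SQL extensions.
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate();

    // Only the portion of the table matching the predicate is considered
    // for rewriting.
    spark.sql(
        "CALL my_catalog.system.rewrite_data_files("
            + "table => 'db.events', "
            + "where => 'event_ts >= 1640995200000')");
  }
}
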
Upcoming 0.13.0 Release

   - Bugfixes are getting in very quickly.

   - Spark 3.2 support should be in by the end of the week to unblock this
     (waiting on MERGE support).

   - A release candidate is expected in ~1 week.


MergeOnRead feature

   - Anton is working on this currently.

   - Support for this required moving all of the plans from the optimizer in
     Spark into the analyzer. (This is a pretty significant change to how
     plans work in Iceberg’s SQL extensions, and it is a Spark 3.2 change
     only.)

   - DELETE FROM is in an approved PR and should be merged soon. (A sketch
     of the Spark 3.2 row-level commands follows below.)
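
As a quick illustration of the row-level commands this work unblocks in
Spark 3.2, here's a sketch with made-up table and column names, again
assuming Iceberg's SQL extensions are enabled:

import org.apache.spark.sql.SparkSession;

public class RowLevelCommandsSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("row-level-commands")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate();

    // Copy-on-write DELETE rewrites only the files that contain matching rows.
    spark.sql("DELETE FROM db.events WHERE level = 'DEBUG'");

    // MERGE upserts changes from a (hypothetical) updates table into the target.
    spark.sql(
        "MERGE INTO db.events t USING db.event_updates s "
            + "ON t.id = s.id "
            + "WHEN MATCHED THEN UPDATE SET * "
            + "WHEN NOT MATCHED THEN INSERT *");
  }
}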

Tagging and branching

   - The work on adding this to the Java implementation is underway, and the
     more eyes on this the better.

   - This is a good time for considerations on how we identify branches and
     tags in SELECT queries in various engines.

   - How should branch history be used?

      - It is more useful if time-traveling in a branch uses the history of
        that branch instead of the current main branch.

      - This would require an update to the spec.

   - The proposed spec defines `min-snapshots-to-keep` and
     `max-snapshot-age-ms` as the defaults for all branches, which can then
     be overridden for particular branches.

Delete read optimization

   - Support for vectorized readers for positional deletes and equality
     deletes has been merged in.

   - For non-vectorized reads, some memory optimizations are pending: PR
     #3535 <https://github.com/apache/iceberg/pull/3535>

REST Catalog

   - There have been a lot of discussions, and the REST catalog spec is
     coming together!

   - The OpenAPI spec for namespace operations should be ready to merge soon
     (create namespace, drop namespace, create namespace property, etc.)

   - One of the goals here is to have a standardized API to enable more
     flexible implementations that don’t require users to load a runtime
     jar. Other goals include better conflict detection and handling cases
     where an old writer drops refs.

   - The REST catalog should also enable light wrapping of an existing
     catalog implementation, e.g. JDBC, to expose a language-independent
     interface over an existing catalog, although we most likely will not
     include such a service as part of the open-source Iceberg project. (A
     hypothetical sketch follows below.)
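
To make the language-independent point concrete, here's a hypothetical
sketch of listing namespaces over HTTP. The host, port, and path are
illustrative only, since the spec is still under review:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestCatalogSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // List namespaces from a hypothetical REST catalog service; any client
    // that speaks HTTP could do this, with no Iceberg runtime jar involved.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8181/v1/namespaces"))
        .GET()
        .build();

    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}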

Parquet/ORC Bloom Filter Support

   - There have been discussions in the past about taking advantage of the
     bloom filters available in the Parquet and ORC file formats.

   - This would most likely require reasonable configuration at the table
     level. (Correctly configuring the filter may in fact be the most
     challenging part of implementing this; a hypothetical sketch follows
     below.)

   - It’s possible that additional complexity exists on the write side when
     factoring in schema evolution.
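
Purely as an illustration of the kind of table-level knob discussed, here's
a sketch; the property name below is hypothetical, since no configuration
has been agreed on yet:

import org.apache.spark.sql.SparkSession;

public class BloomFilterConfigSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bloom-filter-config")
        .getOrCreate();

    // Hypothetical per-column toggle for writing Parquet bloom filters; the
    // real property name and granularity are still open questions.
    spark.sql(
        "ALTER TABLE db.events SET TBLPROPERTIES ("
            + "'write.parquet.bloom-filter-enabled.column.device_id' = 'true')");
  }
}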

Potential Spark support for streaming change data feeds (out of Iceberg
tables)

   - Flink is currently implemented using pre-update, post-update, delete,
     and insert rows, and we should be able to do the same thing with Spark
     by adding a reader or mode that uses that as the schema.

   - An alternative to log segments is to read the previous snapshot and the
     current snapshot and calculate the diff live. That’s made much easier
     with merge-on-read. (A rough sketch follows below.)

   - Calculating the diff live would have the challenge of determining which
     record in the previous snapshot corresponds to an updated record in the
     current snapshot.
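
For reference, a rough sketch of the calculate-the-diff-live idea using two
snapshot reads and EXCEPT ALL. The table name and snapshot ids are made up,
and per the last point, this surfaces changed rows but can't by itself pair
a pre-update row with its post-update counterpart:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotDiffSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("snapshot-diff")
        .getOrCreate();

    Dataset<Row> previous = spark.read()
        .option("snapshot-id", "1111")  // hypothetical earlier snapshot
        .table("db.events");
    Dataset<Row> current = spark.read()
        .option("snapshot-id", "2222")  // hypothetical later snapshot
        .table("db.events");

    // Rows removed between the snapshots: deletes plus pre-update images.
    Dataset<Row> removed = previous.exceptAll(current);
    // Rows added between the snapshots: inserts plus post-update images.
    Dataset<Row> added = current.exceptAll(previous);

    removed.show();
    added.show();
  }
}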


Have a great weekend!
