Posted to dev@iceberg.apache.org by Sam Redai <sa...@tabular.io> on 2022/01/19 20:27:50 UTC

Meeting Minutes from 01/19 Iceberg Sync

Hello Iceberg Community,

Below you can find the minutes and video recording from our Iceberg Sync
that took place on *January 19th, 9am-10am PT*.

As always, anyone can join the discussion, so feel free to share the
Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
anyone who is seeking an invite. The notes and the agenda are posted in the
live doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>,
which is also attached to the meeting invitation. It's a good place to add
items as you see fit so we can discuss them in the next community sync.

Minutes:

Meeting Recording ⭕
<https://drive.google.com/file/d/1adbE5ichvfM3_-v2mcH16k1NMiCNjW1O/view>

Top of the Meeting Highlights

   -

   Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
   -

   Added dynamic file filter metrics in Spark 3.0 and 3.1 (Thanks, Chen!)
   -

      Iceberg-spark metrics are now visible in the Spark UI
      -

         Candidate files that were scanned
         -

         Matching files that went on into subsequent stages such as merge
         or update plans
         -

         This is useful to check out as an example of how to easily add
         metrics to some of our custom plans and have them show up in the
         Spark UI
         -

   Added initial OpenAPI spec for REST catalogs (Thanks, Kyle!)
   -

      The spec for the REST catalog has been shared and lots of community
      members have been looking closely at it.
      -

      This will have a big impact, particularly on SaaS implementations, so
      the more feedback from the community on this the better.

0.13.0 Release Candidate

   -

   Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
   -

   Sort order in relation to copy-on-write merge might have a few remaining
   items but it feels ready to merge now and improve in upcoming releases if
   necessary.
   -

   Parquet support for the older 2-level list style: If we could get that
   into the 0.13.0 release, that would be great although it’s not a blocker
   and we can follow up with a quick subsequent release. PR #3774
   <https://github.com/apache/iceberg/pull/3774>
   -

   Checksum validation on S3 requests is another relevant open pull
   request. This is also not a blocker for 0.13.0 but is close enough that we
   may want to include it. PR #3813
   <https://github.com/apache/iceberg/pull/3813>
   -

   Nessie-related changes for the 0.13.0 release are merged and ready.

Python

   -

   A lot of interest in the community and among our user base
   -

   Some ongoing discussions on how to get review cycles shortened
   -

   Types PR was merged yesterday and there are a couple of current PRs that
   are very near merge-ready
   -

   There is an open invitation to anyone interested in participating in the
   Python refactoring efforts, either contributing or reviewing. Check out
   the Iceberg Python Sync <https://groups.google.com/g/iceberg-python-sync>
   to get involved!

Java 1.0 API

   -

   Managing delete files is a big request for the 1.0 API. We’ve recently
   added a delete file threshold to the rewrite files action which drives how
   many delete files will remain. You can also set this to 0 to rewrite all
   delete files. This is enough to take delete compaction off the list of
   remaining high-priority items.
   -

   Other potential remaining items to consider
   -

      Expiring delete files that are no longer used by the current
      snapshots for a boost to storage efficiency. This functionality could
      simply be added to the expire snapshot action.
      -

   No other high-priority maintenance operations seem to be remaining for
   the 1.0 release
   -

   More discussion is needed around what should be public/private and how
   we’ll evolve the API over time.
   -

   Currently, the target is for the next Java API release to be 1.0
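
The delete file threshold mentioned in the first item of this section can be pictured with a small Python model. This is only an illustrative sketch and not Iceberg's implementation: the `DataFile` class, the `select_for_rewrite` function, and the file names are hypothetical, and the rule it applies (rewrite a data file once the number of delete files attached to it reaches the threshold) is an assumed reading of the option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file tracked by a table."""
    path: str
    delete_file_count: int  # delete files that currently apply to this data file


def select_for_rewrite(files, delete_file_threshold):
    """Return the data files whose delete-file count meets the threshold."""
    return [f for f in files if f.delete_file_count >= delete_file_threshold]


# Made-up files for illustration only.
files = [
    DataFile("a.parquet", 0),
    DataFile("b.parquet", 2),
    DataFile("c.parquet", 5),
]

# A threshold of 3 only picks up the file with many deletes attached.
moderate = select_for_rewrite(files, 3)

# A threshold of 0 selects every file, so all deletes get applied.
everything = select_for_rewrite(files, 0)
```

Under this reading, raising the threshold leaves more delete files in place after a rewrite, while a threshold of 0 leaves none.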

Other High Priority Items:

Alternative File Formats

   -

   Ashish has been taking a look at this and it seems very doable. Current
   formats (ORC, Parquet, etc.) share very similar interfaces that can inform
   the abstraction.

Encryption

   -

   This is a high-priority item that’s in demand and can hopefully get in
   soon. There are a few PRs open and community review is welcome.

CDC

   -

   Not necessarily high priority but there’s a strong desire to get this in
   this calendar year. Refreshing materialized views in Spark is highly
   dependent on this functionality. Yufei is working on a design doc.

Tagging Snapshots and Searching Snapshots by Tag

   -

   Exposing tags to users would make it easy to retrieve a previous table
   state. For example, snapshots could be tagged on an hourly or daily basis
   for convenient lookups later.
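
As a rough illustration of the idea, the Python sketch below keeps a mapping from tag names to snapshot IDs and derives a daily tag name from a timestamp. Everything in it is hypothetical: the `SnapshotTags` class, the `daily-YYYY-MM-DD` naming scheme, and the snapshot IDs are assumptions made for the example, not a proposed or existing Iceberg API.

```python
from datetime import datetime, timezone


class SnapshotTags:
    """Toy in-memory mapping from tag name to snapshot ID (not an Iceberg API)."""

    def __init__(self) -> None:
        self._tags = {}

    def tag(self, name: str, snapshot_id: int) -> None:
        # Refuse to silently move an existing tag to another snapshot.
        if name in self._tags:
            raise ValueError(f"tag already exists: {name}")
        self._tags[name] = snapshot_id

    def snapshot_for(self, name: str) -> int:
        # Raises KeyError if the tag is unknown.
        return self._tags[name]


def daily_tag_name(ts: datetime) -> str:
    """Build a tag name such as 'daily-2022-01-19' (assumed naming scheme)."""
    return ts.strftime("daily-%Y-%m-%d")


# Made-up snapshot IDs, tagged once per day.
tags = SnapshotTags()
tags.tag(daily_tag_name(datetime(2022, 1, 18, tzinfo=timezone.utc)), 1001)
tags.tag(daily_tag_name(datetime(2022, 1, 19, tzinfo=timezone.utc)), 1002)
```

A user who wants the table as it was on January 18th would then look up the tag by name and read from the snapshot it returns, instead of searching the snapshot history by timestamp.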

Z-Ordering

   -

   Significant progress has been made on Z-Ordering and one of the current
   discussions is around its implementation with respect to the magnitude
   problem. Specifically, normalizing values may require metrics/stats on
   column distributions.
   -

   In its current state, Apple’s implementation doesn’t include any stats
   on distribution, but it is valuable as an initial implementation: merging
   it unblocks the feature and gets it out to the community, and the
   magnitude problem can be solved in a subsequent release.
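
To make the magnitude problem concrete, here is a minimal Python sketch of bit interleaving for two unsigned integer columns. It is not the implementation discussed in the meeting: the function names, the 8-bit width, and the sample data are assumptions chosen for illustration, and `normalize` takes a known minimum and maximum as a stand-in for the column statistics mentioned above.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Build a Z-value by alternating the bits of x and y.

    Bit i of x lands at position 2*i + 1 and bit i of y at position 2*i.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def normalize(value: int, lo: int, hi: int, bits: int = 8) -> int:
    """Rescale value from [lo, hi] onto [0, 2**bits - 1].

    lo and hi stand in for column statistics, which is why normalizing
    may require metrics on the column's distribution.
    """
    if hi == lo:
        return 0
    return (value - lo) * ((1 << bits) - 1) // (hi - lo)


# Hypothetical data: column a spans 0..255, column b only spans 0..3.
rows = [(a, b) for a in (0, 64, 128, 192) for b in (0, 3)]

# Raw interleaving: b only sets low-order bits of the Z-value, so a
# dominates and the Z-order collapses into a plain sort on (a, b).
raw = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

# Normalizing b with its (assumed known) min/max lets both columns
# contribute to the high-order bits.
balanced = sorted(rows, key=lambda r: interleave_bits(r[0], normalize(r[1], 0, 3)))
```

In this toy data, `raw` comes out identical to an ordinary sort by `a` then `b`, which gives no clustering benefit for filters on `b`. After normalization, rows with different `b` values are interleaved with rows of different `a` values.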

REST Catalog

   -

   There’s a lot of interest in the REST catalog and there are at least a
   few areas of the community that are ready to use it immediately, e.g.
   Nessie and Apple
   -

   Future talks about pushing more work into the server implementations:
   -

      Should planning be a part of the REST catalog API?
      -

      Pagination mechanism for accessing all of the snapshots in a table?

Relative Paths (design doc
<https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#heading=h.hxmtkjthp8hm>)

   -

   High priority for a few community members, primarily Apple.
   -

   Many use cases for this feature across the community (surrounding data
   recovery)

Views

   -

   LinkedIn and Netflix are interested in this effort and picking it back up
   -

   A question is how we’d be able to integrate this back into Spark after
   it’s added on the Iceberg side

Secondary Indexes

   -

   A priority for the Athena team, but if other members also consider this
   a high priority, please reach out and share details about your use cases.
   -

   Might be useful for cases where you can’t order the data as it’s written
   to the table, or where you need to index by an additional column that’s
   not included in your sort order.

Handling Wide Tables (many columns)

   -

   In such a scenario, the metadata files can get very large given that we
   store metrics for each column. The MetricsConfig is key for optimizing this.
   -

   See table configuration <https://iceberg.apache.org/configuration/> for
   more details
   -

      In particular, setting `write.metadata.metrics.default` to `none` or
      tuning this at the column level using
      `write.metadata.metrics.column.col1`
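
As a sketch of how these two properties might be combined for a wide table, the Python helper below builds the property map and renders it as a Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` statement. The helper names, the table name `db.wide_table`, and the column `col2` are hypothetical. The metrics modes used (`none`, `counts`, `full`) are taken from the table configuration page linked above, so please check that page for the modes supported by your version.

```python
from __future__ import annotations


def metrics_properties(default_mode: str, column_modes: dict[str, str]) -> dict[str, str]:
    """Build table properties: one default mode plus per-column overrides."""
    props = {"write.metadata.metrics.default": default_mode}
    for column, mode in column_modes.items():
        props[f"write.metadata.metrics.column.{column}"] = mode
    return props


def to_alter_table_sql(table: str, props: dict[str, str]) -> str:
    """Render the properties as a Spark SQL statement (string only, nothing is run)."""
    assignments = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {assignments}\n)"


# Turn metrics off by default, then keep them only for the columns
# that are actually used in filters (hypothetical column choices).
props = metrics_properties("none", {"col1": "full", "col2": "counts"})
statement = to_alter_table_sql("db.wide_table", props)
print(statement)
```

The design choice here is to start from `none` and opt individual columns back in, which keeps metadata size proportional to the number of columns that benefit from stats, not to the total column count.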


Thank you all for another great meeting!