Posted to dev@iceberg.apache.org by Sam Redai <sa...@tabular.io> on 2021/11/19 00:23:13 UTC
Meeting Minutes from 11/17 Iceberg Sync
Hi Everyone,
Here are the minutes and video recording from our Iceberg Sync that took
place on November 17th, 9am-10am PT. Please remember that anyone can join
the discussion, so feel free to share the Iceberg-Sync
<https://groups.google.com/g/iceberg-sync> Google group with anyone who is
seeking an invite. As usual, the notes and the agenda are posted in the live
doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
that's also attached to the meeting invitation.
The recording has been shared with the Iceberg-Sync Google group. If you
have any issues accessing it, please let me know!
Meeting Recording ⭕
<https://drive.google.com/file/d/1WEXy3VPgsLRIrjsMrHXVydmm4bbEdQBg/view?usp=sharing>
Top of the Meeting Highlights
- 0.12.1 Released! - Thanks to everyone who reviewed the release and
  thanks to Kyle for managing it!
- Spark 3.2 Progress - Added support for things like dynamic filtering to
  work with v2 sources, as well as a new interface for driving sort order
  through table properties. The changes here will be key for MERGE INTO
  support with deltas.
  - Special thanks to Anton who’s been contributing a lot here!
- Bug fixes
  - Avro read path
  - Vectorized reader in Spark
- Delete File Compaction - The normal rewrite files compaction can be
  configured to detect too many delete files for a particular data file and
  compact them (Thanks Jack!)
Upcoming 0.13.0 Release
- Iceberg 0.13.0 Release Note Draft
  <https://docs.google.com/document/d/18yc8_Q6Hpc_r7JSoQO4oswQSHgHxJFDnr6Zif9_tceA/edit#heading=h.9jffz1lgqlib>
- We’re aiming to release often, so including pending changes in a
  future release is preferred over delaying a release to squeeze them in.
- Spark regressions: For the Spark 3.2 branch, some major changes were
  expected for dynamic filtering and all of the row-based commands, so
  MERGE, DELETE FROM, and UPDATE are missing in the 3.2 branch. We’re
  currently thinking through how to resolve this before the release, such
  as potentially porting them for now.
- A new 0.13.0 milestone will be created soon
- A release candidate can be expected soon, hopefully with the
  resequencing and Alibaba FileIO changes merged in
Java and Python Catalog Consistency
- On a per-catalog implementation basis, it makes sense to keep the
  implementations aligned between the Java and Python clients
- For now, let’s lean on thorough documentation for each catalog type and
  its expected behaviors, and then generally look for this consistency
  during PR reviews
- The REST catalog is probably the most suitable for providing a detailed
  catalog specification
- Trying to achieve this consistency shouldn’t hold up any of the Python
  development
REST based Catalog
- This provides a very flexible mechanism for creating various types of
  catalogs
- Beyond conforming to the REST API specification, this creates room for a
  lot of variability in how the transactions are implemented server-side
RemoveOrphanFilesAction
- Pull Request #1471 <https://github.com/apache/iceberg/pull/1471>
- Problem Description: Currently, when deleting orphan files, we diff the
  valid data files against a listing of the directories. Differences
  between the write configuration and the configuration used when deleting
  orphan files can cause some orphan files to go undetected.
- This has been discussed before, and the conclusion was that we should not
  introduce configurations for ignoring certain components of URIs. Doing
  so causes other issues, such as ignoring the authority for S3, which
  ignores the bucket in the URI. More complications are introduced when you
  consider that many tables can share a bucket/prefix.
- Follow-up: Let’s try to get a comprehensive list of the different
  scenarios and their implications
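To make the URI concern above concrete, here is a minimal Python sketch. It
is not the code of the action or of the pull request; the helper names
(find_orphans, path_only) and all bucket and file paths are hypothetical.
It models orphan detection as a diff between a directory listing and the
files referenced by table metadata, then shows how the result changes
depending on which URI components are compared.

```python
from urllib.parse import urlparse


def find_orphans(listed, valid, key=lambda uri: uri):
    """Return listed files whose comparison key is absent from the valid set."""
    valid_keys = {key(u) for u in valid}
    return sorted(u for u in listed if key(u) not in valid_keys)


def path_only(uri):
    """Drop the scheme and authority (the bucket, for S3); keep only the path."""
    return urlparse(uri).path


# Hypothetical table metadata: the file was written using the s3a scheme.
valid = ["s3a://bucket-a/db/t/data/f1.parquet"]

# Hypothetical listing: the same file seen via s3, plus a genuine orphan.
listed = [
    "s3://bucket-a/db/t/data/f1.parquet",
    "s3://bucket-a/db/t/data/orphan.parquet",
]

# Exact string comparison: the scheme mismatch means f1 no longer matches.
exact = find_orphans(listed, valid)

# Path-only comparison: f1 matches again and only the real orphan remains.
relaxed = find_orphans(listed, valid, key=path_only)

# A different bucket with a file at the same path: ignoring the authority
# makes it look valid, so it is never reported.
other_bucket = ["s3://bucket-b/db/t/data/f1.parquet"]
missed = find_orphans(other_bucket, valid, key=path_only)

print(exact)
print(relaxed)
print(missed)
```

In this sketch, relaxing the comparison fixes the scheme mismatch but hides
the file in the second bucket, which is the kind of trade-off the follow-up
list of scenarios is meant to capture.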
Trino Support for Merge on Read/Write
- There are some serialization concerns here that need to be addressed and
  the current open PRs may get redesigned soon.
- A lot of JSON serialization is being developed as part of the REST
  catalog implementation so that may solve some of the issues here.
- Ideally, serialization can be kept somewhat separate from the rest of
  the code base.
- Schema evolution implications need to be considered here as well.
Thanks everyone!