Posted to dev@iceberg.apache.org by Sam Redai <sa...@tabular.io> on 2022/05/26 16:07:44 UTC
Meeting Minutes from 05/25 Iceberg Sync
Hey Iceberg Community,
Here are the minutes and recording from our Iceberg Sync that took place
today on *May 25th, 9am-10am PT*.
Always remember, anyone can join the discussion, so feel free to share the
Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
anyone who is seeking an invite. The notes and the agenda are posted in the
live doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
that's also attached to the meeting invitation. It's a good place to add
items as you see fit so we can discuss them in the next community sync.
Meeting Recording ⭕
<https://drive.google.com/file/d/1FISvrM3eEWQuIfQnZAKbRN0Xyk2hH36_/view?usp=sharing>
Top of the Meeting Highlights
- Added an incremental append scan interface (Thanks, Steven!)
- Backports for 0.13.2 are done (Thanks, Eduard!)
- API validation using revapi was added (Thanks, Kyle!)
- Added all_files and all_delete_files metadata tables (Thanks, Szehon!)
Releases
- 0.13.2
  - All of the backports are merged
  - Milestone <https://github.com/apache/iceberg/milestone/18?closed=1>
    with merged PRs
- 1.0.0 (no 0.14 release)
  - LICENSE updates done
  - API checking is done
  - Incremental snapshot expiration still pending
  - Metadata table schema guarantees
  - An alternative option is to release an 0.14 with a quick follow-up
    1.0.0 release that removed any deprecations
Agenda
- Minimum supported Python version changed from 3.7 to 3.8
- Proposal to change from tox to pre-commit: PR #4811
  <https://github.com/apache/iceberg/pull/4811>
- Change scan: PR #4870 <https://github.com/apache/iceberg/pull/4870>
- Incremental Scans
  - Most CDC operations require excluding rows that are unchanged, or
    joining rows by ID to create a pre-image/post-image
  - Some difficulties around using DataSourceV2 (getting pure
    deleted/inserted rows requires shuffling)
  - One option is defining a view that uses incremental scans to do a
    pre-image/post-image analysis (a view catalog has not been added to
    Spark yet, but there’s an existing SPIP, SPARK-31357
    <https://issues.apache.org/jira/browse/SPARK-31357>, and PR #35636
    <https://github.com/apache/spark/pull/35636>)
- Puffin - new name for Index and Stats file-format
  - Secondary index metadata as blobs of binary data
  - Theta sketches
- Snapshot branching and tagging syntax
  - Option 1: `<database>.<table>.<branch_name>`
    - Potential conflicts with metadata table names (possibly a rare
      occurrence)
  - Option 2: Qualifying prefixes such as `branch$<branch_name>` or
    `tag:<tag_name>`
    - No standardized way in SQL to specify a tag or branch, which could
      delay implementation upstream
  - Option 3: An option/context setting
    - Setting options is currently not possible in spark-sql
  - Before implementing any of this logic, let’s work through a proposal
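To make the CDC point under Incremental Scans a bit more concrete, here is a
minimal sketch in plain Python. It is not Iceberg or Spark code, and nothing
like it was shown in the sync. It assumes an incremental scan has already
produced two row sets, the rows deleted and the rows inserted between two
snapshots, and that each row has a unique `id`. The names `changelog`,
`deleted`, `inserted` and the `id` column are hypothetical.

```python
from collections import Counter


def changelog(deleted, inserted, key="id"):
    """Pair deleted/inserted rows (dicts) into (key, pre_image, post_image).

    Illustrative assumption: `key` is unique within each input list.
    """
    # Freeze each row so identical rows can be compared and counted.
    def freeze(row):
        return tuple(sorted(row.items()))

    # Rows that appear in both sets were only rewritten, not changed.
    carried = Counter(map(freeze, deleted)) & Counter(map(freeze, inserted))

    def drop_carried(rows):
        remaining = Counter(carried)
        kept = []
        for row in rows:
            frozen = freeze(row)
            if remaining[frozen] > 0:
                remaining[frozen] -= 1  # cancel one unchanged copy
            else:
                kept.append(row)
        return kept

    pre = {row[key]: row for row in drop_carried(deleted)}
    post = {row[key]: row for row in drop_carried(inserted)}

    # Join by key: None as pre-image means insert, None as post-image means delete.
    return [(k, pre.get(k), post.get(k)) for k in sorted(pre.keys() | post.keys())]


deleted = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
inserted = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
changes = changelog(deleted, inserted)
```

In this example row 1 is dropped as unchanged, row 2 comes back as an update
with both images, row 3 as a delete and row 4 as an insert. The join by ID is
the step that would need a shuffle in a distributed engine.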
Thanks everyone!
--
Sam Redai <sa...@tabular.io>
Developer Advocate | Tabular <https://tabular.io/>