Posted to dev@iceberg.apache.org by Edgar Rodriguez <ed...@airbnb.com.INVALID> on 2021/03/08 18:43:54 UTC

Migrating legacy snapshot daily Hive table concept to Iceberg

Hi folks,

I’d like to ask for feedback on how to use Iceberg for a use case we have.
I believe other folks could be facing it as well, since it follows a
pattern commonly used with Hive tables.

Use case:
1. We have database table snapshots exported daily at 00:00 UTC. Each day
a new partition is created with a materialized snapshot (e.g.
ds=2021-02-01, ds=2021-02-02, ...).
2. We have a lot of queries written against this legacy structure.
3. We would like to start migrating to Iceberg by writing a table snapshot
and periodically committing mutations (e.g. every half hour; see the
sketch after this list).
4. We would like to retain the legacy interface (the `ds` partition acting
as a snapshot) so that the myriad of existing queries, which sometimes
target multiple snapshots at the same time, continue to work, while new
queries are written directly against Iceberg tables using time travel.
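For item 3, here is a minimal sketch of what the periodic commit could
look like with the Iceberg Java API. The table location, the incoming data
files, and the "ds" summary property are all placeholders on my side, just
to illustrate the idea:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    // Load the table (location is a placeholder).
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://warehouse/db/events");

    // Every half hour: append the newly arrived files and record which
    // logical snapshot day the commit belongs to as a summary property.
    AppendFiles append = table.newAppend();
    for (DataFile file : newDataFiles) {   // newDataFiles: files written upstream
      append.appendFile(file);
    }
    append.set("ds", "2021-02-01");        // custom summary property, not built in
    append.commit();

Setting a summary property doesn't change anything about the data layout;
it just leaves a marker on the snapshot that we can map a `ds` back to
later.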

Issues:
An issue I see in moving this use case to Iceberg is the interface: many
users already have queries that use the `ds` partitioning column to select
a snapshot - also note that with this approach users NEED to know that
they can only query these tables with a `ds` filter, otherwise they could
get duplicated rows. One thought we had to solve this was a thin wrapper,
for instance in Hive a custom table InputFormat that takes the filter
expression (with the `ds`) and maps it to a snapshot using a JSON config
file (which holds the snapshot-id to ds mapping), and something similar
for Spark (a rough sketch is below). This solution is very specific to the
use case and makes a lot of assumptions, but the idea is to present this
interface to users while using Iceberg underneath - it could serve as a
transition phase until user queries are fully migrated to using snapshots
directly.
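To make the wrapper idea more concrete, this is roughly how I picture the
lookup working. The JSON file path, the helper that parses it, and the
table/session setup are all hypothetical:

    import java.util.Map;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Hypothetical mapping loaded from the JSON config: ds -> snapshot-id.
    Map<String, Long> dsToSnapshotId =
        loadDsMapping("/configs/events-ds-mapping.json");  // helper is made up

    // The wrapper extracts the ds value from the filter expression and
    // rewrites the read as a snapshot scan instead of a partition filter.
    long snapshotId = dsToSnapshotId.get("2021-02-01");
    TableScan scan = table.newScan().useSnapshot(snapshotId);

    // Roughly equivalent in Spark (Java API), via the snapshot-id read option:
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("snapshot-id", snapshotId)
        .load("db.events");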

I still think Iceberg would be a good candidate here, to avoid duplicating
data and to remove the requirement that users know the partitioning and
its implied meaning before querying the table.

How are other folks with the same use case solving this with Iceberg?

----

On Iceberg snapshots:
I know that in Iceberg we want to abstract partitioning away from the user
as much as possible, since that is really powerful. My initial thought is
to use the natively supported table snapshots and time travel in Iceberg.
However, it's not straightforward for users to work with a snapshot-id,
and snapshots may not correspond exactly to the data at a given timestamp,
only to the point in time when the change was applied to the table. For
example, if I want the table data for 2021-01-01 00:00:00 UTC but the
commit for that particular cut-over was made at 2021-01-01 06:00:00 UTC,
then using a timestamp is not straightforward either.
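To illustrate the mismatch with a sketch (table setup omitted; the
timestamps are from the scenario above):

    import java.time.Instant;
    import org.apache.iceberg.TableScan;

    // Time travel by timestamp resolves against the commit time, not the
    // logical data date.
    long midnightUtc = Instant.parse("2021-01-01T00:00:00Z").toEpochMilli();
    TableScan scan = table.newScan().asOfTime(midnightUtc);
    // This picks the snapshot that was current at midnight UTC, which does
    // NOT include the cut-over commit made at 06:00 UTC, so users would
    // have to know to ask for some time after 06:00 instead.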
Would it make sense to introduce a `snapshot-tag` concept that could be
used to refer to a particular snapshot? I guess we could add it to the
Snapshot summary, but today there's no way to use that tag instead of the
ID to refer to the snapshot. A tag would let us mark specific snapshots
and let users query by tag, simplifying the migration a bit. We'd also
need to make sure tags are unique, just like snapshot ids. In a way I
think of this as something similar to Git, where the snapshot-id is akin
to a commit hash and the snapshot-tag is akin to a git tag. I think this
would simplify the way snapshots are used in queries.
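If the tag were stored as a summary property (as in the append sketch
above), a crude interim way to resolve it today would be to scan the
snapshot log; the property name and value here are assumptions:

    import org.apache.iceberg.Snapshot;

    // Find the snapshot whose summary carries the desired tag. With a
    // first-class snapshot-tag concept this lookup would be built in and
    // uniqueness would be enforced by the table metadata instead.
    Long taggedSnapshotId = null;
    for (Snapshot snapshot : table.snapshots()) {
      if ("2021-02-01".equals(snapshot.summary().get("ds"))) {
        taggedSnapshotId = snapshot.snapshotId();
      }
    }
    // taggedSnapshotId can then be used with newScan().useSnapshot(...)
    // or Spark's "snapshot-id" read option.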

I’m happy to hear other approaches. Thanks in advance for reading and for
your comments!

Best,
-- 
Edgar R