Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/08 06:52:49 UTC

[GitHub] [iceberg] capkurmagati commented on issue #2221: Spark: Extend expire_snapshots procedure with an optional arg for snapshot ids

capkurmagati commented on issue #2221:
URL: https://github.com/apache/iceberg/issues/2221#issuecomment-774916800


   I did something like the below to expire selected snapshots via the Java API
   ```scala
   import spark.implicits._

   // Collect the ids of snapshots committed in the target window.
   // Filter before select, so committed_at is still available for the predicates.
   val snapshots = sql("select * from iceberg.db.tbl.snapshots")
   val snapshotsToDel = snapshots
     .filter($"committed_at" > "2020-10-06 12:01:00")
     .filter($"committed_at" < "2020-10-06 12:59:00")
     .select("snapshot_id")
     .collectAsList

   // Expire each selected snapshot through the table's Java API.
   snapshotsToDel.forEach(row => table.expireSnapshots.expireSnapshotId(row.getLong(0)).commit())
   ```
   I don't know if it's a common use case, but ours is:
   1. We have a CDC pipeline for an online RDBMS table. The writer consumes the CDC log and writes to Iceberg every 15 minutes.
   2. Users can query/time-travel hot data via the 15-minute snapshots.
   3. Once the data becomes cold (usually after 2 weeks in our environment), we want to downsample the snapshots to reduce the data size: keep only the `hourly` snapshots and remove those in between.
   4. (Haven't done this yet.) We may want to downsample further to daily after a longer period. (Our previous batch ingestion pipeline was based on daily Sqoop jobs.)
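
   The "keep hourly snapshots only" selection in step 3 could be sketched as plain Scala over (snapshot id, committed-at) pairs. This is a hypothetical helper, not anything from the Iceberg API: it buckets snapshots by hour of their commit timestamp, keeps the latest snapshot in each bucket, and returns the ids of everything else as expiration candidates to feed into `expireSnapshotId`.
   ```scala
   // Downsample to hourly: given (snapshotId, committedAtMillis) pairs,
   // return the ids of all snapshots except the latest one in each hour.
   def snapshotsToExpire(snapshots: Seq[(Long, Long)]): Seq[Long] = {
     val hourMillis = 60L * 60 * 1000
     snapshots
       .groupBy { case (_, committedAt) => committedAt / hourMillis } // bucket by hour
       .values
       .flatMap { bucket =>
         val keep = bucket.maxBy(_._2)._1 // latest snapshot in this hour survives
         bucket.collect { case (id, _) if id != keep => id }
       }
       .toSeq
   }
   ```
   The returned ids would then be passed to `table.expireSnapshots.expireSnapshotId(id).commit()` as in the snippet above.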


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org