Posted to dev@iceberg.apache.org by "timmycheng (程力)" <ti...@tencent.com> on 2019/07/18 01:56:21 UTC

Re: Getting delta of data changes between 2 Snapshots(Internet mail)

Like having a system table to store the in-use snapshot? Isn't this incremental processing much like incremental pull in Hudi?

-Li

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>, "rblue@netflix.com" <rb...@netflix.com>
Date: Thursday, July 18, 2019, 3:55 AM
To: RD <rd...@gmail.com>
Cc: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Getting delta of data changes between 2 Snapshots(Internet mail)

I think it would be helpful to have a pattern for incremental processing. Iceberg can give you the data files that were added or deleted in a snapshot, but there isn't a good way to take those and actually read them as a DataFrame or select that data from a table in SQL. I'd think that's a good first step, but I'd also like to work on other concerns, like rollback and reprocessing. It would also be nice if such a system had a way to block snapshot expiration until all downstream incremental processes have completed.
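(Not from the thread, just to make the first point concrete: a minimal sketch of listing the files a snapshot added or removed via the Iceberg Java API. The table location is a placeholder, a Hadoop-tables layout is assumed, and method names may differ across Iceberg versions.)

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.hadoop.HadoopTables

// Load the table and pick the snapshot of interest (location and ID are placeholders)
val table = new HadoopTables(new Configuration()).load("hdfs://nn/warehouse/db/table")
val snapshot = table.snapshot(6536733823181975045L)

// Data files written in this snapshot
snapshot.addedFiles().forEach(f => println(s"added: ${f.path}"))
// Data files removed in this snapshot (e.g. by a rewrite or delete)
snapshot.deletedFiles().forEach(f => println(s"deleted: ${f.path}"))

The gap described above is that once you have these DataFile entries, there is no convenient way to read just those files back as a DataFrame or query them from SQL.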

rb

On Wed, Jul 17, 2019 at 12:46 PM RD <rd...@gmail.com> wrote:
Hi Iceberg devs,
   We are starting work on a somewhat similar project. The idea is that users can ask for incremental data since the last snapshot they processed, i.e. the delta that was added after that snapshot. Do you think this could be a general feature that would be beneficial to Iceberg?

-R

On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

You can do this using time-travel. First, read the table at each snapshot. This registers a temporary view for each snapshot:

// create temp views for each snapshot
spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L).load("db.table").createOrReplaceTempView("s1")
spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L).load("db.table").createOrReplaceTempView("s2")

Next, use a left-outer join and a filter to find the rows that are only in the second snapshot:

SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id WHERE s1.id IS NULL

Note that this is an expensive operation if you don’t also filter the snapshots. You probably want to do this only with a small table, or use filters to cut down on the data that is loaded and might match. For example, if you’re writing events as they come in from Kafka, you can filter each snapshot down to the last day of data, then run the left outer join, as in the sketch below.
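As a concrete (hypothetical) version of that suggestion: assuming the table has an event-time column named event_ts and a join key named id (both made up for illustration), each snapshot view can be restricted to the last day of data before joining:

import org.apache.spark.sql.functions.{col, current_date, date_sub}

// register each snapshot filtered down to roughly the last day of events
spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L).load("db.table")
  .filter(col("event_ts") >= date_sub(current_date(), 1))
  .createOrReplaceTempView("s1")

spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L).load("db.table")
  .filter(col("event_ts") >= date_sub(current_date(), 1))
  .createOrReplaceTempView("s2")

With the views filtered this way, the left outer join above only has to scan about a day of data from each snapshot instead of the whole table.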

On Wed, Jul 17, 2019 at 9:50 AM aa bb <ab...@yahoo.com.invalid> wrote:
Hi,

Could you please advise how we can get the delta of data changes (a diff) between two snapshots?

Is there any way to provide two snapshot IDs (8924558786060583479, 6536733823181975045) and get the records that were added after 8924558786060583479?

+-------------------------+---------------------+---------------------+---------------------+
| made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
+-------------------------+---------------------+---------------------+---------------------+
| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
+-------------------------+---------------------+---------------------+---------------------+
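(Side note, not part of the original question: a snapshot listing like the one above can be produced from Iceberg's history metadata table, assuming a Spark setup that exposes metadata tables, for example:)

spark.read.format("iceberg").load("db.table.history").show(truncate = false)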



Thanks,


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix