Posted to dev@iceberg.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/07/19 21:00:21 UTC

[DISCUSS] Write-audit-publish support

Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data,
then audit the result before publishing the data that was written to a
final table. We call this WAP for write, audit, publish.

We’ve added support in our Iceberg branch. A WAP write creates a new table
snapshot, but doesn’t make that snapshot the current version of the table.
Instead, a separate process audits the new snapshot and updates the table’s
current snapshot when the audits succeed. I wasn’t sure that this would be
useful anywhere else until we talked to another company this week that is
interested in the same thing. So I wanted to check whether this is a good
feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected,
but Iceberg detects that it should not update the table’s current state.
That happens when there is a Spark property, spark.wap.id, that indicates
the job is a WAP job. Then any table that has WAP enabled by the table
property write.wap.enabled=true will stage the new snapshot instead of
fully committing, with the WAP ID in the snapshot’s metadata.
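The staging behavior described above can be sketched with a small toy model. This is an illustration only, not the real Iceberg API: the `Table` class, its `write` method, and the snapshot dictionaries are invented for the sketch; only the property names `spark.wap.id` and `write.wap.enabled` and the `wap.id` summary key come from the thread.

```python
# Toy model of the WAP staging behavior -- not the real Iceberg API.
# Only the property names (write.wap.enabled, wap.id) come from the thread.

class Table:
    def __init__(self):
        self.properties = {}        # e.g. {"write.wap.enabled": "true"}
        self.snapshots = []         # all snapshots, staged or committed
        self.current_snapshot = None

    def write(self, data, wap_id=None):
        """Create a new snapshot; stage it instead of committing when a
        WAP ID is present and the table has WAP enabled."""
        snapshot = {"data": data, "summary": {}}
        self.snapshots.append(snapshot)
        if wap_id is not None and self.properties.get("write.wap.enabled") == "true":
            # Record the WAP ID so the audit process can find the snapshot;
            # the current snapshot pointer is left unchanged.
            snapshot["summary"]["wap.id"] = wap_id
        else:
            self.current_snapshot = snapshot   # normal commit

table = Table()
table.properties["write.wap.enabled"] = "true"
table.write(["row1", "row2"], wap_id="job-123")  # simulates spark.wap.id being set
assert table.current_snapshot is None            # the write appears to succeed...
assert table.snapshots[0]["summary"]["wap.id"] == "job-123"  # ...but is only staged
```

Without `spark.wap.id` (or with `write.wap.enabled` unset), the same write commits normally and becomes the current snapshot.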

Is this something we should open a PR to add to Iceberg? It seems a little
strange to make it appear that a commit has succeeded, but not actually
change a table, which is why we didn’t submit it before now.

Thanks,

rb
-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Write-audit-publish support

Posted by Miao Wang <mi...@adobe.com.INVALID>.
From a timeline perspective, we can’t work on implementing this feature in the next couple of months. As a short-term workaround, we chose a lock mechanism at the application level.

@Anton Okolnychyi<ma...@apple.com.INVALID> If you can pick up this feature, that would be great!

Thanks!

Miao

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>, "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, November 11, 2019 at 11:54 AM
To: Anton Okolnychyi <ao...@apple.com>
Cc: Iceberg Dev List <de...@iceberg.apache.org>, Ashish Mehta <me...@gmail.com>
Subject: Re: [DISCUSS] Write-audit-publish support

I just had a direct request for this over the weekend, too. I opened #629 Add cherry-pick operation <https://github.com/apache/incubator-iceberg/issues/629> to track this.

On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <ao...@apple.com>> wrote:
We would be interested in this functionality as well. We have a use case with multiple concurrent writers where we wanted to use WAP but couldn’t.


On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID>> wrote:

Right now, there isn't a good way to manage multiple pending writes. Snapshots from each write are created based on the current table state, so simply moving to one of two pending commits would mean you ignore the changes in the other pending commit. We've considered adding a "cherry-pick" operation that can take the changes from one snapshot and apply them on top of another to solve that problem. If you'd like to implement that, I'd be happy to review it!
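The problem with multiple pending writes, and the proposed cherry-pick fix, can be illustrated with a tiny sketch. This is not the Iceberg implementation; it just models each pending snapshot as the set of files it added on top of the table state it was created from (the file names are invented).

```python
# Toy sketch of two pending WAP commits based on the same table state.
# Not the real Iceberg API; snapshots are modeled as sets of added files.

base = {"a.parquet"}                   # current table state
pending1 = {"added": {"b.parquet"}}    # both writes were created from `base`
pending2 = {"added": {"c.parquet"}}

# Simply making pending1 the current state ignores pending2's changes:
state_after_publish1 = base | pending1["added"]
assert "c.parquet" not in state_after_publish1

# A cherry-pick would take pending2's changes and apply them on top of
# the new current state, so neither pending commit is lost:
state_after_cherry_pick = state_after_publish1 | pending2["added"]
assert state_after_cherry_pick == {"a.parquet", "b.parquet", "c.parquet"}
```

The real operation also has to handle deleted files and conflicting changes, which is why it needs explicit support rather than a plain rollback.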

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <me...@gmail.com>> wrote:
Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user can stage multiple WAP snapshots and commit them in any order, depending on how the audit process works out.
I wonder whether this expectation goes against the underlying principles of Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid>> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot as the current table state. Like this:

Table table = hiveCatalog.load(name);
table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <me...@gmail.com>> wrote:
Hi Ryan,

Can you please point me to the doc where I can find how to publish a WAP snapshot? I am able to filter snapshots based on wap.id in the snapshot summary, but I'm unclear on the official recommendation for committing that snapshot. I can think of cherry-picking appended/deleted files, but I don't know whether I'd be missing something important with this.

Thanks,
-Ashish

---------- Forwarded message ---------
From: Ryan Blue <rb...@netflix.com.invalid>>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <ed...@airbnb.com>>
Cc: Iceberg Dev List <de...@iceberg.apache.org>>, Anton Okolnychyi <ao...@apple.com>>

Hi everyone, I've added PR #342 <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg repository with our WAP changes. Please have a look if you were interested in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <ed...@airbnb.com>> wrote:
I think this use case is pretty helpful in most data environments; we do the same sort of stage-check-publish pattern to run quality checks.
One question: if the audit part fails, is there a way to expire the snapshot, or what workflow would follow?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <mo...@gmail.com>> wrote:
This would be super helpful. We have a similar workflow where we do some validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com>> wrote:
This definitely sounds interesting. Quick question: does this have any impact on the current upserts spec, or is this support intended only for append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>> wrote:

Audits run on the snapshot by setting the snapshot-id read option to read the WAP snapshot, even though it is not (yet) the current table state. This is documented in the time travel <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.

We added a stageOnly method to SnapshotProducer that adds the snapshot to table metadata, but does not make it the current table state. That is called by the Spark writer when there is a WAP ID, and that ID is embedded in the staged snapshot’s metadata so processes can find it.
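A sketch of how an audit process might locate a staged snapshot by its WAP ID, per the mechanism described above. This is illustrative only: the snapshot dictionaries and the `find_wap_snapshot` helper are invented for the sketch; only the `wap.id` summary key comes from the thread.

```python
# Toy sketch: locate the staged snapshot whose summary carries the WAP ID.
# Not the real Iceberg API -- snapshots are modeled as plain dictionaries.

snapshots = [
    {"snapshot_id": 1, "summary": {}},                     # a normal commit
    {"snapshot_id": 2, "summary": {"wap.id": "job-123"}},  # staged by a WAP write
]

def find_wap_snapshot(snapshots, wap_id):
    """Return the snapshot staged by the WAP job with the given ID, if any."""
    for snap in snapshots:
        if snap["summary"].get("wap.id") == wap_id:
            return snap
    return None

staged = find_wap_snapshot(snapshots, "job-123")
assert staged["snapshot_id"] == 2
```

The audit would then read that snapshot via the snapshot-id read option and, if the checks pass, make it the current table state.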

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>> wrote:
I would also support adding this to Iceberg itself. I think we have a use case where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton


On 20 Jul 2019, at 04:01, RD <rd...@gmail.com>> wrote:

I think this could be useful. When we ingest data from Kafka, we do a predefined set of checks on the data. We can potentially utilize something like this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>> wrote:

Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing the data that was written to a final table. We call this WAP for write, audit, publish.

We’ve added support in our Iceberg branch. A WAP write creates a new table snapshot, but doesn’t make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table’s current snapshot when the audits succeed. I wasn’t sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table’s current state. That happens when there is a Spark property, spark.wap.id, that indicates the job is a WAP job. Then any table that has WAP enabled by the table property write.wap.enabled=true will stage the new snapshot instead of fully committing, with the WAP ID in the snapshot’s metadata.

Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded, but not actually change a table, which is why we didn’t submit it before now.

Thanks,

rb
--
Ryan Blue
Software Engineer
Netflix




Re: [DISCUSS] Write-audit-publish support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I just had a direct request for this over the weekend, too. I opened #629
Add cherry-pick operation
<https://github.com/apache/incubator-iceberg/issues/629> to track this.


Re: [DISCUSS] Write-audit-publish support

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
We would be interested in this functionality as well. We have a use case with multiple concurrent writers where we wanted to use WAP but couldn’t.


Re: [DISCUSS] Write-audit-publish support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Right now, there isn't a good way to manage multiple pending writes.
Snapshots from each write are created based on the current table state, so
simply moving to one of two pending commits would mean you ignore the
changes in the other pending commit. We've considered adding a
"cherry-pick" operation that can take the changes from one snapshot and
apply them on top of another to solve that problem. If you'd like to
implement that, I'd be happy to review it!


Re: [DISCUSS] Write-audit-publish support

Posted by Ashish Mehta <me...@gmail.com>.
Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user can
stage multiple WAP snapshots and commit them in any order, based on how the
audit process works out?
I wonder whether this expectation goes against the underlying principles of
Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Ashish, you can use the rollback table operation to set a particular
> snapshot as the current table state. Like this:
>
> Table table = hiveCatalog.load(name);
> table.rollback().toSnapshotId(id).commit();
>
>
> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <me...@gmail.com>
> wrote:
>
>> Hi Ryan,
>>
>> Can you please point me to the doc where I can find how to publish a
>> WAP snapshot? I am able to filter the snapshot based on wap.id in the
>> snapshot's summary, but I am unclear on the official recommendation for
>> committing that snapshot. I can think of cherry-picking appended/deleted
>> files, but I don't know whether I'm missing something important with this.
>>
>> Thanks,
>> -Ashish
>>
>>
>>> ---------- Forwarded message ---------
>>> From: Ryan Blue <rb...@netflix.com.invalid>
>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>> To: Edgar Rodriguez <ed...@airbnb.com>
>>> Cc: Iceberg Dev List <de...@iceberg.apache.org>, Anton Okolnychyi <
>>> aokolnychyi@apple.com>
>>>
>>>
>>> Hi everyone, I've added PR #342
>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>> repository with our WAP changes. Please have a look if you were interested
>>> in this.
>>>
>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>> edgar.rodriguez@airbnb.com> wrote:
>>>
>>>> I think this use case is pretty helpful in most data environments, we
>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>> One question is, if say the audit part fails, is there a way to expire
>>>> the snapshot or what would be the workflow that follows?
>>>>
>>>> Best,
>>>> Edgar
>>>>
>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>> moulimukherjee@gmail.com> wrote:
>>>>
>>>>> This would be super helpful. We have a similar workflow where we do
>>>>> some validation before letting the downstream consume the changes.
>>>>>
>>>>> Best,
>>>>> Mouli
>>>>>
>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:
>>>>>
>>>>>> This definitely sounds interesting. Quick question: does this have any
>>>>>> impact on the current Upserts spec? Or are we looking to associate this
>>>>>> support with append-only commits?
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>> to read the WAP snapshot, even though it has not (yet) been the current
>>>>>>> table state. This is documented in the time travel
>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>> Iceberg site.
>>>>>>>
>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>>
>>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>>> aokolnychyi@apple.com> wrote:
>>>>>>>
>>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>>> a use case where we can leverage this.
>>>>>>>>
>>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anton
>>>>>>>>
>>>>>>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I think this could be useful. When we ingest data from Kafka, we do
>>>>>>>> a predefined set of checks on the data. We can potentially utilize
>>>>>>>> something like this to check for sanity before publishing.
>>>>>>>>
>>>>>>>> How is the auditing process supposed to find the new snapshot,
>>>>>>>> since it is not accessible from the table? Is it by convention?
>>>>>>>>
>>>>>>>> -R
>>>>>>>>
>>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <
>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>>>> data, then audit the result before publishing the data that was written to
>>>>>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>>>>>
>>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>>> new table snapshot, but doesn’t make that snapshot the current version of
>>>>>>>>> the table. Instead, a separate process audits the new snapshot and updates
>>>>>>>>> the table’s current snapshot when the audits succeed. I wasn’t sure that
>>>>>>>>> this would be useful anywhere else until we talked to another company this
>>>>>>>>> week that is interested in the same thing. So I wanted to check whether
>>>>>>>>> this is a good feature to include in Iceberg itself.
>>>>>>>>>
>>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>>>>> stage. That happens when there is a Spark property, spark.wap.id,
>>>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>>>>> the table property write.wap.enabled=true will stage the new
>>>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>>>>>>>>> metadata.
>>>>>>>>>
>>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems
>>>>>>>>> a little strange to make it appear that a commit has succeeded, but not
>>>>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Filip Bocse
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Edgar Rodriguez
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Write-audit-publish support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Ashish, you can use the rollback table operation to set a particular
snapshot as the current table state. Like this:

Table table = hiveCatalog.load(name);
table.rollback().toSnapshotId(id).commit();

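The publish step can be sketched as a toy model in plain Python (this is not
the Iceberg API; the dictionary layout and function names are illustrative
only, loosely mirroring snapshot summaries and the rollback call above):

```python
# Toy model of the WAP publish step: locate the staged snapshot that
# carries a given WAP ID in its summary, then make it the table's
# current snapshot. Illustrative only; NOT the Iceberg API.

def find_wap_snapshot(snapshots, wap_id):
    """Return the staged snapshot whose summary carries the given wap.id."""
    for snap in snapshots:
        if snap["summary"].get("wap.id") == wap_id:
            return snap
    raise LookupError(f"no staged snapshot with wap.id={wap_id}")

def publish(table, wap_id):
    """Point the table's current snapshot at the audited WAP snapshot."""
    snap = find_wap_snapshot(table["snapshots"], wap_id)
    table["current-snapshot-id"] = snap["snapshot-id"]
    return snap["snapshot-id"]

table = {
    "current-snapshot-id": 1,
    "snapshots": [
        {"snapshot-id": 1, "summary": {}},
        {"snapshot-id": 2, "summary": {"wap.id": "job-42"}},  # staged, not current
    ],
}
publish(table, "job-42")
print(table["current-snapshot-id"])  # -> 2
```

In the real API the same effect comes from loading the table, scanning its
snapshot metadata for the WAP ID, and committing a rollback to that snapshot ID.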

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <me...@gmail.com>
wrote:

> Hi Ryan,
>
> Can you please point me to the doc where I can find how to publish a WAP
> snapshot? I am able to filter the snapshot based on wap.id in the snapshot's
> summary, but I am unclear on the official recommendation for committing that
> snapshot. I can think of cherry-picking appended/deleted files, but I don't
> know whether I'm missing something important with this.
>
> Thanks,
> -Ashish
>
>
>> ---------- Forwarded message ---------
>> From: Ryan Blue <rb...@netflix.com.invalid>
>> Date: Wed, Jul 31, 2019 at 4:41 PM
>> Subject: Re: [DISCUSS] Write-audit-publish support
>> To: Edgar Rodriguez <ed...@airbnb.com>
>> Cc: Iceberg Dev List <de...@iceberg.apache.org>, Anton Okolnychyi <
>> aokolnychyi@apple.com>
>>
>>
>> Hi everyone, I've added PR #342
>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>> repository with our WAP changes. Please have a look if you were interested
>> in this.
>>
>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>> edgar.rodriguez@airbnb.com> wrote:
>>
>>> I think this use case is pretty helpful in most data environments, we do
>>> the same sort of stage-check-publish pattern to run quality checks.
>>> One question is, if say the audit part fails, is there a way to expire
>>> the snapshot or what would be the workflow that follows?
>>>
>>> Best,
>>> Edgar
>>>
>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>> moulimukherjee@gmail.com> wrote:
>>>
>>>> This would be super helpful. We have a similar workflow where we do
>>>> some validation before letting the downstream consume the changes.
>>>>
>>>> Best,
>>>> Mouli
>>>>
>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:
>>>>
>>>>> This definitely sounds interesting. Quick question: does this have any
>>>>> impact on the current Upserts spec? Or are we looking to associate this
>>>>> support with append-only commits?
>>>>>
>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>>> read the WAP snapshot, even though it has not (yet) been the current table
>>>>>> state. This is documented in the time travel
>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>> Iceberg site.
>>>>>>
>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>
>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>> aokolnychyi@apple.com> wrote:
>>>>>>
>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>> a use case where we can leverage this.
>>>>>>>
>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>>
>>>>>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>>>>>
>>>>>>> I think this could be useful. When we ingest data from Kafka, we do
>>>>>>> a predefined set of checks on the data. We can potentially utilize
>>>>>>> something like this to check for sanity before publishing.
>>>>>>>
>>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>>
>>>>>>> -R
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>>> data, then audit the result before publishing the data that was written to
>>>>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>>>>
>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>> new table snapshot, but doesn’t make that snapshot the current version of
>>>>>>>> the table. Instead, a separate process audits the new snapshot and updates
>>>>>>>> the table’s current snapshot when the audits succeed. I wasn’t sure that
>>>>>>>> this would be useful anywhere else until we talked to another company this
>>>>>>>> week that is interested in the same thing. So I wanted to check whether
>>>>>>>> this is a good feature to include in Iceberg itself.
>>>>>>>>
>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>>>> stage. That happens when there is a Spark property, spark.wap.id,
>>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>>>> the table property write.wap.enabled=true will stage the new
>>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>>>>>>>> metadata.
>>>>>>>>
>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Filip Bocse
>>>>>
>>>>
>>>
>>> --
>>> Edgar Rodriguez
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

[DISCUSS] Write-audit-publish support

Posted by Ashish Mehta <me...@gmail.com>.
Hi Ryan,

Can you please point me to the doc where I can find how to publish a WAP
snapshot? I am able to filter the snapshot based on wap.id in the snapshot's
summary, but I am unclear on the official recommendation for committing that
snapshot. I can think of cherry-picking appended/deleted files, but I don't
know whether I'm missing something important with this.

Thanks,
-Ashish


> ---------- Forwarded message ---------
> From: Ryan Blue <rb...@netflix.com.invalid>
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <ed...@airbnb.com>
> Cc: Iceberg Dev List <de...@iceberg.apache.org>, Anton Okolnychyi <
> aokolnychyi@apple.com>
>
>
> Hi everyone, I've added PR #342
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
> repository with our WAP changes. Please have a look if you were interested
> in this.
>
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
> edgar.rodriguez@airbnb.com> wrote:
>
>> I think this use case is pretty helpful in most data environments, we do
>> the same sort of stage-check-publish pattern to run quality checks.
>> One question is, if say the audit part fails, is there a way to expire
>> the snapshot or what would be the workflow that follows?
>>
>> Best,
>> Edgar
>>
>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <mo...@gmail.com>
>> wrote:
>>
>>> This would be super helpful. We have a similar workflow where we do some
>>> validation before letting the downstream consume the changes.
>>>
>>> Best,
>>> Mouli
>>>
>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:
>>>
>>>> This definitely sounds interesting. Quick question: does this have any
>>>> impact on the current Upserts spec? Or are we looking to associate this
>>>> support with append-only commits?
>>>>
>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>> read the WAP snapshot, even though it has not (yet) been the current table
>>>>> state. This is documented in the time travel
>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>>> site.
>>>>>
>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>
>>>>> I'll add a PR with this code, since there is interest.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>> aokolnychyi@apple.com> wrote:
>>>>>
>>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>>> use case where we can leverage this.
>>>>>>
>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>>>>
>>>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>>>> predefined set of checks on the data. We can potentially utilize something
>>>>>> like this to check for sanity before publishing.
>>>>>>
>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>
>>>>>> -R
>>>>>>
>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>> data, then audit the result before publishing the data that was written to
>>>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>>>
>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>>>>> table snapshot, but doesn’t make that snapshot the current version of the
>>>>>>> table. Instead, a separate process audits the new snapshot and updates the
>>>>>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>>>>>> would be useful anywhere else until we talked to another company this week
>>>>>>> that is interested in the same thing. So I wanted to check whether this is
>>>>>>> a good feature to include in Iceberg itself.
>>>>>>>
>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>>> stage. That happens when there is a Spark property, spark.wap.id,
>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>>> the table property write.wap.enabled=true will stage the new
>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>>>>>>> metadata.
>>>>>>>
>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>> --
>>>> Filip Bocse
>>>>
>>>
>>
>> --
>> Edgar Rodriguez
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Write-audit-publish support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi everyone, I've added PR #342
<https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
repository with our WAP changes. Please have a look if you were interested
in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <ed...@airbnb.com>
wrote:

> I think this use case is pretty helpful in most data environments, we do
> the same sort of stage-check-publish pattern to run quality checks.
> One question is, if say the audit part fails, is there a way to expire the
> snapshot or what would be the workflow that follows?
>
> Best,
> Edgar
>
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <mo...@gmail.com>
> wrote:
>
>> This would be super helpful. We have a similar workflow where we do some
>> validation before letting the downstream consume the changes.
>>
>> Best,
>> Mouli
>>
>> On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:
>>
>>> This definitely sounds interesting. Quick question: does this have any
>>> impact on the current Upserts spec? Or are we looking to associate this
>>> support with append-only commits?
>>>
>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>> read the WAP snapshot, even though it has not (yet) been the current table
>>>> state. This is documented in the time travel
>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>> site.
>>>>
>>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>>> to table metadata, but does not make it the current table state. That is
>>>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>>>> in the staged snapshot’s metadata so processes can find it.
>>>>
>>>> I'll add a PR with this code, since there is interest.
>>>>
>>>> rb
>>>>
>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>
>>>> wrote:
>>>>
>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>> use case where we can leverage this.
>>>>>
>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>>>
>>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>>> predefined set of checks on the data. We can potentially utilize something
>>>>> like this to check for sanity before publishing.
>>>>>
>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>> it is not accessible from the table? Is it by convention?
>>>>>
>>>>> -R
>>>>>
>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>> data, then audit the result before publishing the data that was written to
>>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>>
>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>>>> table snapshot, but doesn’t make that snapshot the current version of the
>>>>>> table. Instead, a separate process audits the new snapshot and updates the
>>>>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>>>>> would be useful anywhere else until we talked to another company this week
>>>>>> that is interested in the same thing. So I wanted to check whether this is
>>>>>> a good feature to include in Iceberg itself.
>>>>>>
>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>> stage. That happens when there is a Spark property, spark.wap.id,
>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>> the table property write.wap.enabled=true will stage the new
>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>>>>>> metadata.
>>>>>>
>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Filip Bocse
>>>
>>
>
> --
> Edgar Rodriguez
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Write-audit-publish support

Posted by Edgar Rodriguez <ed...@airbnb.com.INVALID>.
I think this use case is pretty helpful in most data environments, we do
the same sort of stage-check-publish pattern to run quality checks.
One question is, if say the audit part fails, is there a way to expire the
snapshot or what would be the workflow that follows?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <mo...@gmail.com>
wrote:

> This would be super helpful. We have a similar workflow where we do some
> validation before letting the downstream consume the changes.
>
> Best,
> Mouli
>
> On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:
>
>> This definitely sounds interesting. Quick question: does this have any
>> impact on the current Upserts spec? Or are we looking to associate this
>> support with append-only commits?
>>
>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Audits run on the snapshot by setting the snapshot-id read option to
>>> read the WAP snapshot, even though it has not (yet) been the current table
>>> state. This is documented in the time travel
>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>> site.
>>>
>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>> to table metadata, but does not make it the current table state. That is
>>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>>> in the staged snapshot’s metadata so processes can find it.
>>>
>>> I'll add a PR with this code, since there is interest.
>>>
>>> rb
>>>
>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>
>>> wrote:
>>>
>>>> I would also support adding this to Iceberg itself. I think we have a
>>>> use case where we can leverage this.
>>>>
>>>> @Ryan, could you also provide more info on the audit process?
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>>
>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>> predefined set of checks on the data. We can potentially utilize something
>>>> like this to check for sanity before publishing.
>>>>
>>>> How is the auditing process supposed to find the new snapshot, since it
>>>> is not accessible from the table? Is it by convention?
>>>>
>>>> -R
>>>>
>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>> data, then audit the result before publishing the data that was written to
>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>
>>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>>> table snapshot, but doesn’t make that snapshot the current version of the
>>>>> table. Instead, a separate process audits the new snapshot and updates the
>>>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>>>> would be useful anywhere else until we talked to another company this week
>>>>> that is interested in the same thing. So I wanted to check whether this is
>>>>> a good feature to include in Iceberg itself.
>>>>>
>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>> stage. That happens when there is a Spark property, spark.wap.id,
>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>> the table property write.wap.enabled=true will stage the new snapshot
>>>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>>>
>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Filip Bocse
>>
>

-- 
Edgar Rodriguez

Re: [DISCUSS] Write-audit-publish support

Posted by Mouli Mukherjee <mo...@gmail.com>.
This would be super helpful. We have a similar workflow where we do some
validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <fi...@gmail.com> wrote:

> This definitely sounds interesting. Quick question: does this have any
> impact on the current Upserts spec? Or are we looking to associate this
> support with append-only commits?
>
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Audits run on the snapshot by setting the snapshot-id read option to
>> read the WAP snapshot, even though it has not (yet) been the current table
>> state. This is documented in the time travel
>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>> site.
>>
>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>> to table metadata, but does not make it the current table state. That is
>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>> in the staged snapshot’s metadata so processes can find it.
>>
>> I'll add a PR with this code, since there is interest.
>>
>> rb
>>
>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>
>> wrote:
>>
>>> I would also support adding this to Iceberg itself. I think we have a
>>> use case where we can leverage this.
>>>
>>> @Ryan, could you also provide more info on the audit process?
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>>
>>> I think this could be useful. When we ingest data from Kafka, we do a
>>> predefined set of checks on the data. We can potentially utilize something
>>> like this to check for sanity before publishing.
>>>
>>> How is the auditing process supposed to find the new snapshot, since it
>>> is not accessible from the table? Is it by convention?
>>>
>>> -R
>>>
>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>> data, then audit the result before publishing the data that was written to
>>>> a final table. We call this WAP for write, audit, publish.
>>>>
>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>> table snapshot, but doesn’t make that snapshot the current version of the
>>>> table. Instead, a separate process audits the new snapshot and updates the
>>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>>> would be useful anywhere else until we talked to another company this week
>>>> that is interested in the same thing. So I wanted to check whether this is
>>>> a good feature to include in Iceberg itself.
>>>>
>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>> expected, but Iceberg detects that it should not update the table’s current
>>>> stage. That happens when there is a Spark property, spark.wap.id, that
>>>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>>>> table property write.wap.enabled=true will stage the new snapshot
>>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>>
>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>> little strange to make it appear that a commit has succeeded, but not
>>>> actually change a table, which is why we didn’t submit it before now.
>>>>
>>>> Thanks,
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Filip Bocse
>

Re: [DISCUSS] Write-audit-publish support

Posted by Filip <fi...@gmail.com>.
This definitely sounds interesting. Quick question: does this have any impact
on the current Upserts spec? Or are we looking to associate this support with
append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Audits run on the snapshot by setting the snapshot-id read option to read
> the WAP snapshot, even though it has not (yet) been the current table
> state. This is documented in the time travel
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
> site.
>
> We added a stageOnly method to SnapshotProducer that adds the snapshot to
> table metadata, but does not make it the current table state. That is
> called by the Spark writer when there is a WAP ID, and that ID is embedded
> in the staged snapshot’s metadata so processes can find it.
>
> I'll add a PR with this code, since there is interest.
>
> rb
>
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>
> wrote:
>
>> I would also support adding this to Iceberg itself. I think we have a use
>> case where we can leverage this.
>>
>> @Ryan, could you also provide more info on the audit process?
>>
>> Thanks,
>> Anton
>>
>> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>>
>> I think this could be useful. When we ingest data from Kafka, we do a
>> predefined set of checks on the data. We can potentially utilize something
>> like this to check for sanity before publishing.
>>
>> How is the auditing process supposed to find the new snapshot, since it
>> is not accessible from the table? Is it by convention?
>>
>> -R
>>
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> At Netflix, we have a pattern for building ETL jobs where we write data,
>>> then audit the result before publishing the data that was written to a
>>> final table. We call this WAP for write, audit, publish.
>>>
>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>> table snapshot, but doesn’t make that snapshot the current version of the
>>> table. Instead, a separate process audits the new snapshot and updates the
>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>> would be useful anywhere else until we talked to another company this week
>>> that is interested in the same thing. So I wanted to check whether this is
>>> a good feature to include in Iceberg itself.
>>>
>>> This works by staging a snapshot. Basically, Spark writes data as
>>> expected, but Iceberg detects that it should not update the table’s current
>>> state. That happens when there is a Spark property, spark.wap.id, that
>>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>>> table property write.wap.enabled=true will stage the new snapshot
>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>
>>> Is this something we should open a PR to add to Iceberg? It seems a
>>> little strange to make it appear that a commit has succeeded, but not
>>> actually change a table, which is why we didn’t submit it before now.
>>>
>>> Thanks,
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Filip Bocse

Re: [DISCUSS] Write-audit-publish support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Audits run on the snapshot by setting the snapshot-id read option to read
the WAP snapshot, even though it is not (yet) the current table
state. This is documented in the time travel
<http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
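To illustrate why a pinned snapshot-id read lets audits see staged data before readers do, here is a toy model. This is purely illustrative, not Iceberg's actual code: the real mechanism is Spark's snapshot-id read option, and the snapshot IDs and file names below are made up.

```python
# Toy model: each snapshot lists the data files visible at that snapshot.
# A read pinned to a snapshot ID sees that snapshot's files, even when the
# table's current pointer has not moved yet.
snapshots = [
    {"snapshot_id": 100, "files": ["a.parquet"]},
    {"snapshot_id": 200, "files": ["a.parquet", "b.parquet"]},  # staged WAP snapshot
]
current_snapshot_id = 100  # the table still points at snapshot 100

def read(snapshot_id=None):
    """Read the pinned snapshot if given, otherwise the current table state."""
    sid = snapshot_id if snapshot_id is not None else current_snapshot_id
    snap = next(s for s in snapshots if s["snapshot_id"] == sid)
    return snap["files"]

assert read() == ["a.parquet"]                               # normal readers: current state
assert read(snapshot_id=200) == ["a.parquet", "b.parquet"]   # audit: pinned staged snapshot
```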

We added a stageOnly method to SnapshotProducer that adds the snapshot to
table metadata, but does not make it the current table state. That is
called by the Spark writer when there is a WAP ID, and that ID is embedded
in the staged snapshot’s metadata so processes can find it.
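A minimal sketch of the stage/publish flow described above, under stated assumptions: the class and method names (TableMetadata, stage_only, publish) and the "wap.id" summary key are hypothetical stand-ins, not Iceberg's real API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Snapshot:
    snapshot_id: int
    summary: dict = field(default_factory=dict)  # carries e.g. {"wap.id": "job-123"}

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def commit(self, snapshot):
        """Normal commit: record the snapshot and make it current."""
        self.snapshots.append(snapshot)
        self.current_snapshot_id = snapshot.snapshot_id

    def stage_only(self, snapshot):
        """WAP commit: record the snapshot without moving the current pointer."""
        self.snapshots.append(snapshot)

    def find_staged(self, wap_id):
        """The audit process locates a staged snapshot by its WAP ID."""
        for s in self.snapshots:
            if s.summary.get("wap.id") == wap_id:
                return s
        return None

    def publish(self, wap_id):
        """After audits pass, make the staged snapshot the current state."""
        staged = self.find_staged(wap_id)
        if staged is None:
            raise ValueError("no staged snapshot for WAP ID %s" % wap_id)
        self.current_snapshot_id = staged.snapshot_id

table = TableMetadata()
table.commit(Snapshot(1))
table.stage_only(Snapshot(2, {"wap.id": "job-123"}))
assert table.current_snapshot_id == 1  # readers still see snapshot 1
table.publish("job-123")
assert table.current_snapshot_id == 2  # now published
```

The point of the sketch is that staging and publishing are two separate metadata operations, so the audit can run in between without readers ever seeing unverified data.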

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <ao...@apple.com>
wrote:

> I would also support adding this to Iceberg itself. I think we have a use
> case where we can leverage this.
>
> @Ryan, could you also provide more info on the audit process?
>
> Thanks,
> Anton
>
> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
>
> I think this could be useful. When we ingest data from Kafka, we do a
> predefined set of checks on the data. We can potentially utilize something
> like this to check for sanity before publishing.
>
> How is the auditing process supposed to find the new snapshot, since it is
> not accessible from the table? Is it by convention?
>
> -R
>
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi everyone,
>>
>> At Netflix, we have a pattern for building ETL jobs where we write data,
>> then audit the result before publishing the data that was written to a
>> final table. We call this WAP for write, audit, publish.
>>
>> We’ve added support in our Iceberg branch. A WAP write creates a new
>> table snapshot, but doesn’t make that snapshot the current version of the
>> table. Instead, a separate process audits the new snapshot and updates the
>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>> would be useful anywhere else until we talked to another company this week
>> that is interested in the same thing. So I wanted to check whether this is
>> a good feature to include in Iceberg itself.
>>
>> This works by staging a snapshot. Basically, Spark writes data as
>> expected, but Iceberg detects that it should not update the table’s current
>> state. That happens when there is a Spark property, spark.wap.id, that
>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>> table property write.wap.enabled=true will stage the new snapshot
>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>
>> Is this something we should open a PR to add to Iceberg? It seems a
>> little strange to make it appear that a commit has succeeded, but not
>> actually change a table, which is why we didn’t submit it before now.
>>
>> Thanks,
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Write-audit-publish support

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
I would also support adding this to Iceberg itself. I think we have a use case where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton

> On 20 Jul 2019, at 04:01, RD <rd...@gmail.com> wrote:
> 
> I think this could be useful. When we ingest data from Kafka, we do a predefined set of checks on the data. We can potentially utilize something like this to check for sanity before publishing.  
> 
> How is the auditing process supposed to find the new snapshot, since it is not accessible from the table? Is it by convention?
> 
> -R 
> 
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Hi everyone,
> 
> At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing the data that was written to a final table. We call this WAP for write, audit, publish.
> 
> We’ve added support in our Iceberg branch. A WAP write creates a new table snapshot, but doesn’t make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table’s current snapshot when the audits succeed. I wasn’t sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.
> 
> This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table’s current state. That happens when there is a Spark property, spark.wap.id, that indicates the job is a WAP job. Then any table that has WAP enabled by the table property write.wap.enabled=true will stage the new snapshot instead of fully committing, with the WAP ID in the snapshot’s metadata.
> 
> Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded, but not actually change a table, which is why we didn’t submit it before now.
> 
> Thanks,
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [DISCUSS] Write-audit-publish support

Posted by RD <rd...@gmail.com>.
I think this could be useful. When we ingest data from Kafka, we do a
predefined set of checks on the data. We can potentially utilize something
like this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is
not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi everyone,
>
> At Netflix, we have a pattern for building ETL jobs where we write data,
> then audit the result before publishing the data that was written to a
> final table. We call this WAP for write, audit, publish.
>
> We’ve added support in our Iceberg branch. A WAP write creates a new table
> snapshot, but doesn’t make that snapshot the current version of the table.
> Instead, a separate process audits the new snapshot and updates the table’s
> current snapshot when the audits succeed. I wasn’t sure that this would be
> useful anywhere else until we talked to another company this week that is
> interested in the same thing. So I wanted to check whether this is a good
> feature to include in Iceberg itself.
>
> This works by staging a snapshot. Basically, Spark writes data as
> expected, but Iceberg detects that it should not update the table’s current
> state. That happens when there is a Spark property, spark.wap.id, that
> indicates the job is a WAP job. Then any table that has WAP enabled by the
> table property write.wap.enabled=true will stage the new snapshot instead
> of fully committing, with the WAP ID in the snapshot’s metadata.
>
> Is this something we should open a PR to add to Iceberg? It seems a little
> strange to make it appear that a commit has succeeded, but not actually
> change a table, which is why we didn’t submit it before now.
>
> Thanks,
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>