Posted to dev@iceberg.apache.org by 1 <li...@126.com> on 2021/08/13 09:40:47 UTC

Identify watermark in the iceberg table properties

Hi, all:


  I need to embed an Iceberg table, treated as a real-time table, into our workflow. That is to say, Flink writes data into the Iceberg table in real time, and I need something to indicate data completeness on the ingestion path, so that downstream batch consumer jobs can be triggered when data is complete for a window (e.g. hourly). Like https://github.com/apache/iceberg/pull/2109.


  I see that Netflix has done related work on this. Is there any doc or patch for the implementation?


  Thx


liubo07199

Re: Identify watermark in the iceberg table properties

Posted by Steven Wu <st...@gmail.com>.
I am also in favor of supporting publishing the Flink watermark as snapshot
metadata in the Flink sink, via the committer operator. `IcebergFilesCommitter`
can override `public void processWatermark(Watermark mark)` to
intercept the latest watermark value.
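
A minimal sketch of that interception (not the actual `IcebergFilesCommitter`
code; the `flink.watermark` summary property name and the
`commitPendingFiles` helper are hypothetical, for illustration only):

import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.Table;

public class WatermarkAwareCommitter extends AbstractStreamOperator<Void> {

  private long currentWatermark = Long.MIN_VALUE;

  @Override
  public void processWatermark(Watermark mark) throws Exception {
    // Remember the latest watermark that reached the committer.
    this.currentWatermark = mark.getTimestamp();
    super.processWatermark(mark);
  }

  // Hypothetical commit path: stamp the captured watermark onto the
  // snapshot summary so it is persisted as snapshot metadata.
  private void commitPendingFiles(Table table) {
    AppendFiles append = table.newAppend();
    // ... append.appendFile(...) for each completed data file ...
    append.set("flink.watermark", String.valueOf(currentWatermark));
    append.commit();
  }
}

Since the committer typically runs with parallelism 1 downstream of the
writers, the watermark it sees is already the minimum across the writer
subtasks.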

There was also a recent discussion (FLIP-167) in the Flink community [1] on
this topic. It doesn't affect the Flink Iceberg sink at the moment, as it
addresses two scenarios that don't apply to FlinkSink yet:
- the old `SinkFunction` API. FlinkSink doesn't use the old `SinkFunction`
API.
- the new FLIP-143 Unified Sink API. FlinkSink hasn't moved to the new sink
API yet.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-167%3A+Watermarks+for+Sink+API


Re: Identify watermark in the iceberg table properties

Posted by Ryan Blue <bl...@tabular.io>.
I can comment on how Netflix did this. We added a watermark for arrival
time to each commit, produced from the lowest processing time across
partitions of the incoming data. That was added to the snapshot metadata
for each commit, along with relevant context like the processing region
because we had multiple incoming jobs. Then we had a metadata service that
aggregated the min across all processing regions to produce a watermark for
the table that could be queried and triggered from.
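
A rough sketch of that aggregation over snapshot summaries (the `region` and
`watermark` summary keys are assumptions for illustration; the actual
property names aren't given in this thread):

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class TableWatermark {

  // Compute a table-level watermark: take the latest watermark seen per
  // region, then the min across regions, since data is only complete up
  // to the slowest region.
  public static long tableWatermark(Table table) {
    Map<String, Long> latestPerRegion = new HashMap<>();
    for (Snapshot snapshot : table.snapshots()) {
      Map<String, String> summary = snapshot.summary();
      String region = summary.get("region");
      String watermark = summary.get("watermark");
      if (region != null && watermark != null) {
        latestPerRegion.merge(region, Long.parseLong(watermark), Math::max);
      }
    }
    return latestPerRegion.values().stream()
        .mapToLong(Long::longValue)
        .min()
        .orElse(Long.MIN_VALUE);
  }
}

A scheduler can then trigger the hourly batch consumers once this table
watermark passes the end of the hour.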

I'm not entirely sure how this was implemented in Flink, but it worked well
and I'd support adding the snapshot watermark to the current Flink sink to
enable this pattern.


-- 
Ryan Blue
Tabular

Re: Identify watermark in the iceberg table properties

Posted by Peidian Li <li...@gmail.com>.
+1, we have the same needs and hope for a solution to this problem. Thanks.

Re: Identify watermark in the iceberg table properties

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
+ Sundaram, as he may have some input.

regards,
Anjali.
