You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Lakshmi Rao <gl...@gmail.com> on 2019/11/15 19:53:20 UTC

Flink Iceberg sink pointers?

Hi,

I'm working on building a POC of streaming data with Flink to Iceberg for a
hackathon project. I know this issue is still open
https://github.com/apache/incubator-iceberg/issues/567 . I'm pretty excited
for the work mentioned in the issue to be open sourced and would be happy
to contribute to any tasks or tickets related to this issue!

However, in the meantime, I'd to get a simple working version for a
flink-iceberg sink and generally explore Iceberg more. Any pointers on how
to get started? I saw this PR that enabled sinking to Iceberg with spark
structured streaming: https://github.com/apache/incubator-iceberg/pull/228 Are
there any other pointers the community can provide?

Thanks
Lakshmi

Re: Flink Iceberg sink pointers?

Posted by Lakshmi Rao <gl...@gmail.com>.
Thanks for the response! The outline is very helpful. I'll use this as a
starting point.

-
Lakshmi

On Fri, Nov 15, 2019 at 12:49 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Hi Lakshmi,
>
> +Steven Wu <st...@netflix.com>, who wrote our Flink sink.
>
> I can tell you a bit about how our sink works, which will hopefully help.
> Ours accumulates data in data files until a snapshot. When that happens,
> each writer closes its open data files and sends a DataFile instance to the
> next stage, which is a single commit task responsible for committing all
> the data files to the Iceberg table. When the commit task gets the
> notification from each writer, it prepares a commit by writing a new
> manifest of all the data files. Then it stages that commit information in
> the checkpoint. When the checkpoint succeeds, the committer commits to the
> Iceberg table. If the Iceberg commit fails, the new manifests will stack up
> and no data is lost. When the committer is running the Iceberg commit, it
> checks what previous checkpoints have already been committed in recent
> Iceberg snapshots (using an ID from each flink snapshot stored in the
> Iceberg summary) for exactly-once commits.
>
> Steven can probably explain it better, but that's a rough outline.
>
> rb
>
> On Fri, Nov 15, 2019 at 11:53 AM Lakshmi Rao <gl...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm working on building a POC of streaming data with Flink to Iceberg for
>> a hackathon project. I know this issue is still open
>> https://github.com/apache/incubator-iceberg/issues/567 . I'm pretty
>> excited for the work mentioned in the issue to be open sourced and would be
>> happy to contribute to any tasks or tickets related to this issue!
>>
>> However, in the meantime, I'd to get a simple working version for a
>> flink-iceberg sink and generally explore Iceberg more. Any pointers on how
>> to get started? I saw this PR that enabled sinking to Iceberg with spark
>> structured streaming:
>> https://github.com/apache/incubator-iceberg/pull/228 Are there any other
>> pointers the community can provide?
>>
>> Thanks
>> Lakshmi
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Lakshmi
Graduate Student
University of Illinois Urbana-Champaign

Re: Flink Iceberg sink pointers?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Lakshmi,

+Steven Wu <st...@netflix.com>, who wrote our Flink sink.

I can tell you a bit about how our sink works, which will hopefully help.
Ours accumulates data in data files until a snapshot. When that happens,
each writer closes its open data files and sends a DataFile instance to the
next stage, which is a single commit task responsible for committing all
the data files to the Iceberg table. When the commit task gets the
notification from each writer, it prepares a commit by writing a new
manifest of all the data files. Then it stages that commit information in
the checkpoint. When the checkpoint succeeds, the committer commits to the
Iceberg table. If the Iceberg commit fails, the new manifests will stack up
and no data is lost. When the committer is running the Iceberg commit, it
checks what previous checkpoints have already been committed in recent
Iceberg snapshots (using an ID from each flink snapshot stored in the
Iceberg summary) for exactly-once commits.

Steven can probably explain it better, but that's a rough outline.

rb

On Fri, Nov 15, 2019 at 11:53 AM Lakshmi Rao <gl...@gmail.com> wrote:

> Hi,
>
> I'm working on building a POC of streaming data with Flink to Iceberg for
> a hackathon project. I know this issue is still open
> https://github.com/apache/incubator-iceberg/issues/567 . I'm pretty
> excited for the work mentioned in the issue to be open sourced and would be
> happy to contribute to any tasks or tickets related to this issue!
>
> However, in the meantime, I'd to get a simple working version for a
> flink-iceberg sink and generally explore Iceberg more. Any pointers on how
> to get started? I saw this PR that enabled sinking to Iceberg with spark
> structured streaming: https://github.com/apache/incubator-iceberg/pull/228 Are
> there any other pointers the community can provide?
>
> Thanks
> Lakshmi
>


-- 
Ryan Blue
Software Engineer
Netflix