Posted to dev@beam.apache.org by Will Baker <wb...@estuary.dev> on 2022/08/29 21:24:18 UTC

Checkpointing on Google Cloud Dataflow Runner

Hello!

I am wondering about using checkpoints with Beam running on Google
Cloud Dataflow.

The docs indicate that checkpoints are not supported by Google Cloud
Dataflow:  https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/

Is there a recommended approach to handling checkpointing on Google
Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
that a pipeline could be resumed from where it left off if it needs to
be stopped or crashes for some reason?

Thanks!
Will Baker

Re: Checkpointing on Google Cloud Dataflow Runner

Posted by Will Baker <wb...@estuary.dev>.
I looked into snapshots and they do seem useful for providing a means
to save state and resume. However, they aren't as seamless as I was
hoping for compared with the automatic checkpointing that is supported
by other runners. It looked like snapshots would be user-initiated and
would pause the pipeline while the snapshot was being created. I can
imagine how this could be set up on an automated schedule, but I would
still prefer something more lightweight, like checkpoints.

On Mon, Aug 29, 2022 at 8:11 PM Reuven Lax <re...@google.com> wrote:
>
> Google Cloud Dataflow does support snapshots. Is this what you were looking for?
>
> On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> Hi Will, David,
>>
>> I think you'll find the best source of answer for this sort of question on the user@beam list. I've put that in the To: line with a BCC: to the dev@beam list so everyone knows they can find the thread there. If I have misunderstood, and your question has to do with building Beam itself, feel free to move it back.
>>
>> Kenn
>>
>> On Mon, Aug 29, 2022 at 2:24 PM Will Baker <wb...@estuary.dev> wrote:
>>>
>>> Hello!
>>>
>>> I am wondering about using checkpoints with Beam running on Google
>>> Cloud Dataflow.
>>>
>>> The docs indicate that checkpoints are not supported by Google Cloud
>>> Dataflow:  https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>>>
>>> Is there a recommended approach to handling checkpointing on Google
>>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
>>> that a pipeline could be resumed from where it left off if it needs to
>>> be stopped or crashes for some reason?
>>>
>>> Thanks!
>>> Will Baker

Re: Checkpointing on Google Cloud Dataflow Runner

Posted by Reuven Lax via user <us...@beam.apache.org>.
Snapshots are expected to happen nearly instantaneously. Although processing
is paused while the snapshot is in progress, the pause should usually be
very brief. It's true that Dataflow does not support automated snapshots -
you would have to create them yourself using a cron job.
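
A minimal sketch of what such a cron-driven snapshot could look like, in
Python. It assumes the gcloud CLI is installed and authenticated, and that
the `gcloud dataflow snapshots create` command and its flags behave as
described in the Dataflow snapshots guide. The function names, the default
TTL, and the job ID and region values are hypothetical placeholders, not
anything taken from this thread.

```python
import shlex
import subprocess


def build_snapshot_command(job_id, region, ttl="7d", include_sources=False):
    """Build the argv list for snapshotting a streaming Dataflow job.

    Assumption: flag names follow the Dataflow snapshots guide. The
    --snapshot-sources flag is documented for Pub/Sub sources, so it is
    off by default here.
    """
    cmd = [
        "gcloud", "dataflow", "snapshots", "create",
        f"--job-id={job_id}",
        f"--region={region}",
        f"--snapshot-ttl={ttl}",
    ]
    if include_sources:
        cmd.append("--snapshot-sources=true")
    return cmd


def create_snapshot(job_id, region, **kwargs):
    """Run the snapshot command; raises CalledProcessError if gcloud fails."""
    cmd = build_snapshot_command(job_id, region, **kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

A scheduled job (cron or similar) could then call `create_snapshot` with the
ID and region of the running job at whatever interval is acceptable for
recovery.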

Checkpoints on Flink aren't simply an automated snapshot mechanism.
Checkpoints are how Flink implements consistent, exactly-once processing.
Dataflow, on the other hand, continuously checkpoints records, so it doesn't
need global checkpoints for exactly-once processing.

Reuven

On Tue, Aug 30, 2022 at 5:10 AM Will Baker <wb...@estuary.dev> wrote:

> I looked into snapshots and they do seem useful for providing a means
> to save state and resume, however they aren't as seamless as I was
> hoping for with the automatic checkpointing that is supported by other
> runners. It looked like snapshots would be user initiated and would
> pause the pipeline while the snapshot was being created. I could
> imagine how this would be set up on an automated schedule, but would
> still prefer something more light-weight like checkpoints.
>
> On Mon, Aug 29, 2022 at 8:11 PM Reuven Lax <re...@google.com> wrote:
> >
> > Google Cloud Dataflow does support snapshots. Is this what you were
> > looking for?
> >
> > On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles <ke...@apache.org> wrote:
> >>
> >> Hi Will, David,
> >>
> >> I think you'll find the best source of answer for this sort of question
> >> on the user@beam list. I've put that in the To: line with a BCC: to the
> >> dev@beam list so everyone knows they can find the thread there. If I have
> >> misunderstood, and your question has to do with building Beam itself, feel
> >> free to move it back.
> >>
> >> Kenn
> >>
> >> On Mon, Aug 29, 2022 at 2:24 PM Will Baker <wb...@estuary.dev> wrote:
> >>>
> >>> Hello!
> >>>
> >>> I am wondering about using checkpoints with Beam running on Google
> >>> Cloud Dataflow.
> >>>
> >>> The docs indicate that checkpoints are not supported by Google Cloud
> >>> Dataflow:
> >>> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
> >>>
> >>> Is there a recommended approach to handling checkpointing on Google
> >>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
> >>> that a pipeline could be resumed from where it left off if it needs to
> >>> be stopped or crashes for some reason?
> >>>
> >>> Thanks!
> >>> Will Baker
>

Re: Checkpointing on Google Cloud Dataflow Runner

Posted by Reuven Lax via user <us...@beam.apache.org>.
Google Cloud Dataflow does support snapshots
<https://cloud.google.com/dataflow/docs/guides/using-snapshots>. Is this
what you were looking for?

On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles <ke...@apache.org> wrote:

> Hi Will, David,
>
> I think you'll find the best source of answer for this sort of question on
> the user@beam list. I've put that in the To: line with a BCC: to the
> dev@beam list so everyone knows they can find the thread there. If I have
> misunderstood, and your question has to do with building Beam itself, feel
> free to move it back.
>
> Kenn
>
> On Mon, Aug 29, 2022 at 2:24 PM Will Baker <wb...@estuary.dev> wrote:
>
>> Hello!
>>
>> I am wondering about using checkpoints with Beam running on Google
>> Cloud Dataflow.
>>
>> The docs indicate that checkpoints are not supported by Google Cloud
>> Dataflow:
>> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>>
>> Is there a recommended approach to handling checkpointing on Google
>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
>> that a pipeline could be resumed from where it left off if it needs to
>> be stopped or crashes for some reason?
>>
>> Thanks!
>> Will Baker
>>
>

Re: Checkpointing on Google Cloud Dataflow Runner

Posted by Kenneth Knowles <ke...@apache.org>.
Hi Will, David,

I think you'll find the best source of answers for this sort of question on
the user@beam list. I've put that in the To: line with a BCC: to the
dev@beam list so everyone knows they can find the thread there. If I have
misunderstood, and your question has to do with building Beam itself, feel
free to move it back.

Kenn

On Mon, Aug 29, 2022 at 2:24 PM Will Baker <wb...@estuary.dev> wrote:

> Hello!
>
> I am wondering about using checkpoints with Beam running on Google
> Cloud Dataflow.
>
> The docs indicate that checkpoints are not supported by Google Cloud
> Dataflow:
> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>
> Is there a recommended approach to handling checkpointing on Google
> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
> that a pipeline could be resumed from where it left off if it needs to
> be stopped or crashes for some reason?
>
> Thanks!
> Will Baker
>
