Posted to user@beam.apache.org by Juan Romero <js...@gmail.com> on 2023/07/14 14:24:43 UTC

Disaster recovery in Dataflow with Kafka connector

Hello, I'm trying to identify the workarounds for different error scenarios
when reading from Kafka in Apache Beam on Google Dataflow.
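
For context, a minimal sketch of the kind of KafkaIO read I mean (I'm on the
Java SDK; broker address and topic below are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")   // placeholder broker
        .withTopic("events")                     // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata());
    // ... downstream transforms ...

    p.run();
  }
}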

1) [1] says "Dataflow also does have in-place pipeline update that
restores the persisted checkpoints from one pipeline to another". I read this
as: the Kafka read checkpoints from one job can only be reused by another job
if that job is launched as an update of the previous one. Can you please
confirm this is correct? (i.e. checkpointing is handled automatically when we
update a Google Dataflow job, so the checkpoints of the previous job are
propagated to the updated job). A sketch of the update launch I mean is below.
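
To be concrete, this is what I mean by "update operation": launching the
replacement pipeline with the same job name and the update flag set. Project,
region and job name are placeholders, and I'm assuming the programmatic
setters here mirror the --update / --jobName command-line options:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateLaunchSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-project");       // placeholder project
    options.setRegion("us-central1");       // placeholder region
    options.setJobName("kafka-pipeline");   // must match the job being replaced
    options.setUpdate(true);                // same as passing --update

    Pipeline p = Pipeline.create(options);
    // ... same (or compatible) KafkaIO read and transforms as the running job ...
    p.run();
  }
}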

2) How can we handle the extreme situation where, for some reason, the job
cannot be updated and is shut down instead (for example an undesired
cancellation of the Dataflow job: a cancel, not a drain)? On cancel, Dataflow
starts shutting down the workers, and the elements sitting in the PCollections
at that exact moment are messages that will not be processed, even though they
have already been read from Kafka and committed.
Given that scenario, how can we deploy a new job that knows the state of the
cancelled job (we cannot use the --update option because the job has been
cancelled and is no longer running) and can determine which messages it has to
pull from Kafka, which messages were fully processed (so they are not
processed again, to avoid duplicates), and which messages were pulled by the
previous, cancelled job but not yet processed, so they still need to be
processed? I don't see how we can point a new job at a checkpoint from a
previous job. The only Kafka-side mechanism I can think of is sketched below.
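
What I have in mind is committing offsets back to the consumer group with
commitOffsetsInFinalize(), so that a brand-new job using the same group.id
resumes from the last committed offset. As far as I understand this only gives
at-least-once behaviour (re-reads are still possible), which is exactly the
duplicate problem above. A rough sketch, with broker, topic and consumer group
as placeholders:

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Map;

public class ResumeFromCommittedOffsetsSketch {
  // Read that commits offsets back to Kafka when Dataflow finalizes a checkpoint.
  // A brand-new job (not an --update) with the same group.id should then resume
  // from the last committed offset, at the cost of possible re-reads.
  static KafkaIO.Read<String, String> kafkaRead() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")     // placeholder broker
        .withTopic("events")                       // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(Map.of(
            "group.id", "my-consumer-group",       // placeholder consumer group
            "auto.offset.reset", "earliest"))      // start point if no committed offset exists
        .commitOffsetsInFinalize();
  }
}

Is this the intended recovery path, and if so, how are the in-flight but
uncommitted messages accounted for?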


Thanks for the help.

[1] https://www.mail-archive.com/dev@beam.apache.org/msg23646.html