
Posted to user@flink.apache.org by Basil Bibi <ba...@humn.ai> on 2022/01/06 15:23:15 UTC

Job stuck in savePoint - entire topic replayed on restart.

Hi,
We experienced a problem in production during a release.
Our application is deployed to Kubernetes using Argo CD and uses the Lyft Flink operator.
We tried to do a release and found that, on deleting the application, some of the jobs became stuck in the "savepointing" phase.
The only way we could stop these stuck jobs was to patch the finalizers.
We deployed the new release, and on startup our application had lost its offsets, so all of the messages in Kafka were replayed.
Has anyone got any ideas how and why this happened and how we avoid it in the future?
Sincerely Basil Bibi



Authorised and regulated by the Financial Conduct Authority<https://register.fca.org.uk/> (FCA) number 923700. Humn.ai Ltd, 12 Hammersmith Grove, London, W6 7AP is a registered company number 11032616 incorporated in the United Kingdom. Registered with the information commissioner's office (ICO) number ZA504331.

This message contains confidential information and is intended only for the individual(s) addressed in the message. If you aren't the named addressee, you should not disseminate, distribute, or copy this e-mail.

Re: Job stuck in savePoint - entire topic replayed on restart.

Posted by Piotr Nowojski <pn...@apache.org>.

Hi Basil,

1. What do you mean by:
> The only way we could stop these stuck jobs was to patch the finalizers.
?
2. Do you mean that your job is stuck when doing stop-with-savepoint?
3. What Flink version are you using? Have you tried upgrading to the most
recent version, or at least to the most recent minor release? There have
been some bugs in stop-with-savepoint that have been fixed over time, for
example [1], [2] or [3]. Note that some of them might not be related to
your use case (Kinesis consumer or FLIP-27 sources).
4. If upgrading doesn't help, can you post stack traces of the task managers
that contain the stuck operators/tasks?
5. If you are working on a version that has fixed all of those bugs, are
you using some custom operators/sources/sinks? If your code is either
catching interrupts or making blocking calls, it might be prone to
bugs similar to [2] (please check the discussion in the ticket for more
information).
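
To illustrate point 5, here is a minimal sketch in plain Java (no Flink
dependencies) of a worker loop that should not get stuck on shutdown. The
class InterruptAwareLoop and all of its methods are hypothetical names made
up for this example; they are not Flink APIs and not taken from your job.
The idea is only that a blocking call should have a bounded wait and that
an interrupt should end the loop rather than be swallowed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the run loop of a custom source or operator.
class InterruptAwareLoop implements Runnable {
    private final BlockingQueue<String> input = new LinkedBlockingQueue<>();
    private volatile boolean running = true;
    private volatile boolean exited = false;

    @Override
    public void run() {
        try {
            while (running) {
                // Bounded wait instead of an indefinite blocking call, so the
                // running flag is re-checked at least every 100 ms.
                String record = input.poll(100, TimeUnit.MILLISECONDS);
                if (record != null) {
                    process(record);
                }
            }
        } catch (InterruptedException e) {
            // Do not swallow the interrupt: restore the flag and leave the loop.
            Thread.currentThread().interrupt();
        } finally {
            exited = true;
        }
    }

    // Cooperative shutdown: the loop notices the flag after the current poll.
    void cancel() {
        running = false;
    }

    boolean hasExited() {
        return exited;
    }

    private void process(String record) {
        // Placeholder for real work; intentionally empty in this sketch.
    }

    // Starts the loop on a new thread, stops it via interrupt or cancel(),
    // and reports whether the thread terminated within two seconds.
    static boolean stopsWithinTimeout(boolean viaInterrupt) {
        InterruptAwareLoop loop = new InterruptAwareLoop();
        Thread worker = new Thread(loop);
        worker.start();
        if (viaInterrupt) {
            worker.interrupt();
        } else {
            loop.cancel();
        }
        try {
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive() && loop.hasExited();
    }
}
```

A loop that instead catches InterruptedException and keeps iterating, or
that blocks without a timeout on something that never returns, is the kind
of code that can keep a task from finishing during stop-with-savepoint.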

Best,
Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-21028
[2] https://issues.apache.org/jira/browse/FLINK-17170
[3] https://issues.apache.org/jira/browse/FLINK-21133
