Posted to user@flink.apache.org by Jordan Hurwich <jh...@pulsasensors.com> on 2021/09/05 18:01:57 UTC

Flink jobs stuck in "RESTARTING" State, Jobmanager thrashing on "Could not fulfill resource requirements"

Hi Flink Users,
We're having an issue where Flink jobs reliably get stuck in the "RESTARTING"
state the first night after our deployments, with the Jobmanager logging
"Could not fulfill resource requirements of job <job-id>. Free slots: 0". This
is surprising because we expect the restarting job to reuse its existing slot,
and because the issue still appears to occur even after increasing from our
minimum required 2 slots to 6, though $ flink list indicates only 2 jobs.
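
For context, the slot count we raised is set in flink-conf.yaml (attached); the
line below is only an illustrative excerpt with the increased value, not the
full file:

    taskmanager.numberOfTaskSlots: 6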

Manually stopping all jobs and the jobmanager, and restarting each,
reliably addresses the issue and it does not appear to occur again. We
suspect some currently unknown root cause, but are particularly interested
in ensuring the Flink jobs can automatically recover from this state
without manual intervention.

Details below, with logs and flink-conf.yaml attached. Your support and
feedback are greatly appreciated,
Jordan, Pulsa Inc

   - $ flink --version: Version: 1.13.2, Commit ID: 5f007ff

   - What we observe
   - In this example we observe the issue beginning at about 2021-09-05
      03:20:34; logs for "standalonesession" and "taskexecutor" for the
      relevant time period are attached, more available on request.
      - The earliest unique indication of the issue (at 2021-09-05
      03:20:34) appears to be a failure of one of our two jobs ("Measure") to
      open a Postgres connection; the other job, "Signal", does not rely on
      Postgres. Logs are attached in taskexecutor.out.
         - This occurs in the open() of our custom RichSinkFunction
         "PGSink", where the exception is caught and we call
         Runtime.getRuntime().exit(1) (a minimal sketch of this pattern
         follows this list)
      - Moments before the issue starts (at 2021-09-05 03:19:19) we observe
      the TaskManager canceling the previous instance of this job, as indicated
      in taskexecutor.log
      - This corresponds with logs in standalonesession.log of "Trying to
         recover from a global failure...FlinkRuntimeException: Exceeded
         checkpoint tolerable failure threshold"
         - standalonesession.log indicates this exception happens multiple
         times before this event, but in the earlier cases automatic
         recovery seems to work correctly.
      - Moments after the issue starts (at 2021-09-05 03:20:37) in
      standalonesession.log we see another occurrence of "Exceeded
      checkpoint tolerable failure threshold", but the logs that follow it
      differ considerably from those after the previous occurrences.
         - At about this time and afterwards we repeatedly observe logs in
         standalonesession.log reporting "akka...NettyTransport...Remote
         connection to [null] failed with java.net.ConnectException: Connection
         refused: /172.30.172.30:40763"
      - In standalonesession.log we then see one occurrence of
      "...TimeoutException: Heartbeat of TaskManager with id
      172.30.172.30:40763-75a3ae timed out" (at 2021-09-05 03:21:19) - this
      exception never appears to be repeated.
      - Immediately following that exception the repeating failure loop
      seems to begin:
         - The first log of the loop, in standalonesession.log, appears to
         be "Discarding the results produced by task execution 905e1...";
         905e1... corresponds with the "TimeoutException: Heartbeat..."
         exception
         - The last log before repeating is "...CompletionException:
         ...NoResourceAvailableException: Could not acquire the minimum
         required resources", and this line is followed by "Discarding the
         results..." with the hash id matching this exception's.
         - This loop then repeats indefinitely.
      - During the issue, $ flink list on each machine reports the
      following:
      ------------------ Running/Restarting Jobs -------------------
      04.09.2021 16:03:39 : 3143a6c3ab0a5f58130384c13b2eca45 :
      FlinkSignalProcessor (RESTARTING)
      04.09.2021 16:45:08 : 4d531101945f0726432f73f37a38ee48 :
      FlinkMeasureProcessor (RESTARTING)
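
   For reference, the sketch below shows roughly the catch-and-exit pattern in
   PGSink.open() described in the list above. It is not our actual sink code:
   the record type, JDBC URL, credentials, and write logic are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Minimal sketch of the problematic pattern, not the real PGSink.
    public class PGSink extends RichSinkFunction<String> {

        private transient Connection connection;

        @Override
        public void open(Configuration parameters) {
            try {
                // Placeholder JDBC URL and credentials.
                connection = DriverManager.getConnection(
                        "jdbc:postgresql://db.internal:5432/measurements",
                        "flink", "password");
            } catch (Exception e) {
                // The problematic part: killing the whole TaskManager JVM
                // instead of letting Flink handle the task failure.
                Runtime.getRuntime().exit(1);
            }
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            // Write the record to Postgres (omitted here).
        }
    }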


   - When we observe the issue
      - On each deployment we use Terraform to deploy a cluster of "funnel"
      machines on AWS EC2, each running two Flink jobs, "...Measure..." and
      "...Signal...", that pull from an upstream RabbitMQ queue and output
      into our Postgres db and Redis cache respectively (a rough sketch of
      this wiring follows this list).
      - After deployment we always observe these jobs are functioning
      properly in the RUNNING state on each machine, but some time the night
      after deployment we observe that the jobs become nonfunctional, reporting
      RESTARTING indefinitely.
      - This appears to occur to both jobs on every machine in the cluster
      simultaneously, implying some currently unknown root cause.
      - Stopping the jobs and jobmanager on each machine, and restarting
      them, reliably fixes the issue and it does not occur again after that
      first night following deployment.
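
   For context, the wiring of the "Measure" job looks roughly like the sketch
   below. This is not our actual code: the queue name, connection settings,
   and class layout are stand-ins, and PGSink is the custom RichSinkFunction
   sketched above.

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.rabbitmq.RMQSource;
    import org.apache.flink.streaming.connectors.rabbitmq.common.RMQConnectionConfig;

    // Rough sketch of the "Measure" pipeline: RabbitMQ source -> Postgres sink.
    public class FlinkMeasureProcessor {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            RMQConnectionConfig rmqConfig = new RMQConnectionConfig.Builder()
                    .setHost("rabbitmq.internal")   // placeholder host
                    .setPort(5672)
                    .setVirtualHost("/")
                    .setUserName("flink")
                    .setPassword("password")        // placeholder credentials
                    .build();

            DataStream<String> measurements = env.addSource(
                    new RMQSource<>(rmqConfig, "measure-queue",
                            new SimpleStringSchema()),
                    "RabbitMQ measure source");

            // PGSink is our custom RichSinkFunction (sketched earlier).
            measurements.addSink(new PGSink());

            env.execute("FlinkMeasureProcessor");
        }
    }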

Re: Flink jobs stuck in "RESTARTING" State, Jobmanager thrashing on "Could not fulfill resource requirements"

Posted by Jordan Hurwich <jh...@pulsasensors.com>.
Thanks for the help Arvid. Going to make these changes now and expect this
will address the issue we've had.

Re: Flink jobs stuck in "RESTARTING" State, Jobmanager thrashing on "Could not fulfill resource requirements"

Posted by Arvid Heise <ar...@apache.org>.
Sorry for not immediately writing that: yes, you should just throw an
exception. Flink will then try to close everything and restart the job if
the job is configured for restarts. [1]

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/
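
To illustrate, here is a minimal sketch reusing the shape of the PGSink sketch
from earlier in the thread (connection details are placeholders): no catch, no
exit(1), just let open() throw.

    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class PGSink extends RichSinkFunction<String> {

        private transient Connection connection;

        @Override
        public void open(Configuration parameters) throws Exception {
            // If this throws, the task fails and the configured restart
            // strategy takes over; no need to touch the JVM.
            connection = DriverManager.getConnection(
                    "jdbc:postgresql://db.internal:5432/measurements",
                    "flink", "password");
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            // Write the record to Postgres (omitted).
        }
    }

The restart behaviour itself is controlled by the restart strategy, e.g. in
flink-conf.yaml (the values below are only examples, see [1]):

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10 s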

Re: Flink jobs stuck in "RESTARTING" State, Jobmanager thrashing on "Could not fulfill resource requirements"

Posted by Jordan Hurwich <jh...@pulsasensors.com>.
Thanks very much for the quick response, Arvid.

I'm removing our calls to Runtime.getRuntime().exit(1) now. If something
goes wrong when setting up the task, what's the preferred way to shut down
gracefully so the Taskmanager can restart the job successfully? Should I
just allow an Exception to propagate up, or is there another method I
should call instead of exit()?

Re: Flink jobs stuck in "RESTARTING" State, Jobmanager thrashing on "Could not fulfill resource requirements"

Posted by Arvid Heise <ar...@apache.org>.
Hi Jordan,

Please never call Runtime.getRuntime().exit(1). This will rob Flink of any
way of graceful shutdown and recovery. Since you mentioned EC2 (and not a
managed service such as ECS or K8S), I'm also assuming that the task
manager process is never restarted and that's why you always end up with
too few resources.

If you have any other way of restarting the task manager, please let me
know and I would look more closely into the logs.
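
One example of such a way (an assumption about your setup, since I don't know
how the EC2 machines are provisioned) would be to run the TaskManager under a
process supervisor such as systemd, so the process is restarted whenever it
exits:

    # Hypothetical systemd unit; the paths and the flink user are assumptions.
    [Unit]
    Description=Flink TaskManager
    After=network.target

    [Service]
    User=flink
    ExecStart=/opt/flink/bin/taskmanager.sh start-foreground
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target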


