Posted to user@flink.apache.org by Gyula Fóra <gy...@gmail.com> on 2023/01/21 22:43:03 UTC

Job stuck in CREATED state with scheduling failures

Hi Devs!

We noticed a very strange failure scenario a few times recently with the
Native Kubernetes integration.

The issue is triggered by a heartbeat timeout (a temporary network
problem). We observe the following behaviour:

===================================
3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):

1. Temporary network problem
 - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
 - Both the JM and TM1 trigger the job failure on their sides and cancel
the tasks
 - JM releases TM1 slots

2. While failing/cancelling the job, the network connection recovers and
TM1 reconnects to JM:
*TM1: Resolved JobManager address, beginning registration*

3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps
failing as it cannot seem to allocate all the resources:

*NoResourceAvailableException: Slot request bulk is not fulfillable! Could
not allocate the required slot within slot request timeout*
On TM1 we see the following logs repeating (multiple times every few
seconds until the slot request times out after 5 minutes):
*Receive slot request ... for job ... from resource manager with leader id
...*
*Allocated slot for ...*
*Receive slot request ... for job ... from resource manager with leader id
...*
*Allocated slot for ....*
*Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
ResourceProfile{...}, allocationId: ..., jobId: ...).*

While all of this is happening on TM1, we don't see any allocation-related
INFO logs on TM2.
===================================

Seems like something weird happens when TM1 reconnects after the heartbeat
loss. I feel that the JM should probably shut down the TM and create a new
one. But instead it gets stuck.
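
For reference, the two timeouts that frame this scenario can be pinned
explicitly. Below is a minimal, illustrative sketch (the class name is made
up), assuming the Flink 1.15 ConfigOption classes HeartbeatManagerOptions
and JobManagerOptions and their defaults (heartbeat.timeout = 50 s,
slot.request.timeout = 300 s, i.e. the 5 minutes after which the scheduler
gives up above):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;
import org.apache.flink.configuration.JobManagerOptions;

public class SchedulingTimeoutsSketch {

    public static Configuration timeouts() {
        Configuration conf = new Configuration();

        // heartbeat.timeout: how long JM and TM wait before declaring the
        // peer lost (the trigger for step 1 above). Default: 50000 ms.
        conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 50_000L);

        // slot.request.timeout: how long the scheduler waits for slots
        // before failing with NoResourceAvailableException. Default:
        // 300000 ms, matching the 5 minute timeout observed above.
        conf.set(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 300_000L);

        return conf;
    }
}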

Any ideas what could be happening here?

Thanks
Gyula

Re: Job stuck in CREATED state with scheduling failures

Posted by Divya Sanghi <bi...@gmail.com>.
How do I unsubscribe from the mailing list?

On Thu, May 18, 2023 at 1:03 AM Gyula Fóra <gy...@gmail.com> wrote:

> Hey Devs!
>
> I am bumping this thread to see if someone has any ideas how to go about
> solving this.
>
> Yang Wang earlier had this comment but I am not sure how to proceed:
>
> "From the logs you have provided, I find a potential bug in the current
> leader retrieval. In DefaultLeaderRetrievalService , if the leader
> information does not change, we will not notify the listener. It is indeed
> correct in all-most scenarios and could save some following heavy
> operations. But in the current case, it might be the root cause. For TM1,
> we added 00000000000000000000000000000002 for job leader monitoring at
> 2023-01-18 05:31:23,848. However, we never get the next expected log
> “Resolved JobManager address, beginning registration”. It just because the
> leader information does not change. So the TM1 got stuck at waiting for the
> leader and never registered to the JM. Finally, the job failed with no
> enough slots."
>
> I wonder if someone could maybe confirm the current behaviour.
>
> Thanks
> Gyula
>
> On Mon, Jan 23, 2023 at 4:06 PM Tamir Sagi <Ta...@niceactimize.com>
> wrote:
>
>> Hey Gyula,
>>
>> We encountered similar issues recently. Our Flink stream application
>> clusters (v1.15.2) are running in AWS EKS.
>>
>>
>>    1. TM gets disconnected sporadically and never returns.
>>
>> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with
>> id aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.
>>
>>     at
>> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)
>>
>>     at
>> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)
>>
>> heartbeat.timeout is set to 15 minutes.
>>
>>
>> There are some heartbeat updates on Flink web-UI
>>
>>
>> There are not enough logs about it and no indication of OOM whatsoever
>> within k8s. However, we increased the TMs' memory, and the issue seems to
>> be resolved for now (though it might hide a bigger issue).
>>
>> The second issue is a 'NoResourceAvailableException' with the
>> following error message:
>> Caused by:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout (Enclosed log files.)
>>
>> I also found this unresolved ticket [1] with a suggestion by @Yang Wang
>> <da...@gmail.com>, which seems to be working so far.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-25649
>>
>> Any thoughts?
>>
>> Thanks,
>> Tamir.
>>
>> ------------------------------
>> *From:* Gyula Fóra <gy...@gmail.com>
>> *Sent:* Sunday, January 22, 2023 12:43 AM
>> *To:* user <us...@flink.apache.org>
>> *Subject:* Job stuck in CREATED state with scheduling failures
>>
>>
>> *EXTERNAL EMAIL*
>>
>>
>> Hi Devs!
>>
>> We noticed a very strange failure scenario a few times recently with the
>> Native Kubernetes integration.
>>
>> The issue is triggered by a heartbeat timeout (a temporary network
>> problem). We observe the following behaviour:
>>
>> ===================================
>> 3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):
>>
>> 1. Temporary network problem
>>  - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
>>  - Both the JM and TM1 trigger the job failure on their sides and cancel
>> the tasks
>>  - JM releases TM1 slots
>>
>> 2. While failing/cancelling the job, the network connection recovers and
>> TM1 reconnects to JM:
>> *TM1: Resolved JobManager address, beginning registration*
>>
>> 3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps
>> failing as it cannot seem to allocate all the resources:
>>
>> *NoResourceAvailableException: Slot request bulk is not fulfillable!
>> Could not allocate the required slot within slot request timeout *
>> On TM1 we see the following logs repeating (multiple times every few
>> seconds until the slot request times out after 5 minutes):
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ....*
>> *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{...}, allocationId: ..., jobId: ...).*
>>
>> While all of this is happening on TM1, we don't see any allocation-related
>> INFO logs on TM2.
>> ===================================
>>
>> Seems like something weird happens when TM1 reconnects after the
>> heartbeat loss. I feel that the JM should probably shut down the TM and
>> create a new one. But instead it gets stuck.
>>
>> Any ideas what could be happening here?
>>
>> Thanks
>> Gyula
>>
>>
>> Confidentiality: This communication and any attachments are intended for
>> the above-named persons only and may be confidential and/or legally
>> privileged. Any opinions expressed in this communication are not
>> necessarily those of NICE Actimize. If this communication has come to you
>> in error you must take no action based on it, nor must you copy or show it
>> to anyone; please delete/destroy and inform the sender by e-mail
>> immediately.
>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>> Viruses: Although we have taken steps toward ensuring that this e-mail
>> and attachments are free from any virus, we advise that in keeping with
>> good computing practice the recipient should ensure they are actually virus
>> free.
>>
>

Re: Job stuck in CREATED state with scheduling failures

Posted by Matthias Pohl via user <us...@flink.apache.org>.
Hi Gyula,
Could you share the logs in the ML? Or is there a Jira issue I missed?

Matthias

On Wed, May 17, 2023 at 9:33 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hey Devs!
>
> I am bumping this thread to see if someone has any ideas how to go about
> solving this.
>
> Yang Wang earlier had this comment but I am not sure how to proceed:
>
> "From the logs you have provided, I find a potential bug in the current
> leader retrieval. In DefaultLeaderRetrievalService , if the leader
> information does not change, we will not notify the listener. It is indeed
> correct in all-most scenarios and could save some following heavy
> operations. But in the current case, it might be the root cause. For TM1,
> we added 00000000000000000000000000000002 for job leader monitoring at
> 2023-01-18 05:31:23,848. However, we never get the next expected log
> “Resolved JobManager address, beginning registration”. It just because the
> leader information does not change. So the TM1 got stuck at waiting for the
> leader and never registered to the JM. Finally, the job failed with no
> enough slots."
>
> I wonder if someone could maybe confirm the current behaviour.
>
> Thanks
> Gyula
>
> On Mon, Jan 23, 2023 at 4:06 PM Tamir Sagi <Ta...@niceactimize.com>
> wrote:
>
>> Hey Gyula,
>>
>> We encountered similar issues recently. Our Flink stream application
>> clusters (v1.15.2) are running in AWS EKS.
>>
>>
>>    1. TM gets disconnected sporadically and never returns.
>>
>> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with
>> id aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.
>>
>>     at
>> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)
>>
>>     at
>> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)
>>
>> heartbeat.timeout is set to 15 minutes.
>>
>>
>> There are some heartbeat updates on Flink web-UI
>>
>>
>> There are not enough logs about it and no indication of OOM whatsoever
>> within k8s. However, we increased the TMs' memory, and the issue seems to
>> be resolved for now (though it might hide a bigger issue).
>>
>> The second issue is a 'NoResourceAvailableException' with the
>> following error message:
>> Caused by:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout (Enclosed log files.)
>>
>> I also found this unresolved ticket [1] with a suggestion by @Yang Wang
>> <da...@gmail.com>, which seems to be working so far.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-25649
>>
>> Any thoughts?
>>
>> Thanks,
>> Tamir.
>>
>> ------------------------------
>> *From:* Gyula Fóra <gy...@gmail.com>
>> *Sent:* Sunday, January 22, 2023 12:43 AM
>> *To:* user <us...@flink.apache.org>
>> *Subject:* Job stuck in CREATED state with scheduling failures
>>
>>
>> *EXTERNAL EMAIL*
>>
>>
>> Hi Devs!
>>
>> We noticed a very strange failure scenario a few times recently with the
>> Native Kubernetes integration.
>>
>> The issue is triggered by a heartbeat timeout (a temporary network
>> problem). We observe the following behaviour:
>>
>> ===================================
>> 3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):
>>
>> 1. Temporary network problem
>>  - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
>>  - Both the JM and TM1 trigger the job failure on their sides and cancel
>> the tasks
>>  - JM releases TM1 slots
>>
>> 2. While failing/cancelling the job, the network connection recovers and
>> TM1 reconnects to JM:
>> *TM1: Resolved JobManager address, beginning registration*
>>
>> 3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps
>> failing as it cannot seem to allocate all the resources:
>>
>> *NoResourceAvailableException: Slot request bulk is not fulfillable!
>> Could not allocate the required slot within slot request timeout *
>> On TM1 we see the following logs repeating (multiple times every few
>> seconds until the slot request times out after 5 minutes):
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ....*
>> *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{...}, allocationId: ..., jobId: ...).*
>>
>> While all of this is happening on TM1, we don't see any allocation-related
>> INFO logs on TM2.
>> ===================================
>>
>> Seems like something weird happens when TM1 reconnects after the
>> heartbeat loss. I feel that the JM should probably shut down the TM and
>> create a new one. But instead it gets stuck.
>>
>> Any ideas what could be happening here?
>>
>> Thanks
>> Gyula
>>
>>
>> Confidentiality: This communication and any attachments are intended for
>> the above-named persons only and may be confidential and/or legally
>> privileged. Any opinions expressed in this communication are not
>> necessarily those of NICE Actimize. If this communication has come to you
>> in error you must take no action based on it, nor must you copy or show it
>> to anyone; please delete/destroy and inform the sender by e-mail
>> immediately.
>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>> Viruses: Although we have taken steps toward ensuring that this e-mail
>> and attachments are free from any virus, we advise that in keeping with
>> good computing practice the recipient should ensure they are actually virus
>> free.
>>
>

Re: Job stuck in CREATED state with scheduling failures

Posted by Gyula Fóra <gy...@gmail.com>.
Hey Devs!

I am bumping this thread to see if someone has any ideas how to go about
solving this.

Yang Wang earlier had this comment but I am not sure how to proceed:

"From the logs you have provided, I find a potential bug in the current
leader retrieval. In DefaultLeaderRetrievalService , if the leader
information does not change, we will not notify the listener. It is indeed
correct in all-most scenarios and could save some following heavy
operations. But in the current case, it might be the root cause. For TM1,
we added 00000000000000000000000000000002 for job leader monitoring at
2023-01-18 05:31:23,848. However, we never get the next expected log
“Resolved JobManager address, beginning registration”. It just because the
leader information does not change. So the TM1 got stuck at waiting for the
leader and never registered to the JM. Finally, the job failed with no
enough slots."
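
To make the suspected behaviour easier to discuss, here is a minimal,
simplified Java sketch of the guard Yang Wang describes (an illustration
only, not the actual DefaultLeaderRetrievalService source): when the newly
retrieved leader information equals the cached one, the listener callback
that would normally lead to "Resolved JobManager address, beginning
registration" is skipped.

import java.util.Objects;

// Simplified illustration only; the real logic lives in Flink's
// DefaultLeaderRetrievalService.
class LeaderRetrievalGuardSketch {

    private String lastLeaderAddress;
    private String lastLeaderSessionId;

    interface Listener {
        void notifyLeaderAddress(String address, String sessionId);
    }

    void onLeaderInformation(String address, String sessionId, Listener listener) {
        // Unchanged leader information: the listener is NOT notified, so a
        // TM that silently lost and re-established its connection never
        // restarts the job leader registration.
        if (Objects.equals(address, lastLeaderAddress)
                && Objects.equals(sessionId, lastLeaderSessionId)) {
            return;
        }
        lastLeaderAddress = address;
        lastLeaderSessionId = sessionId;
        listener.notifyLeaderAddress(address, sessionId);
    }
}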

I wonder if someone could maybe confirm the current behaviour.

Thanks
Gyula

On Mon, Jan 23, 2023 at 4:06 PM Tamir Sagi <Ta...@niceactimize.com>
wrote:

> Hey Gyula,
>
> We encountered similar issues recently. Our Flink stream application
> clusters (v1.15.2) are running in AWS EKS.
>
>
>    1. TM gets disconnected sporadically and never returns.
>
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id
> aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.
>
>     at
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)
>
>     at
> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)
>
> heartbeat.timeout is set to 15 minutes.
>
>
> There are some heartbeat updates on Flink web-UI
>
>
> There are not enough logs about it and no indication of OOM whatsoever
> within k8s. However, we increased the TMs' memory, and the issue seems to
> be resolved for now (though it might hide a bigger issue).
>
> The second issue is a 'NoResourceAvailableException' with the
> following error message:
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout (Enclosed log files.)
>
> I also found this unresolved ticket [1] with a suggestion by @Yang Wang
> <da...@gmail.com>, which seems to be working so far.
>
> [1] https://issues.apache.org/jira/browse/FLINK-25649
>
> Any thoughts?
>
> Thanks,
> Tamir.
>
> ------------------------------
> *From:* Gyula Fóra <gy...@gmail.com>
> *Sent:* Sunday, January 22, 2023 12:43 AM
> *To:* user <us...@flink.apache.org>
> *Subject:* Job stuck in CREATED state with scheduling failures
>
>
> *EXTERNAL EMAIL*
>
>
> Hi Devs!
>
> We noticed a very strange failure scenario a few times recently with the
> Native Kubernetes integration.
>
> The issue is triggered by a heartbeat timeout (a temporary network
> problem). We observe the following behaviour:
>
> ===================================
> 3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):
>
> 1. Temporary network problem
>  - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
>  - Both the JM and TM1 trigger the job failure on their sides and cancel
> the tasks
>  - JM releases TM1 slots
>
> 2. While failing/cancelling the job, the network connection recovers and
> TM1 reconnects to JM:
> *TM1: Resolved JobManager address, beginning registration*
>
> 3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps
> failing as it cannot seem to allocate all the resources:
>
> *NoResourceAvailableException: Slot request bulk is not fulfillable! Could
> not allocate the required slot within slot request timeout *
> On TM1 we see the following logs repeating (multiple times every few
> seconds until the slot request times out after 5 minutes):
> *Receive slot request ... for job ... from resource manager with leader id
> ...*
> *Allocated slot for ...*
> *Receive slot request ... for job ... from resource manager with leader id
> ...*
> *Allocated slot for ....*
> *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
> ResourceProfile{...}, allocationId: ..., jobId: ...).*
>
> While all of this is happening on TM1, we don't see any allocation-related
> INFO logs on TM2.
> ===================================
>
> Seems like something weird happens when TM1 reconnects after the heartbeat
> loss. I feel that the JM should probably shut down the TM and create a new
> one. But instead it gets stuck.
>
> Any ideas what could be happening here?
>
> Thanks
> Gyula
>
>
> Confidentiality: This communication and any attachments are intended for
> the above-named persons only and may be confidential and/or legally
> privileged. Any opinions expressed in this communication are not
> necessarily those of NICE Actimize. If this communication has come to you
> in error you must take no action based on it, nor must you copy or show it
> to anyone; please delete/destroy and inform the sender by e-mail
> immediately.
> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
> Viruses: Although we have taken steps toward ensuring that this e-mail and
> attachments are free from any virus, we advise that in keeping with good
> computing practice the recipient should ensure they are actually virus free.
>

Re: Job stuck in CREATED state with scheduling failures

Posted by Tamir Sagi <Ta...@niceactimize.com>.
Hey Gyula,

We encountered similar issues recently. Our Flink stream application clusters (v1.15.2) are running in AWS EKS.


  1.  TM gets disconnected sporadically and never returns.

org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.

    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)

    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)

heartbeat.timeout is set to 15 minutes.
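
(For context, a minimal sketch of how such a 15 minute heartbeat timeout
would be set programmatically; the helper class name is made up, and it
assumes the standard heartbeat.* option keys, where the interval must stay
well below the timeout.)

import java.time.Duration;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;

// Hypothetical helper mirroring the heartbeat.timeout = 15 min setup above.
public class HeartbeatSettingsSketch {

    public static Configuration fifteenMinuteTimeout() {
        Configuration conf = new Configuration();
        // Default heartbeat.interval is 10 s; keep it far below the timeout.
        conf.set(HeartbeatManagerOptions.HEARTBEAT_INTERVAL, Duration.ofSeconds(10).toMillis());
        // 15 minutes, as described above.
        conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, Duration.ofMinutes(15).toMillis());
        return conf;
    }
}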

There are some heartbeat updates on Flink web-UI

There are not enough logs about it and no indication of OOM whatsoever within k8s. However, we increased the TMs' memory, and the issue seems to be resolved for now (though it might hide a bigger issue).

The second issue is a 'NoResourceAvailableException' with the following error message:
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout (Enclosed log files.)

I also found this unresolved ticket [1] with a suggestion by @Yang Wang <ma...@gmail.com>, which seems to be working so far.

[1] https://issues.apache.org/jira/browse/FLINK-25649

Any thoughts?

Thanks,
Tamir.

________________________________
From: Gyula Fóra <gy...@gmail.com>
Sent: Sunday, January 22, 2023 12:43 AM
To: user <us...@flink.apache.org>
Subject: Job stuck in CREATED state with scheduling failures


EXTERNAL EMAIL


Hi Devs!

We noticed a very strange failure scenario a few times recently with the Native Kubernetes integration.

The issue is triggered by a heartbeat timeout (a temporary network problem). We observe the following behaviour:

===================================
3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):

1. Temporary network problem
 - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
 - Both the JM and TM1 trigger the job failure on their sides and cancel the tasks
 - JM releases TM1 slots

2. While failing/cancelling the job, the network connection recovers and TM1 reconnects to JM:
TM1: Resolved JobManager address, beginning registration

3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps failing as it cannot seem to allocate all the resources:
NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

On TM1 we see the following logs repeating (multiple times every few seconds until the slot request times out after 5 minutes):
Receive slot request ... for job ... from resource manager with leader id ...
Allocated slot for ...
Receive slot request ... for job ... from resource manager with leader id ...
Allocated slot for ....
Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{...}, allocationId: ..., jobId: ...).

While all of this is happening on TM1, we don't see any allocation-related INFO logs on TM2.
===================================

Seems like something weird happens when TM1 reconnects after the heartbeat loss. I feel that the JM should probably shut down the TM and create a new one. But instead it gets stuck.

Any ideas what could be happening here?

Thanks
Gyula

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately.
Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.