Posted to user@flink.apache.org by burgesschen <tc...@bloomberg.net> on 2020/01/14 16:24:35 UTC

Slots Leak Observed when

Hi guys,

Our team is observing a stability issue on our Standalone Flink clusters.

Background: The Kafka cluster our Flink jobs read from and write to has some
issues, and every 10 to 15 minutes one of the partition leaders switches. This
causes the jobs that read from or write to that topic to fail and restart.
Usually this is not a problem, since the jobs can restart and work with the
new partition leader. However, one of those restarts can put the jobs into a
failing state from which they never recover.
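
A fixed-delay restart strategy with a longer delay can at least reduce the
restart churn while a leader election settles. A minimal flink-conf.yaml
sketch (the attempt count and delay here are placeholders, not values we run):

    restart-strategy: fixed-delay
    # placeholder values -- tune for your jobs
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 30 s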

In the failing state, the JobManager has this exception:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 24, slots allocated: 12
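
If I read the configuration docs correctly, the 300000 ms comes from
slot.request.timeout in flink-conf.yaml. Raising it only delays the failure
rather than fixing a leak, but for reference:

    # default is 300000 (5 minutes)
    slot.request.timeout: 600000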

During that time, two of the TaskManagers report that all of their slots are
occupied; however, according to the JobManager dashboard, no job is deployed
to those two TaskManagers.
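
The same slot counts are also exposed through the JobManager REST API, which
makes the check easier to script than the dashboard. For example (the host,
port, and TaskManager id are placeholders):

    # list all TaskManagers with their total and free slot counts
    curl http://<jobmanager-host>:8081/taskmanagers

    # slot details for a single TaskManager
    curl http://<jobmanager-host>:8081/taskmanagers/<taskmanager-id>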

My guess is that, since the jobs restart fairly frequently, on one of those
restarts the slots were not released properly when the jobs failed, leaving
the JobManager falsely believing that those two TaskManagers' slots are still
occupied.

It does sound like the issue described in
https://issues.apache.org/jira/browse/FLINK-9932
but we are using 1.6.2, and according to the JIRA ticket that bug was fixed
in 1.6.2.

Please let me know if you have any ideas about what causes this or how we can
prevent it. Thank you so much!





Re: Slots Leak Observed when

Posted by Till Rohrmann <tr...@apache.org>.
Hi,

have you tried one of the latest Flink versions to see whether the problem
still exists? I'm asking because there are some improvements which allow
for slot reconciliation between the TaskManager and the JobMaster [1]. As a
side note, the community is no longer supporting Flink 1.6.x.

For further debugging, the DEBUG logs would be necessary.
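
Something like this in log4j.properties on the JobManager and the
TaskManagers should be enough (scoping DEBUG to the runtime package keeps the
log volume manageable; adjust to your logging setup):

    # keep the root logger at INFO and enable DEBUG only for the Flink
    # runtime, which includes the slot management components
    log4j.rootLogger=INFO, file
    log4j.logger.org.apache.flink.runtime=DEBUG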

[1] https://issues.apache.org/jira/browse/FLINK-11059

Cheers,
Till


Re: Slots Leak Observed when

Posted by Xintong Song <to...@gmail.com>.
Hi,
It would be helpful for understanding the problem if you could share the
logs.

Thank you~

Xintong Song


