Posted to user@flink.apache.org by "Kaymak, Tobias" <to...@ricardo.ch> on 2021/09/24 12:52:58 UTC

Upgrading from 1.11.3 -> 1.13.1 - random jobs stay in "CREATED" state, then fail with "Slot request bulk is not fulfillable!"

Hi,

I am trying to upgrade our Flink cluster from version 1.11.3 to 1.13.1.
We use it to execute over 40 pipelines written with Apache Beam 2.32.0.

While moving the pipelines one by one over to the new cluster, I noticed at
some point that it did not start a new pipeline after I had moved about 20.

4 TMs with 8 slots each are running, giving 32 slots in total.

When I kill the JobManager pod to make it reload the config, a random
pipeline then gets stuck in the CREATED state. No log is shown, but after a
few minutes the following error appears:

Slot request bulk is not fulfillable! Could not allocate the required slot
within slot request timeout
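
(For what it's worth: if I read the defaults correctly, the "few minutes" should
match the slot request timeout, which I believe is controlled by the following
setting and defaults to 5 minutes; the exact key and default are my assumption,
not something I verified on this cluster:

slot.request.timeout: 300000

i.e. 300000 ms = 5 minutes before a pending slot request is given up on.)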

I found this post:
http://mail-archives.apache.org/mod_mbox/flink-issues/202106.mbox/%3CJIRA.13382840.1623216280000.576520.1623246960369@Atlassian.JIRA%3E

However, I am running the official Flink Docker images, and the TM and JM
versions are in sync.

I checked that there is no memory pressure on TM and JM:
[attached: screenshots of TM and JM memory usage]

Any advice on how to debug this situation?

jobmanager.memory.heap.size: 3500m
jobmanager.memory.jvm-overhead.max: 1536m
jobmanager.memory.process.size: 5gb
jobmanager.memory.off-heap.size: 512m
jobmanager.memory.jvm-metaspace.size: 512m

taskmanager.memory.process.size: 54gb
taskmanager.memory.jvm-metaspace.size: 2gb
taskmanager.memory.task.off-heap.size: 2gb
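
(The slot layout itself is not part of the snippet above; assuming it is set the
usual way, the 8 slots per TM would come from something like

taskmanager.numberOfTaskSlots: 8

which with 4 TMs gives the 32 slots mentioned earlier.)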

Best,
Tobi

RE: Upgrading from 1.11.3 -> 1.13.1 - random jobs stay in "CREATED" state, then fail with "Slot request bulk is not fulfillable!"

Posted by Schwalbe Matthias <Ma...@viseca.ch>.
Hi Tobias,

If your number of pipelines equals the number of Flink jobs, then this is exactly what you should observe:
Each Flink job takes one slot per unit of parallelism, so even with parallelism 1 you would have to provide at least 40 slots for 40 jobs (quick calculation below).

… independent of the Flink version

… for Beam on Flink I'm not sure, but I assume the same applies
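
Rough math with the numbers from this thread, assuming every job runs with
parallelism 1:

  slots available = 4 TMs x 8 slots         = 32
  slots required  = 40 jobs x parallelism 1 = 40
  shortfall       = 40 - 32                 =  8

So roughly the last 8 jobs would never get a slot and would eventually fail
with the "not fulfillable" error.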


Thias



Re: Upgrading from 1.11.3 -> 1.13.1 - random jobs stay in "CREATED" state, then fail with "Slot request bulk is not fulfillable!"

Posted by "Kaymak, Tobias" <to...@ricardo.ch>.
Ok thank you, I will test 1.13.3 then! :)


Re: Upgrading from 1.11.3 -> 1.13.1 - random jobs stay in "CREATED" state, then fail with "Slot request bulk is not fulfillable!"

Posted by Guowei Ma <gu...@gmail.com>.
Sorry for the typo. What I wanted to point to is this one:
https://issues.apache.org/jira/browse/FLINK-24005

Best,
Guowei



Re: Upgrading from 1.11.3 -> 1.13.1 - random jobs stay in "CREATED" state, then fail with "Slot request bulk is not fulfillable!"

Posted by Guowei Ma <gu...@gmail.com>.
Hi Tobi

I understand the question you want to ask is: why does a job get stuck and
not get scheduled after a failover occurs? (Correct me if I am missing something.)
Since there is no specific log, I cannot give you an accurate answer.
But I have encountered a similar problem in our production. According to
our analysis, it is mainly caused by
https://issues.apache.org/jira/browse/FLINK-22938
This will be resolved in 1.13.3. Alternatively, you can apply the patch and test it yourself.
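
One way to tell this scheduling issue apart from a genuine slot shortage (just a
sketch, assuming the default REST port 8081 on the JobManager) is to look at the
cluster overview while a job is stuck in CREATED:

curl -s http://<jobmanager-host>:8081/overview
curl -s http://<jobmanager-host>:8081/taskmanagers

If slots-available in the overview is greater than 0 but the job still does not
get scheduled, that points to the scheduling bug above rather than to simply
running out of slots.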

Best,
Guowei

