Posted to user@flink.apache.org by Ori Popowski <or...@gmail.com> on 2022/10/02 08:34:23 UTC

Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

Hi,

We're using Flink 2.10.2 on Google Dataproc.

Lately we've been experiencing a very unusual problem: the job fails, and
when it tries to recover we get this error:

Slot request bulk is not fulfillable! Could not allocate the required slot
within slot request timeout

I investigated what happened and found that the failure was caused by a
heartbeat timeout to one of the containers. Looking at the container's logs,
I saw something unusual:

   1. Eight minutes before the heartbeat timeout, the logs show connection
   problems to the Confluent Kafka topic and also to Datadog, which suggests
   a network issue affecting either the whole node or just that specific container.
   2. The container logs disappear at this point, but the node logs show
   multiple Garbage Collection pauses, ranging from 10 seconds to 215 (!)
   seconds (see the GC-logging sketch after this list).
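
To pin down where those pauses come from, one option is to enable GC logging
on the TaskManager JVMs. A minimal flink-conf.yaml sketch, assuming JDK 8
style flags (the log path is just an example; the flags differ on JDK 11+):

    env.java.opts.taskmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/taskmanager-gc.log"

If the TaskManager GC logs stay quiet while the node logs still show long
pauses, the pauses are likely coming from another JVM on the node, for
example the YARN NodeManager.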

It looks like right after the network issue the node itself gets into an
endless GC phase. My theory is that the slots are not fulfillable because
the node is effectively unavailable while it is stuck in that endless GC.
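
For reference, the timeouts involved here can be tuned in flink-conf.yaml.
A minimal sketch showing the defaults, assuming Flink 1.12 (keys and default
values may differ in other versions, and raising them only buys time during
long pauses, it does not fix the underlying GC problem):

    heartbeat.interval: 10000       # ms between heartbeats to TaskManagers
    heartbeat.timeout: 50000        # ms before a TaskManager is declared lost
    slot.request.timeout: 300000    # ms before a pending slot request fails

With these defaults, a node stuck in GC for more than 50 seconds is declared
dead, and if no replacement slot turns up within 5 minutes the scheduler
fails with the "slot request bulk is not fulfillable" error above.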

I want to note that we've been running this job for months without any
issues. The issues started one month ago, arbitrarily: they did not follow a
Flink version upgrade, a job code upgrade, a change in the amount or type of
data being processed, or a Dataproc image version change.

Attached are the job manager logs, container logs, and node logs.

How can we recover from this issue?

Thanks!

Re: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

Posted by Ori Popowski <or...@gmail.com>.
Martijn, as I said, the problem is with GC; the network issue is just a
symptom.

I just wanted to say that after a lot of troubleshooting that didn't yield
any insight, we decided to use the YARN Node Labels feature to run the job
only on Google Dataproc's secondary workers. The problem went away
completely, and my only conclusion is that the YARN daemons on the primary
workers were indeed the culprit. We will let Google Cloud know about this.
Unfortunately, with the current configuration we run two redundant machines
(the two mandatory primary workers, which don't do anything), so this is
only a temporary fix until we discover the root cause.
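
For anyone who wants to reproduce this workaround, this is roughly what the
node-label setup looks like. A minimal sketch, assuming a label named
"flink", placeholder hostnames for the secondary workers, and Flink's
yarn.application.node-label option; the scheduler queue also needs to be
granted access to the label in the capacity-scheduler config:

    # yarn-site.xml (or Dataproc --properties at cluster creation)
    yarn.node-labels.enabled = true
    yarn.node-labels.fs-store.root-dir = hdfs:///yarn/node-labels

    # register the label and attach it to the secondary workers
    yarn rmadmin -addToClusterNodeLabels "flink(exclusive=false)"
    yarn rmadmin -replaceLabelsOnNode "secondary-worker-0=flink secondary-worker-1=flink"

    # flink-conf.yaml: submit the Flink application under that label
    yarn.application.node-label: flink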


On Tue, Oct 4, 2022 at 8:23 PM Martijn Visser <ma...@apache.org>
wrote:

> Hi Ori,
>
> Thanks for reaching out! I do fear that there's not much that we can help
> out with. As you mentioned, it looks like there's a network issue, which
> would be on the Google side of things. I'm assuming that the mentioned
> Flink version corresponds with Flink 1.12 [1], which isn't supported in the
> Flink community anymore. Are you restarting the job from a savepoint or
> starting fresh without state at all?
>
> Best regards,
>
> Martijn
>
> [1]
> https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0
>
> On Sun, Oct 2, 2022 at 3:38 AM Ori Popowski <or...@gmail.com> wrote:
>
>> Hi,
>>
>> We're using Flink 2.10.2 on Google Dataproc.
>>
>> Lately we experience a very unusual problem: the job fails and when it's
>> trying to recover we get this error:
>>
>> Slot request bulk is not fulfillable! Could not allocate the required
>> slot within slot request timeout
>>
>> I investigated what happened and I saw that the failure is caused by a
>> heartbeat timeout to one of the containers. I looked at the container's
>> logs and I saw something unusual:
>>
>>    1. Eight minutes before the heartbeat timeout the logs show
>>    connection problems to the Confluent Kafka topic and also to Datadog, which
>>    means there's a network issue with the whole node or just the specific
>>    container.
>>    2. The container logs disappear at this point, but the node logs show
>>    multiple Garbage Collection pauses, ranging from 10 seconds to 215 (!)
>>    seconds.
>>
>> It looks like right after the network issue the node itself gets into an
>> endless GC phase. My theory is that the slots are not fulfillable because
>> the node is effectively unavailable while it is stuck in that endless GC.
>>
>> I want to note that we've been running this job for months without any
>> issues. The issues started one month ago arbitrarily, not following a Flink
>> version upgrade, job code upgrade, change in amount or type of data being
>> processed, or a Dataproc image version change.
>>
>> Attached are the job manager logs, container logs, and node logs.
>>
>> How can we recover from this issue?
>>
>> Thanks!
>>
>>

Re: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

Posted by Martijn Visser <ma...@apache.org>.
Hi Ori,

Thanks for reaching out! I do fear that there's not much that we can help
out with. As you mentioned, it looks like there's a network issue, which
would be on the Google side of things. I'm assuming that the mentioned
Flink version corresponds with Flink 1.12 [1], which isn't supported in the
Flink community anymore. Are you restarting the job from a savepoint or
starting fresh without state at all?
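
For completeness, a minimal sketch of the two options from the CLI (the job
ID, savepoint directory, and jar name below are just placeholders):

    # take a savepoint of the running job, then resume from it
    flink savepoint <jobId> hdfs:///flink/savepoints
    flink run -d -s hdfs:///flink/savepoints/savepoint-XXXX my-job.jar

    # or start fresh, without any state
    flink run -d my-job.jar

Either way, the scheduler still needs enough free slots before the job can
run, which is where the slot request timeout from your error comes in.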

Best regards,

Martijn

[1]
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0

On Sun, Oct 2, 2022 at 3:38 AM Ori Popowski <or...@gmail.com> wrote:

> Hi,
>
> We're using Flink 2.10.2 on Google Dataproc.
>
> Lately we experience a very unusual problem: the job fails and when it's
> trying to recover we get this error:
>
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
>
> I investigated what happened and I saw that the failure is caused by a
> heartbeat timeout to one of the containers. I looked at the container's
> logs and I saw something unusual:
>
>    1. Eight minutes before the heartbeat timeout the logs show connection
>    problems to the Confluent Kafka topic and also to Datadog, which means
>    there's a network issue with the whole node or just the specific container.
>    2. The container logs disappear at this point, but the node logs show
>    multiple Garbage Collection pauses, ranging from 10 seconds to 215 (!)
>    seconds.
>
> It looks like right after the network issue the node itself gets into an
> endless GC phase. My theory is that the slots are not fulfillable because
> the node is effectively unavailable while it is stuck in that endless GC.
>
> I want to note that we've been running this job for months without any
> issues. The issues started one month ago arbitrarily, not following a Flink
> version upgrade, job code upgrade, change in amount or type of data being
> processed, or a Dataproc image version change.
>
> Attached are the job manager logs, container logs, and node logs.
>
> How can we recover from this issue?
>
> Thanks!
>
>