Posted to user@flink.apache.org by Hemanga Borah <bo...@gmail.com> on 2022/07/31 17:47:01 UTC

Issues with Flink scheduler?

Hello guys,
 We have been seeing an issue with our Flink applications. Our applications
run fine for several hours, and then we see an error/exception like so:

java.util.concurrent.CompletionException: java.util.concurrent.CompletionException:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not acquire the minimum required resources.

For some applications, this error/exception appears once and stays in the
history for a while, but the job recovers. However, for other
applications, we see this error thrown repeatedly, and the application gets
into a crash loop.

Since our applications run fine for several hours before we see this
message, our suspicion is that when the crash happens, the job manager
aggressively tries to restart the job and is not able to acquire enough
resources because the previous job has not been cleaned up yet.

Has anyone else been seeing this issue? If so, what did you guys try to fix
it?

Thanks,
HKB

Re: Issues with Flink scheduler?

Posted by Weihua Hu <hu...@gmail.com>.

Hi Hemanga,

> Could not acquire the minimum required resources

This log message only shows that there are not enough task managers to
schedule your job.
Based on your description, there may have been a problem creating the
task managers. You could check the status of the task manager pods
when this problem recurs.
The full jobmanager.log would also be helpful for debugging the problem.
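
[Editor's sketch] Before sharing the full jobmanager.log, it can help to
see whether the exception shows up once or repeatedly, which is the
distinction drawn in the original post. The following is a minimal sketch
only: the class and method names are hypothetical, the log path is taken
from the command line, and no particular log line format is assumed
beyond the exception class name appearing in the line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Flink: finds lines naming the exception. */
final class NoResourceScan {

    // Class name taken from the stack trace quoted in this thread.
    static final String MARKER = "NoResourceAvailableException";

    /** Returns the 1-based line numbers of all lines that mention MARKER. */
    static List<Integer> matchingLineNumbers(List<String> logLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            if (logLines.get(i).contains(MARKER)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a local copy of the JobManager log.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<Integer> hits = matchingLineNumbers(lines);
        System.out.println(hits.size() + " occurrence(s) at lines " + hits);
    }
}
```

A single hit would be consistent with the case where the job recovers,
while many hits in quick succession would be consistent with the crash
loop. Either way, the surrounding log lines are still needed to find the
cause.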

Best,
Weihua


On Tue, Aug 2, 2022 at 2:40 AM Hemanga Borah <bo...@gmail.com>
wrote:

> We are currently using version 1.14. The final manifestation of the issue
> shows up as the trace I pasted above, and then the job keeps restarting.
> When we trace back, we see various exceptions depending on the job. For
> example, in one of the jobs, some tasks were failing due to out-of-memory
> exceptions. We resolve the issue by deleting all the task manager pods from
> the Kubernetes cluster. As soon as we delete all task managers, new pods
> are created and the job starts up normally. I feel the reason behind this
> is that the scheduler tries to start up the new job very aggressively, and
> so it is not able to find enough resources.

Re: Issues with Flink scheduler?

Posted by Hemanga Borah <bo...@gmail.com>.

We are currently using version 1.14. The final manifestation of the issue
shows up as the trace I pasted above, and then the job keeps restarting.
When we trace back, we see various exceptions depending on the job. For
example, in one of the jobs, some tasks were failing due to out-of-memory
exceptions. We resolve the issue by deleting all the task manager pods from
the Kubernetes cluster. As soon as we delete all task managers, new pods
are created and the job starts up normally. I feel the reason behind this
is that the scheduler tries to start up the new job very aggressively, and
so it is not able to find enough resources.
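
[Editor's sketch] If the hypothesis about overly aggressive restarts is
right (nothing in this thread confirms it), one general remedy is to
space restart attempts further apart after each failure. The sketch below
only illustrates that idea with a capped exponential backoff. It is not
Flink's restart-strategy implementation, and the class name, method names,
and all numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a capped exponential backoff, not Flink code. */
final class RestartBackoffSketch {

    /**
     * Delay before restart attempt `attempt` (0-based):
     * initialMillis * multiplier^attempt, capped at maxMillis.
     */
    static long delayMillis(int attempt, long initialMillis,
                            double multiplier, long maxMillis) {
        double delay = initialMillis * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxMillis);
    }

    /** Delays for the first `attempts` consecutive restart attempts. */
    static List<Long> schedule(int attempts, long initialMillis,
                               double multiplier, long maxMillis) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            delays.add(delayMillis(i, initialMillis, multiplier, maxMillis));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Hypothetical values: start at 1 s, double each time, cap at 60 s.
        // Prints [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
        System.out.println(schedule(8, 1_000L, 2.0, 60_000L));
    }
}
```

With these made-up values, the eighth consecutive attempt waits a full
minute instead of retrying immediately, which would give terminated task
manager pods more time to be replaced. Whether that actually resolves
the crash loop described here is untested.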

On Sun, Jul 31, 2022 at 6:59 PM Lijie Wang <wa...@gmail.com> wrote:

> Hi,
>
> Which version are you using? Has any job failover occurred? It would
> help if you could provide the full JobManager (JM) log.
>
> Best,
> Lijie

Re: Issues with Flink scheduler?

Posted by Lijie Wang <wa...@gmail.com>.

Hi,

Which version are you using? Has any job failover occurred? It would
help if you could provide the full JobManager (JM) log.

Best,
Lijie

On Mon, Aug 1, 2022 at 1:47 AM Hemanga Borah <bo...@gmail.com> wrote:

> Hello guys,
>  We have been seeing an issue with our Flink applications. Our
> applications run fine for several hours, and then we see an error/exception
> like so:
>
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not acquire the minimum required resources.
>
> For some applications, this error/exception appears once and stays in the
> history for a while, but the job recovers. However, for other
> applications, we see this error thrown repeatedly, and the application gets
> into a crash loop.
>
> Since our applications run fine for several hours before we see this
> message, our suspicion is that when the crash happens, the job manager
> aggressively tries to restart the job and is not able to acquire enough
> resources because the previous job has not been cleaned up yet.
>
> Has anyone else been seeing this issue? If so, what did you guys try to
> fix it?
>
> Thanks,
> HKB
>
>