Posted to user@flink.apache.org by qi luo <lu...@gmail.com> on 2019/03/29 12:09:31 UTC

Infinitely requesting Yarn containers in Flink 1.5

Hello,

Today we encountered an issue where our Flink job requested Yarn containers infinitely. In the JM log below, there were errors when starting TMs (caused by underlying HDFS errors), so the allocated containers failed and the job kept requesting new containers. The failed containers were also not returned to Yarn, so this job quickly exhausted our Yarn resources.

Is there any way we can avoid such behavior? Thank you!

————————
JM log:

INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : xxx.yyy
ERROR org.apache.flink.yarn.YarnResourceManager                     - Could not start TaskManager in container container_e12345.
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
....
INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. Number pending requests 19.
INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e195_1553781735010_27100_01_000136 - Remaining pending container requests: 19
————————

Thanks,
Qi
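
The replies below point to FLINK-10868; the fix discussed there is essentially to bound how many failed container starts may trigger replacement requests, instead of retrying forever. As a rough illustration of that idea only: the class and method names below are hypothetical and are not Flink's actual YarnResourceManager API.

————————
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: a guard that stops re-requesting containers after
// too many consecutive TaskManager start failures. Hypothetical names; this is
// not Flink's actual YarnResourceManager code.
public class ContainerRequestGuard {

    private final int maxConsecutiveStartFailures;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    public ContainerRequestGuard(int maxConsecutiveStartFailures) {
        this.maxConsecutiveStartFailures = maxConsecutiveStartFailures;
    }

    // Reset the failure counter once a TaskManager container starts successfully.
    public void onContainerStarted() {
        consecutiveFailures.set(0);
    }

    // Record a failed container start (e.g. the "Unauthorized request to start
    // container" error above) and report whether a replacement may still be requested.
    public boolean mayRequestReplacement() {
        return consecutiveFailures.incrementAndGet() <= maxConsecutiveStartFailures;
    }

    public static void main(String[] args) {
        ContainerRequestGuard guard = new ContainerRequestGuard(3);
        // Simulate the loop from the JM log above: every allocated container fails to start.
        for (int attempt = 1; attempt <= 5; attempt++) {
            boolean retry = guard.mayRequestReplacement();
            System.out.println("start attempt " + attempt + " failed, request replacement: " + retry);
            if (!retry) {
                System.out.println("Failure limit reached, failing the job instead of requesting containers forever.");
                break;
            }
        }
    }
}
————————

With such a limit in place the job fails fast with a clear error instead of exhausting the Yarn queue.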

Re: Infinitely requesting Yarn containers in Flink 1.5

Posted by qi luo <lu...@gmail.com>.
Thanks Rong, I will follow that issue.

> On Mar 30, 2019, at 6:42 AM, Rong Rong <walterddr@gmail.com> wrote:
> 
> Hi Qi,
> 
> I think the problem may be related to another similar problem reported in a previous JIRA [1]. I think a PR is also in discussion.
> 
> Thanks,
> Rong
> 
> [1] https://issues.apache.org/jira/browse/FLINK-10868
> On Fri, Mar 29, 2019 at 5:09 AM qi luo <luoqi.bd@gmail.com> wrote:
> Hello,
> 
> Today we encountered an issue where our Flink job requested Yarn containers infinitely. In the JM log below, there were errors when starting TMs (caused by underlying HDFS errors), so the allocated containers failed and the job kept requesting new containers. The failed containers were also not returned to Yarn, so this job quickly exhausted our Yarn resources.
> 
> Is there any way we can avoid such behavior? Thank you!
> 
> ————————
> JM log:
> 
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
> INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : xxx.yyy
> ERROR org.apache.flink.yarn.YarnResourceManager                     - Could not start TaskManager in container container_e12345.
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
> ....
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. Number pending requests 19.
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e195_1553781735010_27100_01_000136 - Remaining pending container requests: 19
> ————————
> 
> Thanks,
> Qi


Re: Infinitely requesting Yarn containers in Flink 1.5

Posted by Peter Huang <hu...@gmail.com>.
Hi Qi,

The current version of the PR is runnable in production, but according to
Till's suggestion it needs one more round of changes.


Best Regards
Peter Huang

On Fri, Mar 29, 2019 at 3:42 PM Rong Rong <wa...@gmail.com> wrote:

> Hi Qi,
>
> I think the problem may be related to another similar problem reported in
> a previous JIRA [1]. I think a PR is also in discussion.
>
> Thanks,
> Rong
>
> [1] https://issues.apache.org/jira/browse/FLINK-10868
>
> On Fri, Mar 29, 2019 at 5:09 AM qi luo <lu...@gmail.com> wrote:
>
>> Hello,
>>
>> Today we encountered an issue where our Flink job requested Yarn
>> containers infinitely. In the JM log below, there were errors when
>> starting TMs (caused by underlying HDFS errors), so the allocated
>> containers failed and the job kept requesting new containers. The failed
>> containers were also not returned to Yarn, so this job quickly
>> exhausted our Yarn resources.
>>
>> Is there any way we can avoid such behavior? Thank you!
>>
>> ————————
>> JM log:
>>
>> INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
>> INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
>> INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : xxx.yyy
>> ERROR org.apache.flink.yarn.YarnResourceManager                     - Could not start TaskManager in container container_e12345.
>> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
>> ....
>> INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. Number pending requests 19.
>> INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e195_1553781735010_27100_01_000136 - Remaining pending container requests: 19
>> ————————
>>
>> Thanks,
>> Qi
>>
>

Re: Infinitely requesting Yarn containers in Flink 1.5

Posted by Rong Rong <wa...@gmail.com>.
Hi Qi,

I think the problem may be related to another similar problem reported in a
previous JIRA [1]. I think a PR is also in discussion.

Thanks,
Rong

[1] https://issues.apache.org/jira/browse/FLINK-10868

On Fri, Mar 29, 2019 at 5:09 AM qi luo <lu...@gmail.com> wrote:

> Hello,
>
> Today we encountered an issue where our Flink job requested Yarn
> containers infinitely. In the JM log below, there were errors when
> starting TMs (caused by underlying HDFS errors), so the allocated
> containers failed and the job kept requesting new containers. The failed
> containers were also not returned to Yarn, so this job quickly
> exhausted our Yarn resources.
>
> Is there any way we can avoid such behavior? Thank you!
>
> ————————
> JM log:
>
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
> INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : xxx.yyy
> ERROR org.apache.flink.yarn.YarnResourceManager                     - Could not start TaskManager in container container_e12345.
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
> ....
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. Number pending requests 19.
> INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e195_1553781735010_27100_01_000136 - Remaining pending container requests: 19
> ————————
>
> Thanks,
> Qi
>