Posted to user@flink.apache.org by Borys Gogulski <bo...@mapp.com> on 2018/10/03 09:35:56 UTC

Flink 1.5.2 - excessive amount of container requests, Received new/Returning excess container "flood"

Hey,

We're running Flink 1.5.2 (I know there's 1.5.4 and 1.6.1) on YARN for some jobs we're processing. It's a "long-running" container to which we're submitting jobs - all jobs submitted to that container have a parallelism of 32 (to be precise: in this job there are 8 subtasks with parallelism 32 and one subtask with parallelism 1), and we run at most 8 of them. TMs are set to have only one slot and 6 GB RAM each.
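
For reference, the session and jobs are started roughly like this (a minimal sketch with placeholder paths and JM memory, not our exact submit commands):

    # long-running YARN session: one slot per TM, 6 GB per TM
    ./bin/yarn-session.sh -d -s 1 -tm 6144 -jm 2048

    # each job is then submitted against that session with parallelism 32
    ./bin/flink run -p 32 /path/to/our-job.jar
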
In the beginning, when using Flink 1.5.0 and 1.5.1 with the "on-demand" resources policy, we noticed that more containers were spawned than required, but with Flink 1.5.2 it "stabilized" - there were obviously some containers kept for a while after a job finished (when no additional job was submitted to take those resources), but the overhead wasn't big, so we were "all good".
And here's the plot twist.
For a couple of days now we've been witnessing situations in which spawning one job makes Flink request a couple of hundred TMs. Additionally, in the JM's logs we can find dozens of lines like:
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594295 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594295.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594300 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594300.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594303 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594303.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594304 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594304.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594334 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594334.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594337 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594337.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594152 - Remaining pending container requests: 0
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594152.
2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e96_1538374332137_0793_01_594410 - Remaining pending container requests: 0
2018-10-03 11:08:27,187 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e96_1538374332137_0793_01_594410.
The only change made last week seems to be adding 5 new nodes to the YARN cluster.
Any ideas why it's requesting so many containers? Any ideas why there's this "Received/Returning" flood? Just now one job was started and all of a sudden 352 containers were requested from YARN (also almost maxing out our YARN queue's RAM).

We're also experiencing JM hangs (we can't view the UI and TMs can't communicate with the JM), but first I'd like to resolve the above "issue" as it might be the cause of the rest of our problems.

Best regards,
Borys Gogulski

Re: Flink 1.5.2 - excessive amount of container requests, Received new/Returning excess container "flood"

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Borys,

I remember that another user reported a similar issue recently [1]; attached
to the ticket you can find his log file. If I recall correctly, we concluded
that YARN returned the containers very quickly. At the time, Flink's debug
level logs were inconclusive because we did not log the reason why the
container was returned, and the user could not provide us with the YARN logs.
In 1.5.4, we improved the logging [2]. Hence, it would be good if you could
reproduce this with debug level logging using Flink 1.5.4.
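
For example, one way to get that output is to add a logger entry for the YARN
classes to conf/log4j.properties before starting the session (a minimal sketch,
assuming the default log4j 1.x configuration shipped with the Flink
distribution; keep your existing rootLogger/appender lines):

    # conf/log4j.properties
    # DEBUG only for the YARN-related classes ...
    log4j.logger.org.apache.flink.yarn=DEBUG
    # ... or, much noisier, raise everything to DEBUG:
    # log4j.rootLogger=DEBUG, file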

You could also try the vanilla Hadoop distribution, or Flink 1.6.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10104
[2] https://issues.apache.org/jira/browse/FLINK-10137

On Tue, Oct 9, 2018 at 5:55 PM Borys Gogulski <bo...@mapp.com>
wrote:

> [quoted message trimmed; the full reply appears below in this thread]

Re: Flink 1.5.2 - excessive amount of container requests, Received new/Returning excess container "flood"

Posted by Borys Gogulski <bo...@mapp.com>.
Hey guys,

thanks for the replies.
1. "Requesting new TaskExecutor" looks fine as it's exactly 32 as is jobs'
parallelism set.
The weird thing is that after those 32 containers requested and received we
have this "flood" of 'Received new container/Returning excess container`
(and as shown below it's actually doing something on YARN side)
Where does those come from?
2. I felt that DEBUG will be needed, we'll see what we can do about it.
3. Yes, all in favor for upgrading to 1.5.4. But as Gary mentioned there
seems to be no fixes that could heal it (I was reading release notes
previous to posting this thread ; )).
4. Hadoop: 2.6.0+cdh5.14.0

Here are the logs for one of the "excess" containers:
1. Flink JM
2018-10-09 17:35:33,493 INFO  org.apache.flink.yarn.YarnResourceManager                    
- Received new container: container_e96_1538374332137_3071_01_2485560 -
Remaining pending container requests: 0
2018-10-09 17:35:33,493 INFO  org.apache.flink.yarn.YarnResourceManager                    
- Returning excess container container_e96_1538374332137_3071_01_2485560.
2. YARN
2018-10-09 17:35:33,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e96_1538374332137_3071_01_2485560 Container Transitioned from NEW
to ALLOCATED
2018-10-09 17:35:33,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user       
OPERATION=AM Allocated Container        TARGET=SchedulerApp    
RESULT=SUCCESS  APPID=application_1538374332137_3071      
CONTAINERID=container_e96_1538374332137_3071_01_2485560
2018-10-09 17:35:33,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e96_1538374332137_3071_01_2485560 of capacity
<memory:6144, vCores:1> on host server:44142, which has 5 containers,
<memory:30720, vCores:5> used and <memory:2048, vCores:11> available after
allocation
2018-10-09 17:35:33,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
assignedContainer application attempt=appattempt_1538374332137_3071_000001
container=Container: [ContainerId:
container_e96_1538374332137_3071_01_2485560, NodeId: server:44142,
NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority:
0, Token: null, ] queue=queue: capacity=0.5, absoluteCapacity=0.5,
usedResources=<memory:2353152, vCores:383>, usedCapacity=1.9947916,
absoluteUsedCapacity=0.9973958, numApps=2, numContainers=383
clusterResource=<memory:2359296, vCores:1152>
2018-10-09 17:35:33,485 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e96_1538374332137_3071_01_2485560 Container Transitioned from
ALLOCATED to ACQUIRED
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e96_1538374332137_3071_01_2485560 Container Transitioned from
ACQUIRED to RELEASED
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Completed container: container_e96_1538374332137_3071_01_2485560 in state:
RELEASED event:RELEASED
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user       
IP=ip      OPERATION=AM Released Container TARGET=SchedulerApp    
RESULT=SUCCESS  APPID=application_1538374332137_3071      
CONTAINERID=container_e96_1538374332137_3071_01_2485560
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Released container container_e96_1538374332137_3071_01_2485560 of capacity
<memory:6144, vCores:1> on host server:44142, which currently has 0
containers, <memory:0, vCores:0> used and <memory:32768, vCores:16>
available, release resources=true
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_e96_1538374332137_3071_01_2485560, NodeId: server:44142,
NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority:
0, Token: Token { kind: ContainerToken, service: ip:44142 }, ] queue=queue:
capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:589824,
vCores:96>, usedCapacity=0.5, absoluteUsedCapacity=0.25, numApps=2,
numContainers=96 cluster=<memory:2359296, vCores:1152>
2018-10-09 17:35:38,532 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1538374332137_3071_000001 released container
container_e96_1538374332137_3071_01_2485560 on node: host: server:44142
#containers=0 available=32768 used=0 with event: RELEASED

Best regards,
Borys Gogulski




Re: Flink 1.5.2 - excessive amount of container requests, Received new/Returning excess container "flood"

Posted by Till Rohrmann <tr...@apache.org>.
Hi Borys,

If possible, the complete logs of the JM (DEBUG log level) would be helpful
to further debug the problem. Have there been recovery operations lately?

Cheers,
Till

On Sat, Oct 6, 2018 at 11:15 AM Gary Yao <ga...@data-artisans.com> wrote:

> [quoted message trimmed; Gary's reply appears in full below in this thread]

Re: Flink 1.5.2 - excessive amount of container requests, Received new/Returning excess container "flood"

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Borys,

To debug how many containers Flink is requesting, you can look out for the
log
statement below [1]:

    Requesting new TaskExecutor container with resources [...]
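
Counting those lines tells you how many containers Flink actually asked for,
e.g. with something like this (a rough sketch; adjust the log file name to
your setup):

    grep -c "Requesting new TaskExecutor container" jobmanager.log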

If you need help debugging, can you attach the full JM logs (preferably on
DEBUG level)? Would it be possible for you to test against 1.5.3 and 1.5.4?
However, I am not aware of any related issues that were fixed for 1.5.3 or
1.5.4. What is the Hadoop distribution that you are using?

Best,
Gary

[1]
https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L454
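
For context, the code around that line handles allocated containers roughly as
follows. This is a simplified paraphrase of
org.apache.flink.yarn.YarnResourceManager#onContainersAllocated in 1.5.x, not
the exact source; startTaskExecutorInContainer stands in for the actual launch
code, and fields, error handling and the heartbeat-interval adjustment are
omitted:

    public void onContainersAllocated(List<Container> containers) {
        for (Container container : containers) {
            log.info("Received new container: {} - Remaining pending container requests: {}",
                container.getId(), numPendingContainerRequests);
            if (numPendingContainerRequests > 0) {
                // a request is still outstanding -> use the container for a TaskExecutor
                numPendingContainerRequests--;
                startTaskExecutorInContainer(container);
            } else {
                // nothing outstanding -> hand the container straight back to YARN
                log.info("Returning excess container {}.", container.getId());
                resourceManagerClient.releaseAssignedContainer(container.getId());
            }
        }
    }

So every container that arrives while "Remaining pending container requests" is
already 0 is logged and released immediately, which is what produces the
Received/Returning pairs in your log.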

On Wed, Oct 3, 2018 at 11:36 AM Borys Gogulski <bo...@mapp.com>
wrote:

> [quoted message trimmed; the original post appears in full at the top of this thread]