Posted to hdfs-user@hadoop.apache.org by Sharad Agarwal <sh...@apache.org> on 2016/03/23 13:20:50 UTC

Leak in RM Capacity scheduler leading to OOM

Taking a dump of the 8 GB heap shows about 18 million
org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto instances.

Similar counts exist for ApplicationAttempt and ContainerId. All of them
appear to be linked via
org.apache.hadoop.yarn.proto.YarnProtos$ContainerStatusProto, whose count
is also about 18 million.
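
For anyone wanting to reproduce the measurement, a class histogram of this
kind can be pulled with the stock JDK tools (<RM_PID> below is a
placeholder for the ResourceManager's process id):

    # live-object histogram, filtered to the YARN protobuf classes
    jmap -histo:live <RM_PID> | grep 'YarnProtos\$'
    # or capture a full dump for offline analysis
    jmap -dump:live,format=b,file=rm.hprof <RM_PID>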

On further debugging and reading the CapacityScheduler code:

It seems to add duplicate UpdatedContainerInfo entries for the completed
containers. The same dump shows about 0.5 million UpdatedContainerInfo
objects.

This issue only surfaces when the scheduler thread cannot drain the
UpdatedContainerInfo objects fast enough, which happens only in a big
cluster.
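
To make the suspected pattern concrete, here is a simplified, hypothetical
sketch (the class and method names below are illustrative stand-ins, not
the actual RMNodeImpl/CapacityScheduler code): every node heartbeat
enqueues an UpdatedContainerInfo, nothing deduplicates re-reported
completed containers, and the scheduler thread drains the queue
asynchronously; if the drain falls behind the heartbeats, the queue and
everything it references grow without bound:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative stand-in only, not the real YARN class.
    class UpdatedContainerInfoSketch {
        // Stands in for the ContainerStatusProto references the real
        // object holds, which is what pins the protobuf objects in heap.
        final List<String> completedContainers;
        UpdatedContainerInfoSketch(List<String> completed) {
            this.completedContainers = completed;
        }
    }

    class NodeUpdateQueueSketch {
        private final ConcurrentLinkedQueue<UpdatedContainerInfoSketch> nodeUpdateQueue =
                new ConcurrentLinkedQueue<>();

        // Called on every NM heartbeat. If the NM keeps re-reporting a
        // container that already completed, a fresh entry is enqueued
        // each time; nothing here deduplicates.
        void onHeartbeat(List<String> reportedCompleted) {
            nodeUpdateQueue.add(
                new UpdatedContainerInfoSketch(new ArrayList<>(reportedCompleted)));
        }

        // Called from the scheduler thread. On a big cluster this can run
        // less often than heartbeats arrive, so the queue backs up.
        List<UpdatedContainerInfoSketch> pullContainerUpdates() {
            List<UpdatedContainerInfoSketch> drained = new ArrayList<>();
            UpdatedContainerInfoSketch info;
            while ((info = nodeUpdateQueue.poll()) != null) {
                drained.add(info);
            }
            return drained;
        }
    }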

Has anyone noticed the same? We are running Hadoop 2.6.0.

Sharad

RE: Leak in RM Capacity scheduler leading to OOM

Posted by Rohith Sharma K S <ro...@huawei.com>.
I think you might be hitting YARN-2997. That issue fixes the sending of duplicated completed containers to the RM.
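
For what it is worth, the idea behind that fix can be sketched like this (a
hypothetical simplification, not the actual YARN-2997 patch; the names are
made up): the NodeManager remembers which completed containers the RM has
already acknowledged and stops re-reporting them on subsequent heartbeats:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical simplification of the YARN-2997 idea.
    class CompletedContainerReporter {
        private final Set<String> ackedByRM = new HashSet<>();

        // Report only completed containers the RM has not yet
        // acknowledged, so the same completion is not re-sent on every
        // heartbeat.
        List<String> toReport(List<String> completed) {
            List<String> fresh = new ArrayList<>();
            for (String id : completed) {
                if (!ackedByRM.contains(id)) {
                    fresh.add(id);
                }
            }
            return fresh;
        }

        // Invoked when the RM's heartbeat response acknowledges
        // containers.
        void onAck(List<String> acked) {
            ackedByRM.addAll(acked);
        }
    }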

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Sharad Agarwal [mailto:sharad@apache.org] 
Sent: 24 March 2016 08:58
To: Sharad Agarwal
Cc: yarn-dev@hadoop.apache.org; user@hadoop.apache.org
Subject: Re: Leak in RM Capacity scheduler leading to OOM

Ticket for this is here ->
https://issues.apache.org/jira/browse/YARN-4852

Re: Leak in RM Capacity scheduler leading to OOM

Posted by Sharad Agarwal <sh...@apache.org>.
Ticket for this is here ->
https://issues.apache.org/jira/browse/YARN-4852
