Posted to dev@slider.apache.org by Thomas Weise <th...@gmail.com> on 2015/05/12 18:34:16 UTC

Container recovery not working on CDH with yarn.component.placement.policy=1

We are testing KOYA on CDH 5.4. After we kill a container, Slider asks for
the same host again, as expected, but the request is never filled and the
container cannot be redeployed. We see the same behavior on CDH with
DataTorrent, so it looks like a CDH bug.

Is anyone else running Slider on CDH and seeing the same behavior? Any
insight into whether this is a CDH configuration issue or a fair scheduler bug?

Thanks,
Thomas

Re: Container recovery not working on CDH with yarn.component.placement.policy=1

Posted by Thomas Weise <th...@gmail.com>.

Jean,

Curious what your findings will be with the capacity scheduler. The cluster
I'm using has the fair scheduler (CDH default) and we see this issue with
other applications also. Works fine on HDP 2.2.

The resource manager is logging at INFO level only and I cannot muck with
it at this time. There is nothing in the log indicating a problem with the
container request.

Thomas

On Tue, May 19, 2015 at 11:14 AM, Jean-Baptiste Note <jb...@gmail.com>
wrote:

> Hi Thomas,
>
> I'm also testing on CDH 5.4, so I'll be able to attempt duplicating this
> after my vacation (next week).
> I'm using the capacity scheduler on a secure cluster though.
>
> You probably should be able to see what's going on by increasing the log
> verbosity of the RM and/or NM -- I don't know if debug level will trace the
> RPCs, but I guess it should.
> Of course you could also log client side, but you may be less familiar with
> the code.
>
> Kind regards,
> JB
>

Re: Container recovery not working on CDH with yarn.component.placement.policy=1

Posted by Jean-Baptiste Note <jb...@gmail.com>.

Hi Thomas,

I'm also testing on CDH 5.4, so I'll be able to attempt duplicating this
after my vacation (next week).
I'm using the capacity scheduler on a secure cluster though.

You probably should be able to see what's going on by increasing the log
verbosity of the RM and/or NM -- I don't know if debug level will trace the
RPCs, but I guess it should.
Of course you could also log client side, but you may be less familiar with
the code.

Kind regards,
JB
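
One way to raise log verbosity without editing configuration files is the /logLevel servlet that Hadoop daemons generally expose on their HTTP port. The sketch below only builds the URL; the host, port, and choice of logger are assumptions for illustration, and a secure cluster may require authentication before the request is accepted.

```python
from urllib.parse import urlencode

# Logger for the fair scheduler; chosen here as an example of what to trace.
FAIR_SCHEDULER_LOGGER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def log_level_url(daemon_http_address, logger, level="DEBUG"):
    """Build a /logLevel URL for a daemon's HTTP address (host:port).

    The address is a placeholder supplied by the caller; this function
    does not contact the daemon.
    """
    query = urlencode({"log": logger, "level": level})
    return "http://%s/logLevel?%s" % (daemon_http_address, query)


if __name__ == "__main__":
    # rm.example.com:8088 is a hypothetical ResourceManager web address.
    print(log_level_url("rm.example.com:8088", FAIR_SCHEDULER_LOGGER))
```

Opening the resulting URL in a browser (or with an HTTP client) changes the level only until the daemon restarts, which makes it easy to revert.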

Re: Container recovery not working on CDH with yarn.component.placement.policy=1

Posted by Gour Saha <gs...@hortonworks.com>.

Thomas,
The resources.json looks OK. Can you send me the following logs so that I can
look into it further:

- Slider AM log
- Slider agent log (for the container that was killed)
- RM log
- NM log from the node where Slider agent (that was killed) was running

-Gour

On 5/19/15, 8:43 AM, "Thomas Weise" <th...@gmail.com> wrote:

>All resources are freed up. The AM requests the replacement container and
>nothing happens after that. Please see:
>
>https://www.dropbox.com/sh/8ub0jedh60cgys4/AACPftofPcdhD5Sb2XADRMTga?dl=0
>
>resources.json
>
>{
>  "schema" : "http://example.org/specification/v2.0.0",
>  "metadata" : {
>  },
>  "global" : {
>    "yarn.container.failure.threshold":"10",
>    "yarn.container.failure.window.hours":"1"
>  },
>  "components" : {
>    "broker" : {
>      "yarn.role.priority" : "1",
>      "yarn.component.instances" : "3",
>      "yarn.memory" : "768",
>      "yarn.vcores" : "1",
>      "yarn.component.placement.policy":"1"
>    },
>    "slider-appmaster" : {
>    }
>  }
>}
>
>
>On Wed, May 13, 2015 at 5:03 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> Can you check the resources (memory, cpu) available in the host, after
>> killing the container? Is it freed? Can you hit the RM UI and share what
>> you see in the "Cluster Metrics" table for that node?
>>
>> Also, if possible please share your resources.json.
>>
>> -Gour
>>
>> On 5/12/15, 9:34 AM, "Thomas Weise" <th...@gmail.com> wrote:
>>
>> >We are testing KOYA on CDH 5.4. We see that after killing the container
>> >Slider as expected will ask for the same host. The request is never filled
>> >and the container cannot be redeployed. We see this behavior on CDH with
>> >DataTorrent also, it looks like a CDH bug.
>> >
>> >Anyone else trying to run Slider on CDH and sees the same behavior? Any
>> >insight on whether that is a CDH configuration issue or fair scheduler
>> >bug?
>> >
>> >Thanks,
>> >Thomas
>>
>>

Re: Container recovery not working on CDH with yarn.component.placement.policy=1

Posted by Thomas Weise <th...@gmail.com>.

All resources are freed up. The AM requests the replacement container and
nothing happens after that. Please see:

https://www.dropbox.com/sh/8ub0jedh60cgys4/AACPftofPcdhD5Sb2XADRMTga?dl=0

resources.json

{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : {
  },
  "global" : {
    "yarn.container.failure.threshold":"10",
    "yarn.container.failure.window.hours":"1"
  },
  "components" : {
    "broker" : {
      "yarn.role.priority" : "1",
      "yarn.component.instances" : "3",
      "yarn.memory" : "768",
      "yarn.vcores" : "1",
      "yarn.component.placement.policy":"1"
    },
    "slider-appmaster" : {
    }
  }
}
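
For readers checking their own configuration, here is a minimal sketch that reads a resources.json like the one above and reports the placement policy per component. The mapping of numeric values to names (1 meaning strict placement on the previously used host) is an assumption based on the Slider placement documentation and should be verified against the Slider version in use; the function and constant names are hypothetical.

```python
import json

# Assumed meaning of yarn.component.placement.policy values; verify for your
# Slider release before relying on it.
PLACEMENT_POLICIES = {
    0: "default",
    1: "strict",
    2: "anywhere",
    4: "anti-affinity",
}

# Trimmed copy of the resources.json posted in this thread.
RESOURCES_JSON = """
{
  "components": {
    "broker": {
      "yarn.component.instances": "3",
      "yarn.component.placement.policy": "1"
    },
    "slider-appmaster": {}
  }
}
"""


def placement_summary(resources_json):
    """Return {component name: policy name} for a resources.json string."""
    spec = json.loads(resources_json)
    summary = {}
    for name, settings in spec.get("components", {}).items():
        # A missing key is treated as the default policy (0).
        policy = int(settings.get("yarn.component.placement.policy", "0"))
        summary[name] = PLACEMENT_POLICIES.get(policy, "unknown (%d)" % policy)
    return summary


if __name__ == "__main__":
    for component, policy in placement_summary(RESOURCES_JSON).items():
        print(component, policy)
```

With a strict policy the replacement request names only the original host, so a scheduler that never satisfies that exact request would leave the component short one instance, which matches the symptom described here.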


On Wed, May 13, 2015 at 5:03 PM, Gour Saha <gs...@hortonworks.com> wrote:

> Can you check the resources (memory, cpu) available in the host, after
> killing the container? Is it freed? Can you hit the RM UI and share what
> you see in the "Cluster Metrics" table for that node?
>
> Also, if possible please share your resources.json.
>
> -Gour
>
> On 5/12/15, 9:34 AM, "Thomas Weise" <th...@gmail.com> wrote:
>
> >We are testing KOYA on CDH 5.4. We see that after killing the container
> >Slider as expected will ask for the same host. The request is never filled
> >and the container cannot be redeployed. We see this behavior on CDH with
> >DataTorrent also, it looks like a CDH bug.
> >
> >Anyone else trying to run Slider on CDH and sees the same behavior? Any
> >insight on whether that is a CDH configuration issue or fair scheduler
> >bug?
> >
> >Thanks,
> >Thomas
>
>

Re: Container recovery not working on CDH with yarn.component.placement.policy=1

Posted by Gour Saha <gs...@hortonworks.com>.

Can you check the resources (memory, CPU) available on the host after
killing the container? Are they freed? Can you hit the RM UI and share what
you see in the "Cluster Metrics" table for that node?

Also, if possible please share your resources.json.
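
The same per-node numbers can also be pulled from the ResourceManager REST API instead of the UI. The sketch below assumes the /ws/v1/cluster/nodes endpoint and the field names nodeHostName, state, usedMemoryMB, availMemoryMB, and numContainers; field availability can differ between Hadoop versions, so missing fields come back as None. The RM address and function names are placeholders.

```python
import json
from urllib.request import urlopen

FIELDS = ("state", "usedMemoryMB", "availMemoryMB", "numContainers")


def fetch_nodes(rm_address):
    """Fetch the node list from an RM web address (host:port, hypothetical)."""
    with urlopen("http://%s/ws/v1/cluster/nodes" % rm_address) as resp:
        return json.load(resp)


def node_resources(nodes_response, hostname):
    """Pick the resource fields for one host out of a parsed nodes response.

    Returns None if the host is not listed.
    """
    nodes = (nodes_response.get("nodes") or {}).get("node") or []
    for node in nodes:
        if node.get("nodeHostName") == hostname:
            return {field: node.get(field) for field in FIELDS}
    return None


if __name__ == "__main__":
    # Both names below are placeholders for a real cluster.
    response = fetch_nodes("rm.example.com:8088")
    print(node_resources(response, "worker1.example.com"))
```

Comparing the output before and after killing the container shows whether the node's memory and container count actually drop back.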

-Gour

On 5/12/15, 9:34 AM, "Thomas Weise" <th...@gmail.com> wrote:

>We are testing KOYA on CDH 5.4. We see that after killing the container
>Slider as expected will ask for the same host. The request is never filled
>and the container cannot be redeployed. We see this behavior on CDH with
>DataTorrent also, it looks like a CDH bug.
>
>Anyone else trying to run Slider on CDH and sees the same behavior? Any
>insight on whether that is a CDH configuration issue or fair scheduler
>bug?
>
>Thanks,
>Thomas