Posted to dev@slider.apache.org by "David.Serafini" <Da...@target.com> on 2018/04/04 01:48:53 UTC
slider 0.92 question
I've been using Slider 0.91 for a year and it's been very stable lately.
I built 0.92 to test it, and my YARN containers are dying after 10 minutes.
Slider restarts them successfully, but this isn't acceptable behavior.
Any thoughts on what could be going on?
I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket IDs.
Is there some configuration in my job that I should have changed to use 0.92?
Thanks,
-david
Re: [EXTERNAL] Re: slider 0.92 question
Posted by Manoj Samel <ma...@gmail.com>.
David,
When the local disks on the host running the NodeManager are more than 90% full,
the NodeManager reports a message like "10/12 local-dirs are bad:". In such cases,
the NodeManager service keeps running but does not service any
applications.
Check whether the host has multiple disks that are more than 90% full.
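
As a rough way to run that check on a suspect host, here is a minimal Python sketch. It assumes the 90% figure mentioned above (the default of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage) and uses a hypothetical directory list; the helper names are illustrative, and the NodeManager's own health checker may compute utilization somewhat differently.

```python
import shutil

# Assumed threshold: the 90% per-disk utilization limit quoted in this thread.
MAX_UTILIZATION_PERCENT = 90.0


def utilization_percent(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def find_full_dirs(paths, threshold=MAX_UTILIZATION_PERCENT,
                   usage_fn=shutil.disk_usage):
    """Return (path, percent) pairs for directories above the threshold."""
    full = []
    for path in paths:
        try:
            usage = usage_fn(path)  # named tuple with total, used, free
        except OSError as exc:
            # Missing or unreadable directories are reported and skipped.
            print(f"skipping {path}: {exc}")
            continue
        percent = utilization_percent(usage.total, usage.used)
        if percent > threshold:
            full.append((path, round(percent, 1)))
    return full


if __name__ == "__main__":
    # Hypothetical list: substitute the values of yarn.nodemanager.local-dirs
    # and yarn.nodemanager.log-dirs from yarn-site.xml on the affected host.
    dirs = ["/grid/0/hadoop/yarn/local", "/grid/1/hadoop/yarn/local"]
    for path, percent in find_full_dirs(dirs):
        print(f"{path}: {percent}% used")
```

Any directory this prints is a candidate for cleanup before the NodeManager will report the node healthy again.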
Hope this helps!
Manoj
On Tue, Apr 3, 2018 at 10:59 PM, Gour Saha <gs...@hortonworks.com> wrote:
> Can you check the slider agent logs and the application logs in those
> containers to see if they are failing with some exception?
>
> The fishy thing I found in the AM log is a set of messages like these saying
> "local-dirs are bad". Can you check what's going on with these directories?
>
> 2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - onNodesUpdated(1)
> 2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454
> } httpAddress: "***:8042" rackName: "/EI105" used { memory: 0
> virtual_cores: 0 } capability { memory: 364544 virtual_cores: 38 }
> node_state: NS_UNHEALTHY health_report: "10/12 local-dirs are bad:
> /grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,/
> grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,/
> grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,/
> grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,/
> grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local; 10/12 log-dirs are
> bad: /grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,/grid/2/
> hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,/grid/11/
> hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,/grid/0/
> hadoop/yarn/log,/grid/3/hadoop/yarn/log" last_health_report_time:
> 1522798707678]
>
> -Gour
>
> On 4/3/18, 10:49 PM, "David.Serafini" <Da...@target.com> wrote:
>
> I've attached what I can find.
>
>
> On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
> Can you share the logs of the dying containers and the AM to debug
> further?
>
> -Gour
Re: [EXTERNAL] Re: slider 0.92 question
Posted by Gour Saha <gs...@hortonworks.com>.
Can you check the slider agent logs and the application logs in those containers to see if they are failing with some exception?
The fishy thing I found in the AM log is a set of messages like these saying "local-dirs are bad". Can you check what's going on with these directories?
2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - onNodesUpdated(1)
2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454 } httpAddress: "***:8042" rackName: "/EI105" used { memory: 0 virtual_cores: 0 } capability { memory: 364544 virtual_cores: 38 } node_state: NS_UNHEALTHY health_report: "10/12 local-dirs are bad: /grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,/grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,/grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,/grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,/grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local; 10/12 log-dirs are bad: /grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,/grid/11/hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,/grid/0/hadoop/yarn/log,/grid/3/hadoop/yarn/log" last_health_report_time: 1522798707678]
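
To pull the affected directories out of a health report like the one above, a small parser along these lines may help. This is only a sketch: the function name is hypothetical, and the pattern is inferred from the single log line in this thread, so other report formats may not match.

```python
import re

# Pattern inferred from the log line in this thread, e.g.
# "10/12 local-dirs are bad: /a,/b; 10/12 log-dirs are bad: /c,/d"
_CLAUSE = re.compile(
    r'(\d+)/(\d+)\s+(local-dirs|log-dirs) are bad:\s*([^;"]*)'
)


def parse_health_report(report):
    """Map each directory kind to (bad_count, total_count, [paths])."""
    result = {}
    for bad, total, kind, dirs in _CLAUSE.findall(report):
        paths = [d.strip() for d in dirs.split(",") if d.strip()]
        result[kind] = (int(bad), int(total), paths)
    return result


if __name__ == "__main__":
    # Shortened sample in the same shape as the health_report value above.
    sample = (
        "2/12 local-dirs are bad: /grid/9/hadoop/yarn/local,"
        "/grid/2/hadoop/yarn/local; "
        "1/12 log-dirs are bad: /grid/6/hadoop/yarn/log"
    )
    for kind, (bad, total, paths) in parse_health_report(sample).items():
        print(f"{kind}: {bad} of {total} bad")
        for path in paths:
            print(f"  {path}")
```

The resulting path list can then be checked for free space or permission problems on the node in question.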
-Gour
On 4/3/18, 10:49 PM, "David.Serafini" <Da...@target.com> wrote:
I've attached what I can find.
On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
Can you share the logs of the dying containers and the AM to debug further?
-Gour
Re: [EXTERNAL] Re: slider 0.92 question
Posted by "David.Serafini" <Da...@target.com>.
I've attached what I can find.
On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
Can you share the logs of the dying containers and the AM to debug further?
-Gour
Re: slider 0.92 question
Posted by Gour Saha <gs...@hortonworks.com>.
Can you share the logs of the dying containers and the AM to debug further?
-Gour