
Posted to dev@slider.apache.org by "David.Serafini" <Da...@target.com> on 2018/04/04 01:48:53 UTC

slider 0.92 question

I've been using slider 0.91 for a year and it's been very stable lately.
I built 0.92 to test it and my yarn containers are dying after 10 minutes.
Slider restarts them successfully, but this isn't acceptable behavior.
Any thoughts on what could be going on?  

I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
Is there some configuration in my job that I should have changed to use 0.92?

Thanks,
-david

Re: [EXTERNAL] Re: slider 0.92 question

Posted by Manoj Samel <ma...@gmail.com>.

David,

When the local disks on a host running the NodeManager are more than 90% full,
the NodeManager reports a message like "10/12 local-dirs are bad:". In such cases,
the NodeManager service keeps running but does not service any
applications.

Check whether multiple disks on the host were more than 90% full.
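
A minimal sketch of that check, to be run on the affected host. The 90% limit is an assumption based on the usual default of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage; the directory list is copied from the health report quoted in this thread, and the function names are only illustrative. The percentage is an approximation and may differ slightly from what the NodeManager computes.

```python
import shutil
from pathlib import Path

# Assumption: the cluster uses the default per-disk utilization limit of 90%.
MAX_UTILIZATION_PERCENT = 90.0

# The ten local-dirs reported as bad in the AM log from this thread.
REPORTED_BAD_DIRS = [
    f"/grid/{n}/hadoop/yarn/local" for n in (9, 2, 1, 5, 11, 3, 8, 6, 0, 7)
]


def used_percent(total: int, free: int) -> float:
    """Percentage of the filesystem that is not available for new data."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - free) / total


def is_over_limit(total: int, free: int,
                  limit: float = MAX_UTILIZATION_PERCENT) -> bool:
    """True when utilization is strictly above the limit (an assumption)."""
    return used_percent(total, free) > limit


def check_dirs(dirs):
    """Map each directory to its used percentage, or None if it is missing."""
    report = {}
    for d in dirs:
        if not Path(d).exists():
            report[d] = None
            continue
        usage = shutil.disk_usage(d)  # named tuple: total, used, free
        report[d] = used_percent(usage.total, usage.free)
    return report


if __name__ == "__main__":
    for d, pct in check_dirs(REPORTED_BAD_DIRS).items():
        if pct is None:
            print(f"{d}: not found on this host")
        else:
            flag = "OVER LIMIT" if pct > MAX_UTILIZATION_PERCENT else "ok"
            print(f"{d}: {pct:.1f}% used ({flag})")
```

Running df -h on the same mount points gives the same kind of information by hand.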

Hope this helps!

Manoj

On Tue, Apr 3, 2018 at 10:59 PM, Gour Saha <gs...@hortonworks.com> wrote:

> Can you check the slider agent logs and the application logs in those
> containers to see if they are failing with some exception?
>
> The fishy thing I found in the AM log is messages like these, saying
> "local-dirs are bad". Can you check what's going on with these dirs?
>
> 2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - onNodesUpdated(1)
> 2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454 } httpAddress: "***:8042" rackName: "/EI105" used { memory: 0 virtual_cores: 0 } capability { memory: 364544 virtual_cores: 38 } node_state: NS_UNHEALTHY health_report: "10/12 local-dirs are bad: /grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,/grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,/grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,/grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,/grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local; 10/12 log-dirs are bad: /grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,/grid/11/hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,/grid/0/hadoop/yarn/log,/grid/3/hadoop/yarn/log" last_health_report_time: 1522798707678]
>
> -Gour
>
> On 4/3/18, 10:49 PM, "David.Serafini" <Da...@target.com> wrote:
>
>     I've attached what I can find.
>
>
>     On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>         Can you share the logs of the dying containers and the AM to debug further?
>
>         -Gour
>
>         On 4/3/18, 6:49 PM, "David.Serafini" <Da...@target.com> wrote:
>
>             I've been using slider 0.91 for a year and it's been very stable lately.
>             I built 0.92 to test it and my yarn containers are dying after 10 minutes.
>             Slider restarts them successfully, but this isn't acceptable behavior.
>             Any thoughts on what could be going on?
>
>             I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
>             Is there some configuration in my job that I should have changed to use 0.92?
>
>             Thanks,
>             -david

Re: [EXTERNAL] Re: slider 0.92 question

Posted by Gour Saha <gs...@hortonworks.com>.

Can you check the slider agent logs and the application logs in those containers to see if they are failing with some exception?

The fishy thing I found in the AM log is messages like these, saying "local-dirs are bad". Can you check what's going on with these dirs?

2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - onNodesUpdated(1)
2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454 } httpAddress: "***:8042" rackName: "/EI105" used { memory: 0 virtual_cores: 0 } capability { memory: 364544 virtual_cores: 38 } node_state: NS_UNHEALTHY health_report: "10/12 local-dirs are bad: /grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,/grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,/grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,/grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,/grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local; 10/12 log-dirs are bad: /grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,/grid/11/hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,/grid/0/hadoop/yarn/log,/grid/3/hadoop/yarn/log" last_health_report_time: 1522798707678]
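
For anyone who wants to pull the bad directories out of a health_report string like the one above, here is a rough sketch. It assumes the report always follows the "N/M local-dirs are bad: a,b; N/M log-dirs are bad: c,d" shape seen in this log; the function names and the shortened sample string are made up for illustration. The 0.25 value is, as far as I know, the default of yarn.nodemanager.disk-health-checker.min-healthy-disks, so treat it as an assumption and check your yarn-site.xml.

```python
import re

# One clause of the report, e.g. "10/12 local-dirs are bad: /a,/b".
# The path list ends at the next ';' or at the closing quote.
CLAUSE = re.compile(r'(\d+)/(\d+) (local|log)-dirs are bad: ([^;"]+)')

# Assumption: default minimum fraction of healthy disks required by YARN.
MIN_HEALTHY_FRACTION = 0.25


def parse_health_report(report: str) -> dict:
    """Return {'local': {...}, 'log': {...}} with counts and directory lists."""
    result = {}
    for bad, total, kind, paths in CLAUSE.findall(report):
        dirs = [p.strip() for p in paths.split(",") if p.strip()]
        result[kind] = {"bad": int(bad), "total": int(total), "dirs": dirs}
    return result


def healthy_fraction(entry: dict) -> float:
    """Fraction of directories of one kind that are still considered good."""
    return (entry["total"] - entry["bad"]) / entry["total"]


if __name__ == "__main__":
    # Shortened, hypothetical sample in the same shape as the real log line.
    sample = ('health_report: "2/3 local-dirs are bad: '
              '/grid/0/hadoop/yarn/local,/grid/1/hadoop/yarn/local; '
              '1/3 log-dirs are bad: /grid/0/hadoop/yarn/log"')
    for kind, entry in parse_health_report(sample).items():
        frac = healthy_fraction(entry)
        state = "below" if frac < MIN_HEALTHY_FRACTION else "at or above"
        print(f"{kind}-dirs: {entry['bad']}/{entry['total']} bad, "
              f"healthy fraction {frac:.2f} ({state} the assumed minimum)")
        for d in entry["dirs"]:
            print(f"  {d}")
```

With 10 of 12 directories bad, only 2/12 of the disks are healthy, which would be under that assumed minimum and would fit the NS_UNHEALTHY state in the log.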

-Gour

On 4/3/18, 10:49 PM, "David.Serafini" <Da...@target.com> wrote:

    I've attached what I can find.  
    
    
    On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
    
        Can you share the logs of the dying containers and the AM to debug further?
        
        -Gour
        
        On 4/3/18, 6:49 PM, "David.Serafini" <Da...@target.com> wrote:
        
            I've been using slider 0.91 for a year and it's been very stable lately.
            I built 0.92 to test it and my yarn containers are dying after 10 minutes.
            Slider restarts them successfully, but this isn't acceptable behavior.
            Any thoughts on what could be going on?  
            
            I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
            Is there some configuration in my job that I should have changed to use 0.92?
            
            Thanks,
            -david

Re: [EXTERNAL] Re: slider 0.92 question

Posted by "David.Serafini" <Da...@target.com>.

I've attached what I can find.  


On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:

    Can you share the logs of the dying containers and the AM to debug further?
    
    -Gour
    
    On 4/3/18, 6:49 PM, "David.Serafini" <Da...@target.com> wrote:
    
        I've been using slider 0.91 for a year and it's been very stable lately.
        I built 0.92 to test it and my yarn containers are dying after 10 minutes.
        Slider restarts them successfully, but this isn't acceptable behavior.
        Any thoughts on what could be going on?  
        
        I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
        Is there some configuration in my job that I should have changed to use 0.92?
        
        Thanks,
        -david

Re: slider 0.92 question

Posted by Gour Saha <gs...@hortonworks.com>.

Can you share the logs of the dying containers and the AM to debug further?

-Gour

On 4/3/18, 6:49 PM, "David.Serafini" <Da...@target.com> wrote:

    I've been using slider 0.91 for a year and it's been very stable lately.
    I built 0.92 to test it and my yarn containers are dying after 10 minutes.
    Slider restarts them successfully, but this isn't acceptable behavior.
    Any thoughts on what could be going on?  
    
    I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
    Is there some configuration in my job that I should have changed to use 0.92?
    
    Thanks,
    -david