Posted to dev@slider.apache.org by "Jonathan Maron (JIRA)" <ji...@apache.org> on 2014/12/03 17:39:12 UTC

[jira] [Comment Edited] (SLIDER-629) Slider's count of failure threshold may not be accurate or it could be a logging issue

    [ https://issues.apache.org/jira/browse/SLIDER-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233150#comment-14233150 ] 

Jonathan Maron edited comment on SLIDER-629 at 12/3/14 4:38 PM:
----------------------------------------------------------------

I need some assistance in trying to come up with the correct approach here:

1)  The method SliderAppMaster.scheduleFailureWindowResets exists but is never called.  Is that intentional?
2)  AppState.onCompletedNode calls noteFailed(), incrementing the "failed" count.  However, it isn't clear whether AppState.checkFailureThreshold() is also called as part of a container completion due to failure.  I'm finding it difficult to follow the various failure code paths, so I'm just looking for some clarification of the flow.
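
To make the two questions concrete, here is a minimal sketch of the flow one might expect: each failure increments a per-window count, a scheduled reset clears that count, and the threshold check looks only at the current window. This is not the Slider code. The class FailureWindowSketch, the resetFailureWindow() hook, the 60-minute window, and the "greater than threshold" comparison are all assumptions made for illustration; only the noteFailed() name is borrowed from the discussion above.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical per-component failure counter. This is NOT Slider's AppState
 * or RoleStatus; it only illustrates a windowed count with a threshold check.
 */
final class FailureWindowSketch {
    private final String component;
    private final int threshold;       // assumed: tripped when window count > threshold
    private final long windowMinutes;  // assumed: length of the reset window
    private final AtomicInteger failedInWindow = new AtomicInteger();
    private final AtomicInteger failedTotal = new AtomicInteger();

    FailureWindowSketch(String component, int threshold, long windowMinutes) {
        this.component = component;
        this.threshold = threshold;
        this.windowMinutes = windowMinutes;
    }

    /** Record one failed container; returns the count within the current window. */
    int noteFailed() {
        failedTotal.incrementAndGet();
        return failedInWindow.incrementAndGet();
    }

    /** Meant to be invoked by a scheduler once per window (hypothetical hook). */
    void resetFailureWindow() {
        failedInWindow.set(0);
    }

    /** Threshold check against the current window only, not the lifetime total. */
    boolean thresholdExceeded() {
        return failedInWindow.get() > threshold;
    }

    int getFailedTotal() {
        return failedTotal.get();
    }

    /** Message that reports the window alongside the counts. */
    String describe() {
        return String.format(Locale.ROOT,
            "component %s failed %d times in the current %d-minute window"
                + " (%d since start); threshold is %d",
            component, failedInWindow.get(), windowMinutes,
            failedTotal.get(), threshold);
    }

    public static void main(String[] args) {
        FailureWindowSketch counter =
            new FailureWindowSketch("HBASE_REGIONSERVER", 5, 60);
        for (int i = 0; i < 3; i++) {
            counter.noteFailed();
        }
        counter.resetFailureWindow();  // skip this call and the window never clears
        for (int i = 0; i < 3; i++) {
            counter.noteFailed();
        }
        System.out.println(counter.describe());
        System.out.println("threshold exceeded: " + counter.thresholdExceeded());
    }
}
```

In this sketch, if resetFailureWindow() is never scheduled, the window count is always equal to the lifetime count, so a long-running instance would eventually cross the threshold no matter how far apart the failures were.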

Adding [~stevel@apache.org] for review of questions above...


was (Author: jmaron):
I need some assistance in trying to come up with the correct approach here:

1)  method SliderAppMaster.scheduleFailureWindowResets exits but is never called.  Intentional?
2)  AppState.onCompletedNode calls noteFailed(), incrementing the "failed" count.  However, it isn't clear whether AppState.checkFailureThreshold() is called as well as part of a container completion due to failure?  I'm finding it difficult to follow the various failure code paths, so just looking for some clarification of the flow.

> Slider's count of failure threshold may not be accurate or it could be a logging issue
> --------------------------------------------------------------------------------------
>
>                 Key: SLIDER-629
>                 URL: https://issues.apache.org/jira/browse/SLIDER-629
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>    Affects Versions: Slider 0.50
>            Reporter: Sumit Mohanty
>            Assignee: Jonathan Maron
>             Fix For: Slider 0.70
>
>
> One of the long running HBase tests failed with the following error:
> {noformat}
> 2014-11-08 01:07:26,407 [AmExecutor-008] ERROR appmaster.SliderAppMaster - Cluster teardown triggered org.apache.slider.core.exceptions.TriggerClusterTeardownException: Unstable Application Instance : - failed with component HBASE_REGIONSERVER failing 8 times (0 in startup); threshold is 5 - last failure: Failure container_1415341585168_0005_01_000008 on host onprem-slider23: http://onprem-slider21:19888/jobhistory/logs/onprem-slider23:45454/container_1415341585168_0005_01_000008/ctx/hadoop
> {noformat}
> However, a total of 9 HBASE_REGIONSERVER containers were created:
> {noformat}
> 2014-11-07 16:00:35,346 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000002, on onprem-slider25:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000005, on onprem-slider24:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000007, on onprem-slider22:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000008, on onprem-slider23:45454,
> 2014-11-07 23:51:20,040 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000009, on onprem-slider22:45454,
> 2014-11-07 23:58:44,810 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000013, on onprem-slider24:45454,
> 2014-11-08 00:12:17,804 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000015, on onprem-slider22:45454,
> 2014-11-08 00:15:57,373 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000018, on onprem-slider25:45454,
> 2014-11-08 01:06:36,771 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000020, on onprem-slider25:45454,
> {noformat}
> Since 4 were requested but 9 were created, there appear to have been 5 failures, not the 8 reported.
> Perhaps it's a logging issue. Can we also print the window, e.g. "5 failures in X minutes or hours"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)