Posted to dev@ambari.apache.org by "Sumit Mohanty (JIRA)" <ji...@apache.org> on 2015/04/20 19:12:59 UTC

[jira] [Commented] (AMBARI-10606) Ambari Agent needs to retry failed install/start operations

    [ https://issues.apache.org/jira/browse/AMBARI-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503220#comment-14503220 ] 

Sumit Mohanty commented on AMBARI-10606:
----------------------------------------

There are currently two scenarios where a retry feature can help:
* (see description) Component A depends on component B being up. B is not up yet, so the start of component A should be retried a few times to allow for delays. While this does not guarantee success, it increases the chance of success.
* When commands are executed in parallel, component B may likewise fail to start while waiting for A to start. If the start of B is retried a few times with some delay in between, the chance of success increases.

The biggest difference between the two is how long to wait between retries and how long to keep retrying. The second scenario typically requires a shorter wait time, as the hosts are already provisioned and it's just the ordering of starts that is lost when commands are performed in parallel. The first depends on differences in host provisioning time, so a longer wait between retries should help. A good compromise is a progressively longer wait time between retries and a large number of retries.
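
To make the compromise concrete, here is a minimal sketch of a retry loop with a progressively longer wait. It is not the Ambari agent's code: the function name, the parameters and their default values are all hypothetical and only illustrate the idea.

```python
import time


def retry_with_backoff(operation, max_retries=10, initial_delay=5.0,
                       backoff_factor=2.0, max_delay=300.0, sleep=time.sleep):
    """Call operation() until it returns True or the retries run out.

    All names and defaults here are illustrative assumptions, not Ambari
    settings. The wait starts short (enough for the parallel-start case)
    and grows on each retry up to max_delay (to cover hosts that are
    provisioned late).
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        if operation():
            return True
        if attempt == max_retries:
            break  # no retries left, so do not wait again
        sleep(delay)
        # Progressively longer wait, capped so a single wait stays bounded.
        delay = min(delay * backoff_factor, max_delay)
    return False
```

A caller would pass a function that runs the start command and returns True on success, for example retry_with_backoff(start_component), where start_component is a placeholder. The sleep parameter exists only so the waits can be replaced in tests.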

> Ambari Agent needs to retry failed install/start operations
> -----------------------------------------------------------
>
>                 Key: AMBARI-10606
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10606
>             Project: Ambari
>          Issue Type: Task
>    Affects Versions: 2.0.0
>            Reporter: Sumit Mohanty
>            Assignee: Sumit Mohanty
>             Fix For: 2.1.0
>
>
> With the changes to cluster provisioning in Ambari 2.1, each host is provisioned independently in its own request. Additionally, users may make provisioning requests prior to hosts becoming available. This means that components that connect to other components in the cluster may start prior to the component that they are attempting to connect to. This connect behavior is outside of Ambari proper and differs significantly between services/components.
> An example of this is HISTORY_SERVER, which attempts to connect to NAMENODE and, if it fails to connect, retries a couple of times and fails with a timeout after a small number of seconds.
> As a result, the Ambari agent in 2.1 needs to retry failed operations (especially start operations). The retry timeout should be a significant amount of time and could be configurable. This will allow hosts to join the cluster at different times without component connection timeouts causing the request to "fail".
> Currently when a timeout occurs, it doesn't affect other component operations but does result in a "FAILED" response to the user and the user will need to manually start the failed component.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)