Posted to dev@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2014/01/17 16:41:19 UTC
[jira] [Commented] (AMBARI-4324) Server should rely on command reports when considering tasks timed out

    [ https://issues.apache.org/jira/browse/AMBARI-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874882#comment-13874882 ] 

Dmitry Lysnichenko commented on AMBARI-4324:
--------------------------------------------

h1. Implementation proposal:

1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A CANCEL_COMMAND carries the identifier (task_id + stage_id) of the exact command to cancel.
2. On the server side, commands of this type are issued when tasks are considered timed out. I'm going to do that in org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py immediately on arrival (they are not put into the ActionQueue). If the command named by a CANCEL_COMMAND is not present in the ActionQueue (it is already in progress or completed), the CANCEL_COMMAND is silently ignored.
4. Also, the agent clears the entire ActionQueue when it cannot continue exchanging heartbeats with the server (disconnect or registration requested). I'm going to add the appropriate logic to src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The motivation is to make recovery from network/server failures faster and more reliable (the agent will have an empty ActionQueue and can start executing new EXECUTION_COMMANDs and STATUS_COMMANDs right after registration).
5. In both cases described above (executing a single CANCEL_COMMAND or clearing the entire ActionQueue), EXECUTION_COMMANDs are treated as transaction-like: EXECUTION_COMMANDs that are already IN_PROGRESS are never interrupted. This decreases the chances of leaving the system in a misconfigured or unpredictable state.
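
To make items 3-5 concrete, here is a rough Python sketch of the agent-side behaviour. It is only an illustration: the class ActionQueueSketch, its methods, and the "taskId"/"stageId" dictionary keys are hypothetical names chosen for this example and are not the existing agent code.

```python
import threading
from collections import deque


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (not the real class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()   # commands that have not started yet
        self.in_progress = None   # command currently executing, if any

    def put(self, command):
        with self._lock:
            self._pending.append(command)

    def start_next(self):
        """Move the next pending command to IN_PROGRESS and return it."""
        with self._lock:
            self.in_progress = self._pending.popleft() if self._pending else None
            return self.in_progress

    def cancel(self, task_id, stage_id):
        """Handle one CANCEL_COMMAND (item 3).

        Only pending commands are removed. If the target is already in
        progress or completed, nothing happens and False is returned.
        """
        with self._lock:
            for command in list(self._pending):
                if (command["taskId"], command["stageId"]) == (task_id, stage_id):
                    self._pending.remove(command)
                    return True
            return False

    def clear_pending(self):
        """Drop all pending commands on disconnect/re-registration (item 4).

        The in-progress command is deliberately left untouched (item 5).
        """
        with self._lock:
            dropped = len(self._pending)
            self._pending.clear()
            return dropped


queue = ActionQueueSketch()
for task_id in (1, 2, 3):
    queue.put({"taskId": task_id, "stageId": 1})
queue.start_next()                               # task 1 is now in progress

ignored = queue.cancel(task_id=1, stage_id=1)    # in progress: silently ignored
removed = queue.cancel(task_id=2, stage_id=1)    # still pending: removed
dropped = queue.clear_pending()                  # drops task 3, keeps task 1
```

Both operations touch only the pending part of the queue, which is what keeps an already running EXECUTION_COMMAND from being interrupted.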

Also, I'm going to fix a bug in org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. There, we pass the stage timeout instead of the task timeout as a parameter to org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded. After the fix, the task timeout plus a small margin will be passed instead. The extra margin (10-30 seconds) avoids sending a CANCEL_COMMAND unless absolutely necessary (in most cases the task will time out at the agent on its own).
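
The intended check can be sketched as follows. The real logic lives in the Java ActionScheduler; this Python fragment only illustrates the arithmetic, and the function name cancel_needed, its parameters, and the 20-second margin are assumptions picked for the example (the proposal only says 10-30 seconds).

```python
# Assumed margin, chosen from the proposed 10-30 second range.
CANCEL_GRACE_SECONDS = 20


def cancel_needed(task_start, task_timeout, now, grace=CANCEL_GRACE_SECONDS):
    """Return True when the server should send a CANCEL_COMMAND.

    All values are in seconds. The task (not stage) timeout is used, and the
    margin gives the agent time to time the task out on its own first.
    """
    return now > task_start + task_timeout + grace


# A task started at t=0 with a 600 s timeout:
within_margin = cancel_needed(task_start=0, task_timeout=600, now=610)
past_margin = cancel_needed(task_start=0, task_timeout=600, now=621)
```

With these numbers, the server stays quiet at 610 seconds, when the agent is expected to have timed the task out itself, and sends a CANCEL_COMMAND only once 620 seconds have passed.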

This implementation should also solve another related jira AMBARI-4324

[~mahadev] and/or [~sumitmohanty], can you please take a look at this proposal?


> Server should rely on command reports when considering tasks timed out
> ----------------------------------------------------------------------
>
>                 Key: AMBARI-4324
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4324
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>
> As of now, the task timeout at the server and the timeout at the agent are two different mechanisms that work independently and duplicate each other.
> This behaviour leads to a strange scenario:
> - cluster installation is started
> - execution of some command exceeds timeout
> - server considers this command and *all subsequent* commands in the request timed out. This state is shown in the UI as well.
> - at the same time, agent considers the currently executing command timed out and kills it. After that, agent starts executing the next command in the queue. If the next commands do not fail, agent sends COMPLETE status reports.
> - server receives the COMPLETE status reports and updates component status.
> - if user clicks "Retry installation", only tasks for components that are not installed are created.
> - as a result, UI shows fewer tasks than the user expects
> Changes in scope of this jira:
> add a TIMEDOUT command status report type at the agent. On the server side, the HostRoleStatus enum already has this status type. Modify server behaviour: the server considers a task timed out when it receives the corresponding command report from the agent. In this case, all task time tracking logic is consolidated at the agent. Doing that will simplify timeout handling for CustomCommands and CustomActions.
> Some issues may occur when the agent host goes down and therefore does not send any command reports. The server should have some handling for such a case.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)