Posted to dev@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2014/03/13 15:07:42 UTC

[jira] [Resolved] (AMBARI-4992) Sometimes cluster installation pauses for few minutes between tasks

     [ https://issues.apache.org/jira/browse/AMBARI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko resolved AMBARI-4992.
----------------------------------------

    Resolution: Fixed

committed to trunk

> Sometimes cluster installation pauses for few minutes between tasks
> -------------------------------------------------------------------
>
>                 Key: AMBARI-4992
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4992
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent
>    Affects Versions: 1.5.0
>            Reporter: Vitaliy Semenyk
>            Assignee: Dmitry Lysnichenko
>
> h2. The problem
> Primarily affects pluggable (python-based) services.
> During cluster installation, there may be a few significant pauses between task executions. During a pause, the previous task shows up as completed in the UI, and the next task shows up as not started yet. This effect may be noticed 1-3 times when installing an entire cluster, with a single pause taking around 3 minutes in some cases.
> Initial analysis shows that this time is consumed by executing service checks that have been queued during cluster installation.
> h2. Some background:
> The server issues a large set of EXECUTION_COMMANDs at once, several times during cluster installation. Typically, all commands in one set are sent to the agent at once. On the agent, status and execution commands are stored in the same queue. While the cluster is being installed, status commands are appended to the end of the queue. So when the last INSTALL command completes, a large number of status commands (hundreds?) have accumulated in the queue. Executing them may take around 3 minutes. START commands that have been issued by the server will not be scheduled for execution until all STATUS_COMMANDs in the queue have been performed. In the UI, it looks as if the installation has hung.
> h2. Why it became noticeable with pluggable services:
>  It's due to a few factors:
> - Python services install faster
> - status commands run a bit slower because we invoke a separate subprocess to determine every status, and also perform more IO
> I've attached a relevant log. The interesting part is after the text
> {code}
> INFO 2013-12-18 13:43:44,163 Heartbeat.py:76 - Sending heartbeat with response id: 419 and timestamp: 1387374224161. Command(s) in progress: True. Components mapped: True
> {code}
> The Zookeeper start finished, and after that only status commands were executed for a few minutes (the START task for the next component showed up in the UI as scheduled, but not yet started).
> h2. Selected solution
>  I prefer the approach of checking whether the command queue is empty and then picking status commands from last_status. This is better because it can be done every 2 seconds, whereas status commands are sent by the server only every minute. I assume we still do not store duplicate commands in last_status.
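
For illustration only, here is a minimal sketch of the selected approach as described above. It is not the committed patch and not the actual ambari-agent code. The class name AgentScheduler, the methods put_execution, put_status and tick, and the command strings are all hypothetical. It also assumes that last_status keeps one entry per component and that entries stay in place after being picked, which the description does not confirm.

```python
from collections import deque


class AgentScheduler:
    """Hypothetical sketch: execution commands and status commands kept apart."""

    def __init__(self):
        # Execution commands (INSTALL, START, ...) are run in FIFO order.
        self.command_queue = deque()
        # Latest status command per component; overwriting avoids duplicates.
        self.last_status = {}

    def put_execution(self, command):
        self.command_queue.append(command)

    def put_status(self, component, command):
        # Assumption: one entry per component, replaced on every server request.
        self.last_status[component] = command

    def tick(self):
        """One loop iteration: return the commands that would be run now."""
        if self.command_queue:
            # Execution commands always go first, so START never waits
            # behind accumulated status commands.
            return [self.command_queue.popleft()]
        # Queue is empty: use the idle time to run the known status commands.
        return list(self.last_status.values())


scheduler = AgentScheduler()
scheduler.put_execution("INSTALL ZOOKEEPER_SERVER")
for _ in range(100):  # status requests arriving while INSTALL is running
    scheduler.put_status("ZOOKEEPER_SERVER", "STATUS ZOOKEEPER_SERVER")
    scheduler.put_status("DATANODE", "STATUS DATANODE")
scheduler.put_execution("START ZOOKEEPER_SERVER")

ticks = [scheduler.tick() for _ in range(3)]
for commands in ticks:
    print(commands)
```

In this sketch the START command is picked immediately after INSTALL, and the 200 status requests collapse into two entries that run only once the command queue is empty. In a real agent, tick would presumably be driven by the 2-second loop mentioned in the description.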



--
This message was sent by Atlassian JIRA
(v6.2#6252)