Posted to dev@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2014/03/12 16:19:42 UTC

[jira] [Updated] (AMBARI-4992) Sometimes cluster installation pauses for few minutes between tasks

     [ https://issues.apache.org/jira/browse/AMBARI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-4992:
---------------------------------------

    Description: 
h2. The problem
Primarily affects pluggable (Python-based) services.
During cluster installation, there may be a few significant pauses between tasks. During such a pause, the previous task shows up as completed in the UI, and the next task shows up as not started yet. This may happen 1-3 times while installing an entire cluster, and a single pause can take around 3 minutes.
Initial analysis shows that this time is consumed by executing service checks that have been queued during cluster installation.

h2. Some background:
The server issues a large set of EXECUTION_COMMANDs a few times during cluster installation. Typically, all commands in one set are sent to the agent at once. On the agent, status and execution commands are stored in the same queue. While the cluster is being installed, status commands keep being appended to the end of the queue. So when the last INSTALL command completes, there is a large number of status commands in the queue (hundreds?). Executing them may take around 3 minutes. START commands that the server has already issued will not be scheduled for execution until all STATUS_COMMANDs in the queue have been performed. In the UI, it looks as if the installation has hung.
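
To make the arithmetic concrete, here is a minimal sketch of a single FIFO queue in which a START command sits behind queued status commands. The function name {{delay_before_start}} and the numbers (180 pending status commands at one second each) are illustrative assumptions chosen to match the roughly 3 minute pause, not measurements taken from the agent.

```python
from collections import deque


def delay_before_start(pending_status_commands, status_cost_seconds=1.0):
    """Return how long a START command waits in a single shared FIFO queue.

    Both arguments are hypothetical: the count of queued status commands
    and the time one status command takes are assumptions for illustration.
    """
    # Status commands accumulated while INSTALL commands were running.
    shared_queue = deque(
        ("STATUS_COMMAND", status_cost_seconds)
        for _ in range(pending_status_commands)
    )
    # The START command arrives last, so it sits at the tail of the queue.
    shared_queue.append(("EXECUTION_COMMAND", 0.0))

    waited = 0.0
    while shared_queue:
        kind, cost = shared_queue.popleft()
        if kind == "EXECUTION_COMMAND":
            # START is reached only after everything ahead of it has run.
            return waited
        waited += cost
    return waited


if __name__ == "__main__":
    # 180 status commands at 1 second each delay START by 180 seconds.
    print(delay_before_start(180))
```

The delay grows linearly with the number of status commands ahead of START, which is why a faster INSTALL phase followed by slower status commands makes the pause more visible.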

h2. Why it became noticeable with pluggable services:
It's due to a few factors:
- Python services install faster
- status commands run a bit slower because we invoke a separate subprocess to determine every status, and also perform more IO

I've attached a relevant log. The interesting part is after the text
{code}
INFO 2013-12-18 13:43:44,163 Heartbeat.py:76 - Sending heartbeat with response id: 419 and timestamp: 1387374224161. Command(s) in progress: True. Components mapped: True
{code}
The Zookeeper start has finished, and after that only status commands are executed for a few minutes (the START task for the next component showed up in the UI as scheduled, but not yet started).

h2. Possible solutions
- have a separate queue for status commands and execute status commands in a separate thread (preferable)
- or, when adding new status commands to the ActionQueue, remove the status commands that are already waiting at the end of the queue (hack)
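
A rough sketch of how the two ideas could look on the agent side follows. It is not the existing ActionQueue code: the class {{SplitCommandQueue}}, its method names, and the handler callbacks are hypothetical, and only the Python standard library ({{queue}}, {{threading}}) is used. Execution commands and status commands get their own queue and worker thread, and a new batch of status commands replaces any stale ones still waiting.

```python
import queue
import threading


class SplitCommandQueue:
    """Hypothetical sketch: one queue and one worker thread per command type."""

    def __init__(self, run_execution, run_status):
        # run_execution and run_status are caller-supplied handlers (assumed).
        self.execution_queue = queue.Queue()
        self.status_queue = queue.Queue()
        self._stop = threading.Event()
        self._threads = [
            threading.Thread(target=self._worker, daemon=True,
                             args=(self.execution_queue, run_execution)),
            threading.Thread(target=self._worker, daemon=True,
                             args=(self.status_queue, run_status)),
        ]

    def start(self):
        for thread in self._threads:
            thread.start()

    def stop(self):
        self._stop.set()
        for thread in self._threads:
            thread.join()

    def put_execution(self, command):
        # INSTALL/START commands never wait behind status commands.
        self.execution_queue.put(command)

    def put_status_batch(self, commands):
        # Drop status commands that are still waiting: a newer batch
        # makes them stale, so the status queue cannot grow unbounded.
        while True:
            try:
                self.status_queue.get_nowait()
            except queue.Empty:
                break
        for command in commands:
            self.status_queue.put(command)

    def _worker(self, source, handler):
        # Poll with a short timeout so stop() is noticed promptly.
        while not self._stop.is_set():
            try:
                command = source.get(timeout=0.05)
            except queue.Empty:
                continue
            handler(command)
```

With this split, a slow status command can delay only other status commands. Whether dropping stale status commands is acceptable depends on how the server consumes status reports, which this sketch does not model.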

> Sometimes cluster installation pauses for few minutes between tasks
> -------------------------------------------------------------------
>
>                 Key: AMBARI-4992
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4992
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent
>            Reporter: Vitaliy Semenyk
>            Assignee: Dmitry Lysnichenko
>


