Posted to dev@stratos.apache.org by "Shaheed Haque (JIRA)" <ji...@apache.org> on 2015/04/09 15:47:13 UTC
[jira] [Commented] (STRATOS-1234) Software Update Management Solution for Stratos

    [ https://issues.apache.org/jira/browse/STRATOS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487361#comment-14487361 ] 

Shaheed Haque commented on STRATOS-1234:
----------------------------------------

Hi Imesh, Sandaruwan,

Here is a written-up proposal. I *think* it covers the various use cases suggested both here and in JIRA STRATOS-1234, but as always, your thoughts on the matter are welcome. The write-up has the form of a “spec” and a “Q&A”. As a next step, I guess we could do a hang-out or con-call or something?

Thoughts welcome…

Thanks, Shaheed

OPERATIONAL STATE COMMANDS

The following commands, with the defined effects, are needed:


·        No command directly affects what I call the “major state” of the Application/Group/Cluster/Cartridge, i.e. the state as reflected in the information CURRENTLY returned by the application/{appId}/runtime information.

·        Each command affects what I call the “operational state” only. The commands and their operational states are:

o   Autoscaling on, off. Autoscaling on is current behaviour.

o   Autohealing on, off. Autohealing on is current behaviour.

o   Maintenance off, restart, replace. Maintenance off is current behaviour.

o   (We can add more later if needed)

Command: Autoscaling off.

    Server effect: CEP gathers stats and history as usual. Autoscaler operates as usual, except that no scaling is done. Instead, a cluster state variable tracks the normal, overload or underload state and logs messages when this state variable changes value.

    Cartridge effect: No effect on running cartridges. No new cartridges are spun up, no existing cartridges are spun down EXCEPT for autohealing.

Command: Autohealing off.

    Server effect: CEP ignores any heartbeat timeout other than to log that it happened, and sets an instance state variable to track this. When autohealing is turned back on, the timeout will happen again, and the failure will be acted upon normally, except that the log shall make it clear (using the instance state variable) that the autohealing had been delayed.

    Cartridge effect: No new cartridges are spun up until after the autohealing is enabled.

Command: Maintenance restart.

    Server effect: Like autohealing off, except that an extra state variable is set indicating maintenance mode is in effect. Both state variables are cleared when the Cartridge resume event is seen.

    Cartridge effect: Cartridge is signalled with an *event*, not a blocking callout. Cartridge application must be able to reboot or just restart, and have the cartridge agent resume its previous (active/inactive) state. When resuming, the agent signals the server with a resume *event*. Note this implies the cartridge agent is restartable (because the application can choose to reboot).

Command: Maintenance replace.

    Server effect: Like maintenance restart, except that the cartridge instance is replaced.

    Cartridge effect: The difference between “restart” and “replace” is that the latter is for applications that cannot update themselves, but expect essentially a new VM instance with the new software. In other words, this is the big hammer/most general approach to upgrades (e.g. this is more likely to work than an apt-get downgrade ☺).
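
As an illustration only, the server-side effects of "autoscaling off" and "autohealing off" could be sketched roughly as below. The class, field and method names are hypothetical (nothing here is an existing Stratos API), and the scale-up/scale-down step is reduced to a counter purely to show when scaling is skipped.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Set

log = logging.getLogger("operational-state-sketch")


class LoadState(Enum):
    NORMAL = "normal"
    OVERLOAD = "overload"
    UNDERLOAD = "underload"


@dataclass
class ClusterSketch:
    """Hypothetical stand-in for a cluster; not a Stratos class."""
    autoscaling_on: bool = True   # "on" is current behaviour
    autohealing_on: bool = True   # "on" is current behaviour
    load_state: LoadState = LoadState.NORMAL
    instances: int = 1            # simplification: scaling is just a counter
    delayed_heals: Set[str] = field(default_factory=set)

    def on_load_evaluated(self, new_state: LoadState) -> None:
        """Track and log the load state; scale only if autoscaling is on."""
        if new_state != self.load_state:
            log.info("load state %s -> %s",
                     self.load_state.value, new_state.value)
            self.load_state = new_state
        if not self.autoscaling_on:
            return  # stats and state tracking continue, no scaling is done
        if new_state is LoadState.OVERLOAD:
            self.instances += 1
        elif new_state is LoadState.UNDERLOAD and self.instances > 1:
            self.instances -= 1

    def on_heartbeat_timeout(self, instance_id: str) -> bool:
        """Return True if the failure should be acted upon now."""
        if not self.autohealing_on:
            log.info("heartbeat timeout for %s logged only (autohealing off)",
                     instance_id)
            self.delayed_heals.add(instance_id)
            return False
        if instance_id in self.delayed_heals:
            log.info("healing %s; autohealing had been delayed", instance_id)
            self.delayed_heals.discard(instance_id)
        return True
```

The point of the sketch is that both switches leave the monitoring path intact and only suppress the action, which is what allows a delayed failure to be reported as such once autohealing is turned back on.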



·        Each command referred to here is a REST API call.

·        Each command can apply to an entire Application, or any nested level (group or cartridge) within it.

·        Arguments for application-wide use case:

o   application={appId}, operationalState={command}

·        Arguments for nested-level use case:

o   application={appId}, nesting={0}/{1}/{2}/…/{n}, operationalState={command}
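
A minimal sketch of how a client might assemble these arguments, assuming Python on the client side. The proposal does not name an endpoint or fix the spelling of the command values, so the command strings and the function name below are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence

# Hypothetical spellings; the proposal only lists the commands informally.
COMMANDS = {
    "autoscaling-on", "autoscaling-off",
    "autohealing-on", "autohealing-off",
    "maintenance-off", "maintenance-restart", "maintenance-replace",
}


def build_arguments(app_id: str, command: str,
                    nesting: Optional[Sequence[str]] = None) -> Dict[str, str]:
    """Build the argument set for one operational state command.

    Without `nesting` the command addresses the whole application;
    with it, the path elements are joined as {0}/{1}/.../{n}.
    """
    if command not in COMMANDS:
        raise ValueError(f"unknown operational state command: {command}")
    args = {"application": app_id, "operationalState": command}
    if nesting:
        args["nesting"] = "/".join(nesting)
    return args
```

For example, `build_arguments("app1", "maintenance-restart", ["group1", "cartridge2"])` addresses a single cartridge nested inside a group, while omitting the third argument addresses the application as a whole.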

Q&A


1.      What’s the point of restart/replace, over and above auto* off?



These are to actually cause the application software in the VM instance to take note and do something. Typically, I would expect this to result in an internally-managed software update. For example, think of a set of VMs running Ubuntu and pointing to a known repository of, say, security patches: they could all just do an “apt-get update/upgrade”.



The Cartridge logic is defined to be event-based rather than blocking, because making the thing blocking would be a problem if a reboot was involved. (Also, generally, blocking operations in a distributed system raise too many edge cases like: can this operation be cancelled? Repeated? etc.).
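
To make the event-based flow concrete, here is a rough sketch of a restartable agent, under the assumption that the agent persists its previous state to local disk so that it survives a reboot. The class name, the file format and the event payload are all hypothetical; `publish` stands in for whatever messaging the agent really uses.

```python
import json
import os
import tempfile
from typing import Callable, Optional


class AgentSketch:
    """Hypothetical cartridge agent; not the real Stratos agent."""

    def __init__(self, state_path: str, publish: Callable[[dict], None]):
        self.state_path = state_path
        self.publish = publish  # assumed hook for sending events to the server

    def on_maintenance_event(self, previous_state: str) -> None:
        """Record what to resume to, then return without blocking.

        The update itself (and any reboot) happens outside this call.
        """
        with open(self.state_path, "w") as f:
            json.dump({"resume_to": previous_state}, f)

    def on_startup(self) -> Optional[str]:
        """Run whenever the agent (re)starts, including after a reboot."""
        if not os.path.exists(self.state_path):
            return None  # normal start, nothing to resume
        with open(self.state_path) as f:
            resume_to = json.load(f)["resume_to"]
        os.remove(self.state_path)
        self.publish({"event": "resume", "state": resume_to})
        return resume_to


# Usage: a maintenance event, followed by a restart of the agent process.
events = []
path = os.path.join(tempfile.mkdtemp(), "agent-state.json")
AgentSketch(path, events.append).on_maintenance_event("active")
resumed = AgentSketch(path, events.append).on_startup()
```

Because nothing waits on a reply, the server side only has to handle two independent events (maintenance sent, resume received), and a reboot in between does not leave any call hanging.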


2.      Propagation/inheritance rules



I see two options:



·        Use hierarchy. If you apply a command at hierarchy level n, and n has internal structure (i.e. it is a group, not a cartridge), the command propagates all the way down (note: this is implied in what I said for the application-level command).

·        Do not use hierarchy. The command only applies to the level to which it was addressed by the REST call.

In either case, the effect of contradictory commands is UNDEFINED, i.e. toggling the flags in quick succession will likely result in an unhelpful outcome.

I think the normal approach is NOT to use hierarchy; after all, just because there is an upgrade to be applied for application code in a given set of VMs, there is nothing to say that any elements lower down the hierarchy should be upgraded at the same time. Even in the case where (say) security patches to a common OS are to be applied, I would doubt the sanity of anybody doing this across every VM in the whole system in one go ☺. OTOH, maybe I am wrong!
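
The two options can be compared with a small sketch over a tree of application, groups and cartridges. The `Node` type and `apply_command` function are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical application, group or cartridge in the hierarchy."""
    name: str
    children: List["Node"] = field(default_factory=list)
    operational_state: Dict[str, str] = field(default_factory=dict)


def find(root: Node, nesting: List[str]) -> Node:
    """Walk the nesting path {0}/{1}/.../{n} down from the application."""
    node = root
    for name in nesting:
        matches = [c for c in node.children if c.name == name]
        if not matches:
            raise KeyError(name)
        node = matches[0]
    return node


def apply_command(root: Node, nesting: List[str], key: str, value: str,
                  use_hierarchy: bool) -> None:
    """Set one operational state on the addressed level.

    With use_hierarchy=True the setting propagates to every descendant;
    with use_hierarchy=False only the addressed node is touched.
    """
    stack = [find(root, nesting)]
    while stack:
        node = stack.pop()
        node.operational_state[key] = value
        if use_hierarchy:
            stack.extend(node.children)
```

Note that in both variants nothing above the addressed level changes; the only difference is whether the levels below it are touched.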


3.      Should these commands apply to “deployed” or only to “configured” Applications?



I think the commands can be applied whether the Application is deployed or not. Clearly, the stuff that sets flags on instances has to set those flags on all current and future instances that may spin up under a given deployment.
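
One way to picture "current and future instances" is to keep the flags on the application and copy them onto each instance as it spins up. This is only a sketch with hypothetical names, not a description of how Stratos stores state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceSketch:
    flags: Dict[str, str]


@dataclass
class ApplicationSketch:
    """Hypothetical application record; usable before or after deployment."""
    flags: Dict[str, str] = field(default_factory=dict)
    instances: List[InstanceSketch] = field(default_factory=list)

    def set_flag(self, key: str, value: str) -> None:
        """Remember the flag and push it to every running instance."""
        self.flags[key] = value
        for inst in self.instances:
            inst.flags[key] = value

    def spin_up(self) -> InstanceSketch:
        """New instances start with a copy of the flags currently in force."""
        inst = InstanceSketch(flags=dict(self.flags))
        self.instances.append(inst)
        return inst
```

With this shape, a flag set while the application is merely configured has no instances to update, but is still picked up by the first instance created at deployment time.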



From: Imesh Gunaratne [mailto:imesh@apache.org]
Sent: 27 March 2015 04:21
To: dev
Subject: Re: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) Software Update Management Solution for Stratos)

Hi Shaheed,

A really good suggestion! I think we could manage what you have suggested in the same implementation, as they overlap. I'm +1 for the idea of putting a cluster into the "Maintenance Mode" manually for diagnostic purposes and stopping autoscaling for it. We could introduce new API methods to manage this. The only question is whether we could use the same instance state for all the scenarios:

1. Update platform (might need to use the term platform here as it may get confused with the software that may run on the platform)
2. Apply patches
3. Pause a cluster for diagnostic purposes

I would like to suggest changing the updateSoftware API method to updatePlatform:
POST /applications/{applicationId}/updatePlatform

Maybe we could introduce a new API method as follows to put a cluster into "Maintenance/Diagnostic Mode":
POST /clusters/{clusterId}/pause

Thanks
Imesh

On Thu, Mar 26, 2015 at 3:01 PM, Shaheedur Haque (shahhaqu) <sh...@cisco.com> wrote:

First, let me say that I like a lot of what is proposed in this JIRA, but I am forking the thread here because I would like to suggest that we generalise just one part of it, the API into Stratos, to cover a set of related use cases.

In the current version of this JIRA, the proposed API into Stratos looks like this:

PUT /api/applications/{applicationId}/updateSoftware

(see the JIRA section 2.3 for the details). I think this is actually one of a set of possible runtime states that we would like to put VM instances and various parts of Stratos in. Notice that I am deliberately not using specific terms such as "cluster" or "Autoscaler" because working that out is the point of this email.

So, the sorts of use cases I have in mind are:

  *   Updating the cartridge software as per this JIRA
  *   Putting a cluster (or maybe an instance) into a "maintenance mode" for diagnostic reasons. There could be multiple versions of this maintenance mode where (for example)

     *   The instance(s) might still handle traffic and deliver "I'm alive" health stats but no autoscaling is done.
     *   The instance(s) don't deliver health stats, but no autohealing is done.

  *   Some of these would deliver notifications to the cartridge agent, others might only affect Stratos component(s).
  *   etc...other ideas anybody?

Thus, it might make sense to generalise the API to support a set of closely related cases. Is there interest in taking such an approach to address this JIRA, as well as in clarifying and addressing the other use cases?


Thanks, Shaheed



>  Software Update Management Solution for Stratos
> ------------------------------------------------
>
>                 Key: STRATOS-1234
>                 URL: https://issues.apache.org/jira/browse/STRATOS-1234
>             Project: Stratos
>          Issue Type: New Feature
>            Reporter: Imesh Gunaratne
>              Labels: gsoc2015, mentor
>
> Stratos uses Virtual Machines and Containers for hosting platform services on different Infrastructure as a Service (IaaS) solutions. At present Puppet is used for orchestration management on Virtual Machine based systems, and all required software is managed in Puppet Master. Container based systems create Docker images for each platform service by including the required software in the Docker image itself.
> In the Virtual Machine use-case, VM instances will communicate with Puppet Master and execute the software installation. The same approach can be used for applying software updates.
> In the Docker use-case we do not use Puppet because a new container with the required software can be started in a few seconds. This is very efficient compared to using Puppet and installing software on demand.
> The requirement of this project is to implement a core Stratos feature to propagate software updates in a live PaaS environment.
> 1. Puppet based solution:
> - Push software updates of a cartridge to Puppet Master (might not need to automate).
> - Invoke the software update process via the Stratos API for a given application.
> - Stratos Manager could send a new event to trigger puppet agent in each instance to apply the updates.
> 2. Docker based solution
> - Create a new docker image (with a new image id) for the cartridge with software updates (might not need to automate).
> - Invoke the software update process via the Stratos API for a given application.
> - Autoscaler can implement a new feature to bring down existing instances and create new instances with the new docker image id.
> Important!
> - In each scenario, if the updates are backward compatible, the software update process should execute in phases; it should not bring down the entire cluster to apply the updates, since the service would then be unavailable for a certain time period. The idea is to apply the updates to a set of members at a time.
> - If the updates are not backward compatible, we could make the entire cluster unavailable at once and apply the updates.
> - Member's state needs to be changed to a new state called "Updating" when applying the updates.
> If there is interest in doing this project, please send a mail to imesh at apache dot org, copying the Apache Dev mailing list [1]. Please refer to the Stratos Wiki [2] for more information on Stratos architecture and how it works.
> [1] http://stratos.apache.org/community/mailing-lists.html
> [2] https://cwiki.apache.org/confluence/display/STRATOS


