Posted to dev@cloudstack.apache.org by Marcus Sorensen <sh...@gmail.com> on 2013/09/03 19:59:49 UTC

job cancelled because of management server restart

I'm trying to figure out if/how management and agent restarts are
gracefully handled for long-running jobs. My initial testing shows
that maybe they aren't. For example, if I try to migrate a storage
volume and then restart the management server, I end up with two
volumes (source and destination) stuck in the migrating state, with
the VM unable to start and the job reporting:

            {
                "accountid": "505add16-12d8-11e3-8495-5254004eff4f",
                "cmd": "org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd",
                "created": "2013-09-03T11:41:55-0600",
                "jobid": "698cc7cf-4ecc-40da-9bcf-261a7921ab95",
                "jobprocstatus": 0,
                "jobresult": {
                    "errorcode": 530,
                    "errortext": "job cancelled because of management server restart"
                },
                "jobresultcode": 530,
                "jobresulttype": "object",
                "jobstatus": 2,
                "userid": "505bd5d6-12d8-11e3-8495-5254004eff4f"
            }
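
For reference, the JSON above is the queryAsyncJobResult response for
the stuck job. Here's a minimal Java sketch of how one might poll it,
assuming the unauthenticated integration API port (8096) is enabled;
against the normal API port you'd also need the usual apiKey/signature
parameters:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Minimal sketch: poll queryAsyncJobResult for the job shown above.
    // Assumes the unauthenticated integration port is enabled; real API
    // calls need apiKey and signature parameters as well.
    public class PollJob {
        public static void main(String[] args) throws Exception {
            String jobId = "698cc7cf-4ecc-40da-9bcf-261a7921ab95";
            URL url = new URL("http://localhost:8096/client/api"
                    + "?command=queryAsyncJobResult"
                    + "&jobid=" + jobId
                    + "&response=json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // the job JSON, as above
                }
            }
        }
    }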

If all jobs react this way, it doesn't seem like a small bug, but
perhaps a design issue. If a job is cancelled, the state should be
rolled back, I think. Perhaps every job should have a cleanup method
that is called when the job is considered cancelled (assuming the
cancellation occurs prior to shutdown, but then that doesn't handle
crashes).
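
To make that concrete, here's a rough sketch of the kind of hook I'm
imagining. Nothing like this exists today; the interface and names are
hypothetical:

    // Hypothetical sketch only -- not an existing CloudStack interface.
    // Idea: every async job ships a compensating action that the job
    // framework invokes if the job is cancelled before it completes.
    public interface CancellableJob {
        void execute() throws Exception;

        // Called by the job manager when the job is cancelled, e.g. on a
        // clean management server shutdown. For a volume migration this
        // would flip the source volume back to Ready and discard the
        // half-copied destination volume.
        void onCancelled();
    }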

The end result is that everyone using CloudStack should be terrified
of restarting their management server, I think, especially as their
environment grows and has many things going on. Anything that goes
through a state machine could get stuck.

Re: job cancelled because of management server restart

Posted by Wei ZHOU <us...@gmail.com>.
Makes sense.



Re: job cancelled because of management server restart

Posted by Chip Childers <ch...@sungard.com>.
On Tue, Sep 03, 2013 at 10:53:14PM +0000, Kelven Yang wrote:
> I've refactored the modeling used by the VM sync process, but wasn't
> able to merge it into the main branch for the 4.2 release due to
> community concerns about how late such architectural changes would
> land. I will restart the merge effort after the 4.2 release.

Now would be a good time to consider merging into master...

Re: job cancelled because of management server restart

Posted by Kelven Yang <ke...@citrix.com>.
This is a design issue that we need to improve in general. However,
simple rollback logic does not solve the problem, since abnormal
termination can happen at any time, which means it can happen in the
middle of the job cancellation process as well.

Under the current architecture, the cleanup work is handled in the VM
sync process. We allow jobs to be cancelled or to fail at any time;
this design decision may leave temporary failures in operations that
were being carried out by the stopping/crashed management server, and
the VM sync process will then self-heal and bring the system data back
into a consistent state. This design choice is still acceptable to a
certain degree; unfortunately, the process is buggy in current
CloudStack releases. The example Marcus gave falls into the category
of a bug in re-syncing a VM in the migrating state (the correct
behavior is basically to fail the job and allow the user to re-issue
the command).
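
To illustrate the intended behavior, here is a rough, self-contained
sketch with stand-in types, not the actual CloudStack code. Because
the pass only ever moves volumes from MIGRATING back to READY, it is
idempotent: a crash during recovery is repaired by simply running it
again on the next startup.

    import java.util.List;

    // Rough sketch with stand-in types -- not the actual CloudStack code.
    class VolumeSyncSketch {
        enum State { READY, MIGRATING }

        static class Volume {
            long id;
            State state;
        }

        interface VolumeStore {
            List<Volume> findByState(State s);
            void update(Volume v);
        }

        // Run on management server startup: fail anything still marked
        // as migrating so stable state is restored and the user can
        // re-issue the command.
        static void resync(VolumeStore store) {
            for (Volume v : store.findByState(State.MIGRATING)) {
                v.state = State.READY; // fail the migration
                store.update(v);       // user can now re-issue the command
            }
        }
    }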

I've refactored the modeling used by the VM sync process, but wasn't
able to merge it into the main branch for the 4.2 release due to
community concerns about how late such architectural changes would
land. I will restart the merge effort after the 4.2 release.

Kelven 
