Posted to users@cloudstack.apache.org by Makrand <ma...@gmail.com> on 2016/08/08 10:54:29 UTC

Mess after volume migration.

Guys,

My setup:- ACS 4.4.2. Hypervisor: XENserver 6.2.

I tried moving a volume of a running VM from primary storage A to primary
storage B (using the CloudStack GUI). Please note, primary storage A's LUN
(LUN7) comes from one storage box and primary storage B's LUN (LUN14)
comes from another.

For VM1, with a 250GB data volume (51 GB used space), I was able to move the
volume without any glitch in about 26 minutes.

But for VM2, with a 250GB data volume (182 GB used space), the migration
continued for about ~110 minutes and then failed at the very end with the
following exception:

2016-08-06 14:30:57,481 WARN  [c.c.h.x.r.CitrixResourceBase]
(DirectAgent-192:ctx-5716ad6d) Task failed! Task record:
uuid: 308a8326-2622-e4c5-2019-3beb
87b0d183
           nameLabel: Async.VDI.pool_migrate
     nameDescription:
   allowedOperations: []
   currentOperations: {}
             created: Sat Aug 06 12:36:27 UTC 2016
            finished: Sat Aug 06 14:30:32 UTC 2016
              status: failure
          residentOn: com.xensource.xenapi.Host@f242d3ca
            progress: 1.0
                type: <none/>
              result:
           errorInfo: [SR_BACKEND_FAILURE_80, , Failed to mark VDI hidden
[opterr=SR 96e879bf-93aa-47ca-e2d5-e595afbab294: error aborting existing
process]]
         otherConfig: {}
           subtaskOf: com.xensource.xenapi.Task@aaf13f6f
            subtasks: []


So CloudStack just removed the job, reporting it as failed, according to the
management server log.

A) But when I check at the hypervisor level, the volume is on the new SR,
i.e. on LUN14. Strange, huh? So now the entry for this volume from the xe CLI
looks like:

[root@gcx-bom-compute1 ~]# xe vbd-list
vm-uuid=3fcb3070-e373-3cf9-d0aa-0a657142a38d
uuid ( RO)             : f15dc54a-3868-8de8-5427-314e341879c6
          vm-uuid ( RO): 3fcb3070-e373-3cf9-d0aa-0a657142a38d
    vm-name-label ( RO): i-22-803-VM
         vdi-uuid ( RO): cc1f8e83-f224-44b7-9359-282a1c1e3db1
            empty ( RO): false
           device ( RO): hdb

B) But luckily I had captured the entry before the migration, and it showed:

uuid ( RO) : f15dc54a-3868-8de8-5427-314e341879c6
vm-uuid ( RO): 3fcb3070-e373-3cf9-d0aa-0a657142a38d
vm-name-label ( RO): i-22-803-VM
vdi-uuid ( RO): 7c073522-a077-41a0-b9a7-7b61847d413b
empty ( RO): false
device ( RO): hdb
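Comparing A) and B), the VBD is the same but the vdi-uuid has changed. As a
quick cross-check (a diagnostic sketch, to be run on the XenServer host; the
uuids are taken from the outputs above, and the SR uuid placeholder must be
filled in from the first command's output):

```shell
# Which SR does the new VDI (from the post-migration vbd-list) live on?
xe vdi-param-get uuid=cc1f8e83-f224-44b7-9359-282a1c1e3db1 param-name=sr-uuid

# Resolve that SR uuid to its name-label to confirm it is the LUN14 SR:
xe sr-list uuid=<sr-uuid-from-above> params=name-label
```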

C) Since this failed at the CloudStack level, the DB is still holding the old
values. Here is the current volumes table entry in the DB:

                        id: 1004
                account_id: 22
                 domain_id: 15
                   pool_id: 18
              last_pool_id: NULL
               instance_id: 803
                 device_id: 1
                      name: cloudx_globalcloudxchange_com_W2797T2808S3112_V1462960751
                      uuid: a8f01042-d0de-4496-98fa-a0b13648bef7
                      size: 268435456000
                    folder: NULL
                      path: 7c073522-a077-41a0-b9a7-7b61847d413b
                    pod_id: NULL
            data_center_id: 2
                iscsi_name: NULL
                   host_ip: NULL
               volume_type: DATADISK
                 pool_type: NULL
          disk_offering_id: 6
               template_id: NULL
first_snapshot_backup_uuid: NULL
               recreatable: 0
                   created: 2016-05-11 09:59:12
                  attached: 2016-05-11 09:59:21
                   updated: 2016-08-06 14:30:57
                   removed: NULL
                     state: Ready
                chain_info: NULL
              update_count: 42
                 disk_type: NULL
    vm_snapshot_chain_size: NULL
                    iso_id: NULL
            display_volume: 1
                    format: VHD
                  min_iops: NULL
                  max_iops: NULL
             hv_ss_reserve: 0
1 row in set (0.00 sec)


So the path column still shows the value 7c073522-a077-41a0-b9a7-7b61847d413b
(the old VDI uuid) and pool_id is still 18 (the old pool).

The VM is running as of now, but I am sure that the moment I reboot, this
volume will be gone or, worse, the VM won't boot. This is a production VM, BTW.

D) So I think I need to edit the volumes table, put the new values in the path
and pool_id columns, and then reboot the VM. Do I need to make any more changes
in the DB, in some other tables, for the same? Any comment/help is much
appreciated.
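For reference, a sketch of the DB edit described in D), assuming the usual
`cloud` database: the new `path` value is the post-migration vdi-uuid from the
`xe vbd-list` output above, and NEW_POOL_ID is a placeholder that must first be
looked up in the `storage_pools` table for the target primary storage (LUN14).
Take a DB backup before running anything like this.

```shell
# Sketch only -- back up the DB first, and verify NEW_POOL_ID
# in cloud.storage_pools before running.
mysql -u cloud -p cloud <<'SQL'
UPDATE volumes
SET path    = 'cc1f8e83-f224-44b7-9359-282a1c1e3db1',  -- new vdi-uuid from xe vbd-list
    pool_id = NEW_POOL_ID                              -- id of the new primary storage
WHERE id = 1004;
SQL
```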




--
Best,
Makrand

Re: Mess after volume migration.

Posted by Yiping Zhang <yz...@marketo.com>.
I encountered the same problem a few months ago.  With help from this list, I fixed my problems without any data loss, and posted my solution on the list.  If you search the following subject line “corrupt DB after VM live migration with storage migration”,  you should see my posts.

Good luck

Yiping



Re: Mess after volume migration.

Posted by Makrand <ma...@gmail.com>.
Ilya,

The point to be noted is that my job didn't fail because of the timeout, but
rather because of some VDI problem at XenServer, with the exception below:

[SR_BACKEND_FAILURE_80, , Failed to mark VDI hidden [opterr=SR
96e879bf-93aa-47ca-e2d5-e595afbab294: error aborting existing process]]

I am still digging into this error via SMlog etc. on the XenServer. But in
reality the volume was migrated, and I think that's important.


I did, of course, face the timeout error during initial testing, and after
some trial and error I realised that there is a not-so-aptly-named parameter
called *wait* (default value 1800) that also needs to be modified to make the
timeout error go away.

So, all in all, I modified the parameters as below:

migratewait: 36000
storage.pool.max.waitseconds: 36000
vm.op.cancel.interval: 36000
vm.op.cleanup.wait: 36000
wait: 18000
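The global settings above can be changed from the UI or, as a sketch, via
CloudMonkey (assuming it is installed and pointed at the management server);
note that most global settings only take effect after a management server
restart.

```shell
# Sketch: bump the long-running-job timeouts via the updateConfiguration API.
for setting in migratewait storage.pool.max.waitseconds \
               vm.op.cancel.interval vm.op.cleanup.wait; do
    cloudmonkey update configuration name="$setting" value=36000
done
cloudmonkey update configuration name=wait value=18000
```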





--
Best,
Makrand



Re: Mess after volume migration.

Posted by ilya <il...@gmail.com>.
This happened to us on a non-Xen hypervisor as well.

CloudStack has a timeout for long-running jobs, which I assume was exceeded in
your case.

Changing the volumes table should be enough, referencing the proper pool_id.
Just make sure that the data size matches on both ends.

Consider changing "copy.volume.wait" and, if that does not help, also
"vm.job.timeout".
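On "make sure that the data size matches": a minimal sketch of that check,
using the `size` value from the volumes table dump earlier in the thread. The
xe command for the hypervisor side is shown as a comment, since it must run on
the XenServer host.

```shell
# The DB reports size=268435456000 bytes for the volume. Sanity-check that
# this is exactly the 250 GiB the disk offering provisioned.
db_size=268435456000
expected=$((250 * 1024 * 1024 * 1024))
if [ "$db_size" -eq "$expected" ]; then
    echo "size matches: ${db_size} bytes = 250 GiB"
fi
# On the XenServer host, compare against what the hypervisor reports:
#   xe vdi-param-get uuid=cc1f8e83-f224-44b7-9359-282a1c1e3db1 param-name=virtual-size
```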


Regards
ilya
