Posted to users@cloudstack.apache.org by Jorge Luiz Correa <jo...@embrapa.br.INVALID> on 2023/04/26 19:24:41 UTC

Problem migrating big volume between primary storage pools.

Has anyone had problems migrating "big" volumes between different pools? I
have 3 storage pools. The overprovisioning factor was set to 2.0 (the
default) and pool2 became full, so I changed the factor to 1.0 and now need
to move some volumes from pool2 to pool3.

CS 4.17.2.0, Ubuntu 22.04 LTS. I'm using KVM with NFS. Same zone, same pod,
same cluster. All hosts (hypervisors) have all 3 pools mounted. I've tried
two approaches:

1) from the instance details page, with the instance stopped, using the
option "Migrate instance to another primary storage" (when the instance is
running this option is named "Migrate instance to another host"). Then, I
marked "Migrate all volume(s) of the instance to a single primary storage"
and chose the destination primary storage, pool3.

2) from the volume details page, with the instance stopped, using the option
"Migrate volume" and then selecting the destination primary storage, pool3.

Neither method worked with a 1.1 TB volume. Do they do the same thing?

Looking at the host that executes the action, I can see that it mounts the
secondary storage and starts a "qemu-img convert" process to generate a new
volume. After some time (3 hours) and 1.1 TB copied, the process fails with:

com.cloud.utils.exception.CloudRuntimeException: Resource [StoragePool:8]
is unreachable: Migrate volume failed:
com.cloud.utils.exception.CloudRuntimeException: Failed to copy
/mnt/4be0a812-1d87-376f-9e72-db79206a796c/565fa2dd-ff14-4b28-a5d0-dbe88b860ee9
to d3d5a858-285c-452b-b33f-c152c294711b.qcow2

I checked in the database that StoragePool:8 is pool3, the destination.
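
For reference, that check can be done straight from the management server
database. A minimal sketch, assuming the default "cloud" database and the
standard storage_pool table (user and credentials are illustrative):

# map the StoragePool id from the exception to its name and UUID
mysql -u cloud -p cloud -e \
  "SELECT id, name, uuid, pool_type, host_address, path
     FROM storage_pool WHERE id = 8;"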

After the failure, the async job finishes, but the new qcow2 file remains
on secondary storage, orphaned.

So, the host is saying it can't reach pool3. BUT this pool is mounted!
There are other VMs running from pool3, and I've successfully migrated many
other VMs using 1) or 2), but those VMs had volumes of up to 100 GB.
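
In case it helps with debugging, this is roughly how that can be
double-checked on the host that ran the copy. Only a sketch: the mount
point below is a placeholder, following the /mnt/<pool-uuid> pattern that
the KVM agent uses for NFS primary storage:

# on the KVM host that executed the qemu-img convert
POOL_MOUNT=/mnt/<pool3-uuid>                               # placeholder

mount | grep "$POOL_MOUNT"                                 # NFS export mounted?
df -h "$POOL_MOUNT"                                        # free space left?
touch "$POOL_MOUNT/.rw-test" && rm "$POOL_MOUNT/.rw-test"  # writable by the host?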

I'm using

job.cancel.threshold.minutes: 480
migratewait: 28800
storage.pool.max.waitseconds: 28800
wait: 28800

so there are no log messages about timeouts.
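
These values can be confirmed with CloudMonkey, e.g.:

(admin@uds) 🐱 > list configurations name=migratewait
(admin@uds) 🐱 > list configurations name=storage.pool.max.waitseconds
(admin@uds) 🐱 > list configurations name=job.cancel.threshold.minutes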

Any help?

Thank you :)

-- 
Jorge Luiz Corrêa
Embrapa Agricultura Digital

echo "CkpvcmdlIEx1aXogQ29ycmVhCkFu
YWxpc3RhIGRlIFJlZGVzIGUgU2VndXJhbm
NhCkVtYnJhcGEgQWdyaWN1bHR1cmEgRGln
aXRhbCAtIE5USQpBdi4gQW5kcmUgVG9zZW
xsbywgMjA5IChCYXJhbyBHZXJhbGRvKQpD
RVAgMTMwODMtODg2IC0gQ2FtcGluYXMsIF
NQClRlbGVmb25lOiAoMTkpIDMyMTEtNTg4
Mgpqb3JnZS5sLmNvcnJlYUBlbWJyYXBhLm
JyCgo="|base64 -d


Re: Problem migrating big volume between primary storage pools.

Posted by Jorge Luiz Correa <jo...@embrapa.br.INVALID>.
Thank you so much Bryan! It worked!

I would like to comment on two things.

The migration I was trying, using the web GUI options with secondary storage
as an intermediate step, was probably failing because of two timeout
parameters:

kvm.storage.offline.migration.wait: 28800
kvm.storage.online.migration.wait: 28800

The default value is 10800, and I noticed that two attempts stopped at
exactly 3 hours. I didn't know about these parameters, so I think that if I
tried again now the volume would be copied.

So, beyond job.cancel.threshold.minutes, migratewait,
storage.pool.max.waitseconds and wait, we need to configure
kvm.storage.offline.migration.wait and kvm.storage.online.migration.wait
too.
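
They are global settings, so they can be raised with CloudMonkey, e.g.
(28800 is the value I used; depending on the setting, the management
server or the agents may need a restart before the new value is applied):

(admin@uds) 🐱 > update configuration name=kvm.storage.offline.migration.wait value=28800
(admin@uds) 🐱 > update configuration name=kvm.storage.online.migration.wait value=28800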

To use

(admin@uds) 🐱 > migrate virtualmachinewithvolume hostid=UUID
virtualmachineid=UUID migrateto[0].volume=UUID migrateto[0].pool=UUID

I had to configure the CloudMonkey timeout too:

(admin@uds) 🐱 > set timeout 28800

so everything worked :) After 3h40min, the 1.6 TB volume was live migrated.
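
As an alternative to raising the CloudMonkey timeout, the migration can
also be started without blocking and polled afterwards. A rough sketch
(the jobid is the one returned by the migrate call):

(admin@uds) 🐱 > set asyncblock false
(admin@uds) 🐱 > migrate virtualmachinewithvolume hostid=UUID virtualmachineid=UUID migrateto[0].volume=UUID migrateto[0].pool=UUID
(admin@uds) 🐱 > query asyncjobresult jobid=JOBID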

Thank you :)


Re: Problem migrating big volume between primary storage pools.

Posted by Bryan Lima <br...@scclouds.com.br>.
Hey Jorge,

Nice to see another fellow around!

> Neither method worked with a 1.1 TB volume. Do they do the same
> thing?
Both methods have different validations; however, essentially they do
the same thing: while the VM is stopped, the volume is copied to the
secondary storage and then to the primary storage. On the other hand,
when the VM is running, ACS copies the volume directly to the
destination pool. Could you try migrating these volumes while the VM is
still running (using the API *migrateVirtualMachineWithVolume*)? In that
scenario, the migration does not copy the volumes to the secondary
storage; thus, it is faster and reduces the stress/load on your
network and storage systems. Let me know if this option works for you
or if you have any doubts about how to use live migration with KVM.
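
For reference, a rough sketch of how the UUIDs for that call can be
gathered with CloudMonkey (the keyword and filter values are just
examples):

(admin@uds) 🐱 > list virtualmachines keyword=myvm filter=id,name,state
(admin@uds) 🐱 > list volumes virtualmachineid=VM-UUID filter=id,name,storage
(admin@uds) 🐱 > list hosts type=Routing state=Up filter=id,name
(admin@uds) 🐱 > list storagepools filter=id,name,state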

Besides that, we have seen some problems when this migration process is
not finished properly: it leaves leftovers in the storage pool that
consume valuable storage resources, and it can create database
inconsistencies. It is worth taking a look at the storage pool for these
files and also validating the database to see if any inconsistencies
were created there.
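
For example, something along these lines can help spot them (the
secondary storage mount point, credentials and state filter below are
only assumptions, meant as a starting point rather than a cleanup
procedure):

# on a host with the secondary storage export mounted (path illustrative)
find /mnt/secstorage -name '*.qcow2' -mmin -300 -ls

# volumes left in a transient state in the management server database
mysql -u cloud -p cloud -e \
  "SELECT id, name, uuid, state, pool_id FROM volumes
    WHERE removed IS NULL AND state NOT IN ('Ready', 'Allocated');"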

Best regards,
Bryan