Posted to dev@cloudstack.apache.org by Sean Lair <sl...@ippathways.com> on 2019/01/22 16:30:22 UTC

Snapshots on KVM corrupting disk images

Hi all,

We have had some instances where VM disks became corrupted when using KVM snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.

The first time was when someone mass-enabled scheduled snapshots on a large number of VMs and secondary storage filled up.  We had to restore all those VM disks...  But we believed it was our fault for letting secondary storage fill up.

Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot.  Here is the output of some commands:

-----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-----------------------

We tried restoring to before the snapshot failure, but still have strange errors:

----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 73G
cluster_size: 65536
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
Format specific information:
    compat: 1.1
    lazy refcounts: false

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
No errors were found on the image.

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
--------------------------

Everyone is now extremely hesitant to use snapshots in KVM....  We tried deleting the snapshots in the restored disk image, but it errors out...


Does anyone else have issues with KVM snapshots?  We are considering just disabling this functionality now...

Thanks
Sean







RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Hi Simon

It is an NFS mount.  The underlying storage is a NetApp that we run a lot of different environments on; it is rock-solid.  The only issues we've had are with KVM snapshots.

Thanks
Sean

-----Original Message-----
From: Simon Weller [mailto:sweller@ena.com.INVALID] 
Sent: Tuesday, January 22, 2019 10:42 AM
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Sean,


What underlying primary storage are you using and how is it being utilized by ACS (e.g. NFS, shared mount et al)?



- Si


________________________________
From: Sean Lair <sl...@ippathways.com>
Sent: Tuesday, January 22, 2019 10:30 AM
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Snapshots on KVM corrupting disk images

Hi all,

We have had some instances where VM disks became corrupted when using KVM snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.

The first time was when someone mass-enabled scheduled snapshots on a large number of VMs and secondary storage filled up.  We had to restore all those VM disks...  But we believed it was our fault for letting secondary storage fill up.

Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot.  Here is the output of some commands:

-----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-----------------------

We tried restoring to before the snapshot failure, but still have strange errors:

----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 73G
cluster_size: 65536
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
Format specific information:
    compat: 1.1
    lazy refcounts: false

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
No errors were found on the image.

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
--------------------------

Everyone is now extremely hesitant to use snapshots in KVM....  We tried deleting the snapshots in the restored disk image, but it errors out...


Does anyone else have issues with KVM snapshots?  We are considering just disabling this functionality now...

Thanks
Sean







Re: Snapshots on KVM corrupting disk images

Posted by Simon Weller <sw...@ena.com.INVALID>.
Sean,


What underlying primary storage are you using and how is it being utilized by ACS (e.g. NFS, shared mount et al)?



- Si


________________________________
From: Sean Lair <sl...@ippathways.com>
Sent: Tuesday, January 22, 2019 10:30 AM
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Snapshots on KVM corrupting disk images

Hi all,

We have had some instances where VM disks became corrupted when using KVM snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.

The first time was when someone mass-enabled scheduled snapshots on a large number of VMs and secondary storage filled up.  We had to restore all those VM disks...  But we believed it was our fault for letting secondary storage fill up.

Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot.  Here is the output of some commands:

-----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-----------------------

We tried restoring to before the snapshot failure, but still have strange errors:

----------------------
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 73G
cluster_size: 65536
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
Format specific information:
    compat: 1.1
    lazy refcounts: false

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
No errors were found on the image.

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
--------------------------

Everyone is now extremely hesitant to use snapshots in KVM....  We tried deleting the snapshots in the restored disk image, but it errors out...


Does anyone else have issues with KVM snapshots?  We are considering just disabling this functionality now...

Thanks
Sean







Re: Snapshots on KVM corrupting disk images

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
I've run into situations where CloudStack + KVM + qcow2 + snapshots led to
corrupted images, mostly on 4.3 with NFS, but I thought that CloudStack
stops the VM just before it takes the snapshot. At least the VM's behavior
while a VM snapshot is created (freezing) suggests that is what happens,
which is why this looks strange. But in general I agree that the above
combination leads to data corruption, especially when the storage is under
I/O pressure. We recommend that our customers avoid running snapshots with
such a combination if possible.

On Wed, 23 Jan 2019 at 05:06, Wei ZHOU <us...@gmail.com> wrote:

> Hi Sean,
>
> The (recurring) volume snapshot on running vms should be disabled in
> cloudstack.
>
> According to some discussions (for example
> https://bugzilla.redhat.com/show_bug.cgi?id=920020), the image might be
> corrupted due to the concurrent read/write operations in volume snapshot
> (by qemu-img snapshot).
>
> ```
>
> qcow2 images must not be used in read-write mode from two processes at the
> same
> time. You can either have them opened either by one read-write process or
> by
> many read-only processes. Having one (paused) read-write process (the
> running
> VM) and additional read-only processes (copying out a snapshot with
> qemu-img)
> may happen to work in practice, but you're on your own and we won't give
> support for such attempts.
>
> ```
> The safe way to take a volume snapshot of running vm is
> (1) take a vm snapshot (vm will be paused)
> (2) then create a volume snapshot from the vm snapshot
>
> -Wei
>
>
>
Sean Lair <sl...@ippathways.com> wrote on Tue, 22 Jan 2019 at 17:30:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using
> KVM
> > snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a lot
> > of large number VMs and secondary storage filled up.  We had to restore
> all
> > those VM disks...  But believed it was just our fault with letting
> > secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image
> is
> > corrupted and the VM can't boot.  here is the output of some commands:
> >
> > -----------------------
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could
> > not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could
> > not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -----------------------
> >
> > We tried restoring to before the snapshot failure, but still have strange
> > errors:
> >
> > ----------------------
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID        TAG                 VM SIZE                DATE       VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> > 3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> > 3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3
> > 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc
> 0x55d16ddf2541
> > 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6
> 0x7fb9c63a3c05
> > 0x55d16ddd9f7d
> > No errors were found on the image.
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img
> snapshot
> > -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID        TAG                 VM SIZE                DATE       VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> > 3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> > 3431:52:23.942
> > --------------------------
> >
> > Everyone is now extremely hesitant to use snapshots in KVM....  We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> >
> >
> > Does anyone else have issues with KVM snapshots?  We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean
> >
> >
> >
> >
> >
> >
> >
>


-- 
With best regards, Ivan Kudryavtsev
Bitworks LLC
Cell RU: +7-923-414-1515
Cell USA: +1-201-257-1512
WWW: http://bitworks.software/

RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Thanks Wei!  We really appreciate the response and the link.

Shouldn't we be doing something to disable snapshot operations (scheduled and otherwise) in CloudStack?

-----Original Message-----
From: Wei ZHOU [mailto:ustcweizhou@gmail.com] 
Sent: Tuesday, January 22, 2019 4:06 PM
To: dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Hi Sean,

(Recurring) volume snapshots of running VMs should be disabled in CloudStack.

According to some discussions (for example https://bugzilla.redhat.com/show_bug.cgi?id=920020), the image might be corrupted due to the concurrent read/write operations in volume snapshot (by qemu-img snapshot).

```

qcow2 images must not be used in read-write mode from two processes at the same time. You can either have them opened either by one read-write process or by many read-only processes. Having one (paused) read-write process (the running
VM) and additional read-only processes (copying out a snapshot with qemu-img) may happen to work in practice, but you're on your own and we won't give support for such attempts.

```
The safe way to take a volume snapshot of a running VM is:
(1) take a vm snapshot (vm will be paused)
(2) then create a volume snapshot from the vm snapshot

-Wei



Sean Lair <sl...@ippathways.com> wrote on Tue, 22 Jan 2019 at 17:30:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using 
> KVM snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a 
> lot of large number VMs and secondary storage filled up.  We had to 
> restore all those VM disks...  But believed it was just our fault with 
> letting secondary storage fill up.
>
> Today we had an instance where a snapshot failed and now the disk 
> image is corrupted and the VM can't boot.  here is the output of some commands:
>
> -----------------------
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -----------------------
>
> We tried restoring to before the snapshot failure, but still have 
> strange
> errors:
>
> ----------------------
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3
> 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 
> 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 
> 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d No errors were found on 
> the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> --------------------------
>
> Everyone is now extremely hesitant to use snapshots in KVM....  We 
> tried deleting the snapshots in the restored disk image, but it errors out...
>
>
> Does anyone else have issues with KVM snapshots?  We are considering 
> just disabling this functionality now...
>
> Thanks
> Sean
>
>
>
>
>
>
>

Re: Snapshots on KVM corrupting disk images

Posted by Wei ZHOU <us...@gmail.com>.
Hi Sean,

(Recurring) volume snapshots of running VMs should be disabled in CloudStack.

According to some discussions (for example
https://bugzilla.redhat.com/show_bug.cgi?id=920020), the image might be
corrupted due to the concurrent read/write operations in volume snapshot
(by qemu-img snapshot).

```

qcow2 images must not be used in read-write mode from two processes at the same
time. You can either have them opened either by one read-write process or by
many read-only processes. Having one (paused) read-write process (the running
VM) and additional read-only processes (copying out a snapshot with qemu-img)
may happen to work in practice, but you're on your own and we won't give
support for such attempts.

```
The safe way to take a volume snapshot of a running VM is:
(1) take a vm snapshot (vm will be paused)
(2) then create a volume snapshot from the vm snapshot
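
Done by hand with libvirt, that would look roughly like this (a sketch only;
the domain name "vm-01", the snapshot tag and the paths are placeholders,
and it assumes qcow2 disks that support internal snapshots):

```
# (1) pause the guest and take the snapshot through libvirt/QEMU itself,
#     not with a separate qemu-img process writing to the same image
virsh suspend vm-01
virsh snapshot-create-as vm-01 snap-01
virsh resume vm-01

# (2) afterwards, copy the snapshot out with a read-only qemu-img process
qemu-img convert -f qcow2 -O qcow2 -s snap-01 \
    /var/lib/libvirt/images/vm-01.qcow2 /backup/vm-01-snap-01.qcow2
```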

-Wei



Sean Lair <sl...@ippathways.com> wrote on Tue, 22 Jan 2019 at 17:30:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using KVM
> snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a lot
> of large number VMs and secondary storage filled up.  We had to restore all
> those VM disks...  But believed it was just our fault with letting
> secondary storage fill up.
>
> Today we had an instance where a snapshot failed and now the disk image is
> corrupted and the VM can't boot.  here is the output of some commands:
>
> -----------------------
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could
> not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could
> not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -----------------------
>
> We tried restoring to before the snapshot failure, but still have strange
> errors:
>
> ----------------------
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3
> 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541
> 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05
> 0x55d16ddd9f7d
> No errors were found on the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot
> -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> --------------------------
>
> Everyone is now extremely hesitant to use snapshots in KVM....  We tried
> deleting the snapshots in the restored disk image, but it errors out...
>
>
> Does anyone else have issues with KVM snapshots?  We are considering just
> disabling this functionality now...
>
> Thanks
> Sean
>
>
>
>
>
>
>

RE: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
Yes, that's the scariest thing: you never find out on the same day that the image got corrupted. Usually a week or a fortnight passes before one learns about the problem (and all the old snapshots have been successfully removed by that time).

Some time ago I implemented a simple script that ran `qemu-img check` on each image on a daily basis, but I had to give that idea up, because `qemu-img check` usually reports a lot of errors on a running instance's volume; it only tells the truth when the instance is stopped. :-(

Here is a bit of advice.
1. First of all, never take a snapshot while the VM shows high I/O activity. I implemented an SNMP agent that exposes the I/O activity of all VMs under a certain MIB, plus another application that manages snapshots: it creates a new snapshot only when it's reasonably sure the VM isn't writing a lot of data to the storage (a sketch of the idea follows after this list). I'd gladly share it, but implementing all these things can be a bit tricky, and I need some time to document it. Of course, you can always implement your own solution for that. Maybe it would be a nice idea to implement this in ACS itself. :)
2. Consider dropping caches every hour (`/bin/echo 1 > /proc/sys/vm/drop_caches`). I found some correlation between corrupted images and cache overflow.
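
A minimal sketch of the write-activity gate from item 1, assuming libvirt is in use; the domain name, device, threshold and the actual snapshot trigger are all placeholders:

```
#!/bin/sh
# Take a snapshot only if the guest wrote little data in the last 30 s.
DOM=vm-01                      # placeholder domain name
DEV=vda                        # placeholder block device
LIMIT=$((10 * 1024 * 1024))    # placeholder threshold: 10 MiB per 30 s

# `virsh domblkstat` prints lines like "vda wr_bytes 31264768"
w1=$(virsh domblkstat "$DOM" "$DEV" | awk '/wr_bytes/ {print $3}')
sleep 30
w2=$(virsh domblkstat "$DOM" "$DEV" | awk '/wr_bytes/ {print $3}')

if [ $((w2 - w1)) -lt "$LIMIT" ]; then
    echo "write rate low, taking snapshot of $DOM"
    # placeholder: trigger the CloudStack volume snapshot here (API call)
else
    echo "$DOM is busy writing, skipping this round"
fi
```

The hourly cache drop from item 2 is just a root crontab entry: `0 * * * * /bin/echo 1 > /proc/sys/vm/drop_caches`.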

I'm still not 100% sure these measures guarantee calm sleep at night, but my statistics (~600 VMs on different hosts, clusters, pods and zones) show that implementing them was a correct step (knocking on wood, spitting over the left shoulder, etc.).

Good luck!


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, 1 February 2019 22:01, Sean Lair <sl...@ippathways.com> wrote:

> Hello,
>
> We are using NFS storage. It is actually native NFS mounts on a NetApp storage system. We haven't seen those log entries, but we also don't always know when a VM gets corrupted... When we finally get a call that a VM is having issues, we've found that it was corrupted a while ago.
>
> -----Original Message-----
> From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID]
> Sent: Sunday, January 27, 2019 1:45 PM
> To: users@cloudstack.apache.org
> Cc: dev@cloudstack.apache.org
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing during the last 5-6 years of using ACS with KVM hosts (see this thread, if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as a filesystem? Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair slair@ippathways.com wrote:
>
> > Hi all,
> > We had some instances where VM disks are becoming corrupted when using KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> > The first time was when someone mass-enabled scheduled snapshots on a lot of large number VMs and secondary storage filled up. We had to restore all those VM disks... But believed it was just our fault with letting secondary storage fill up.
> > Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot. here is the output of some commands:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID        TAG                 VM SIZE                DATE       VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img
> > snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID        TAG                 VM SIZE                DATE       VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried deleting the snapshots in the restored disk image, but it errors out...
> > Does anyone else have issues with KVM snapshots? We are considering just disabling this functionality now...
> > Thanks
> > Sean



Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
Dear colleagues,

Has anyone upgraded to 4.11.3? This version includes a patch that should help avoid this problem: https://github.com/apache/cloudstack/pull/3194. It would be great to know whether it has helped you.

Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan <cl...@protonmail.com> wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot). It allows a snapshot to be created at one moment and backed up at another. It would be amazing if it also allowed the snapshot deletion procedure (I mean deletion from the primary storage) to be performed as a separate operation, so that I could check that I/O activity is low before _deleting_ a snapshot from the primary storage, not only before _creating_ it. That could be a nice workaround.
>
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan <cl...@protonmail.com> wrote:
>
>> By the way, RedHat recommended suspending a VM before deleting a snapshot too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>>
>>> 1. Pause the VM
>>>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>      of the running VM, not with an external qemu-img process. virsh may or may
>>>      not provide an interface for this.
>>>   3. You can resume the VM now
>>>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>>   5. Pause the VM again
>>>   6. 'delvm' in the qemu monitor
>>>   7. Resume the VM
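>>
>> A hedged translation of that recipe into virsh commands (the domain name
>> "vm-01" and the snapshot tag are placeholders, and driving the HMP monitor
>> behind libvirt's back is unsupported territory, as the quote warns):
>>
>> ```
>> virsh suspend vm-01
>> virsh qemu-monitor-command vm-01 --hmp 'savevm snap-20190204'
>> virsh resume vm-01
>> qemu-img convert -f qcow2 -O qcow2 -s snap-20190204 \
>>     /var/lib/libvirt/images/vm-01.qcow2 /backup/vm-01-snap.qcow2
>> virsh suspend vm-01
>> virsh qemu-monitor-command vm-01 --hmp 'delvm snap-20190204'
>> virsh resume vm-01
>> ```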
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, 4 February 2019 07:36, cloudstack-fan <cl...@protonmail.com> wrote:
>>
>>> I'd also like to add another detail, if no one minds.
>>>
>>> Sometimes one can run into this issue without shutting down a VM. The disaster might occur right after a snapshot is copied to a secondary storage and deleted from the VM's image on the primary storage. I saw it a couple of times, when it happened to the VMs being monitored. The monitoring suite showed that these VMs were working fine right until the final phase (apart from a short pause of the snapshot creating stage).
>>>
>>> I also noticed that a VM is always suspended when a snapshot is being created and `virsh list` shows it's in the "paused" state, but when a snapshot is being deleted from the image the same command always shows the "running" state, although the VM doesn't respond to anything during the snapshot deletion phase.
>>>
>>> It seems to be a bug in KVM/QEMU itself, I think. Proxmox users face the same issue (see https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/, https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ and other similar threads), but it would also be great to add some workaround to ACS. Maybe, just as you proposed, it would be wise to suspend the VM before snapshot deletion and resume it afterwards. That would give ACS a serious advantage over other orchestration systems. :-)
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <ku...@bw-sw.com> wrote:
>>>
>>>> Yes, the image turns out to be corrupted only after the VM is shut down.
>>>>
>>>> On Fri, 1 Feb 2019 at 15:01, Sean Lair <slair@ippathways.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using NFS storage.  It is actually native NFS mounts on a NetApp storage system.  We haven't seen those log entries, but we also don't always know when a VM gets corrupted...  When we finally get a call that a VM is having issues, we've found that it was corrupted a while ago.
>>>>>
>>>>> -----Original Message-----
>>>>> From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID]
>>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>>> To: users@cloudstack.apache.org
>>>>> Cc: dev@cloudstack.apache.org
>>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>>
>>>>> Hello Sean,
>>>>>
>>>>> It seems that you've encountered the same issue that I've been facing during the last 5-6 years of using ACS with KVM hosts (see this thread, if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>>
>>>>> I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.
>>>>>
>>>>> I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as a filesystem? Did you see something like this in your log-files?
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> Did you see any unusual messages in your log-file when the disaster happened?
>>>>>
>>>>> I hope, things will be well. Wish you good luck and all the best!
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We had some instances where VM disks are becoming corrupted when using KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>>>
>>>>>> The first time was when someone mass-enabled scheduled snapshots on a lot of large number VMs and secondary storage filled up. We had to restore all those VM disks... But believed it was just our fault with letting secondary storage fill up.
>>>>>>
>>>>>> Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot. here is the output of some commands:
>>>>>>
>>>>>> -----------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>>>>> Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>>>>> Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> -----------------------
>>>>>>
>>>>>> We tried restoring to before the snapshot failure, but still have strange errors:
>>>>>>
>>>>>> ----------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> file format: qcow2
>>>>>> virtual size: 50G (53687091200 bytes)
>>>>>> disk size: 73G
>>>>>> cluster_size: 65536
>>>>>> Snapshot list:
>>>>>> ID        TAG                 VM SIZE                DATE       VM CLOCK
>>>>>> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
>>>>>> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
>>>>>> Format specific information:
>>>>>>     compat: 1.1
>>>>>>     lazy refcounts: false
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
>>>>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
>>>>>> No errors were found on the image.
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img
>>>>>> snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> Snapshot list:
>>>>>> ID        TAG                 VM SIZE                DATE       VM CLOCK
>>>>>> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
>>>>>> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
>>>>>>
>>>>>> --------------------------
>>>>>>
>>>>>> Everyone is now extremely hesitant to use snapshots in KVM.... We tried deleting the snapshots in the restored disk image, but it errors out...
>>>>>>
>>>>>> Does anyone else have issues with KVM snapshots? We are considering just disabling this functionality now...
>>>>>>
>>>>>> Thanks
>>>>>> Sean

Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
Dear colleagues,

Have anyone upgraded to 4.11.3? This version includes a patch that should help to avoid encountering with this problem: https://github.com/apache/cloudstack/pull/3194. It would be great to know if it has helped you.

Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan <cl...@protonmail.com> wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot). It allows to create a snapshot at one moment and back it up in another. It would be amazing if it gave opportunity to perform the snapshot deletion procedure (I mean deletion from a primary storage) as a separate operation. So I could check if I/O-activity is low before to _delete_ a snapshot from a primary storage, not only before to _create_ it, it could be a nice workaround.
>
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan <cl...@protonmail.com> wrote:
>
>> By the way, RedHat recommended to suspend a VM before deleting a snapshot too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>>
>>> 1. Pause the VM
>>>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>      of the running VM, not with an external qemu-img process. virsh may or may
>>>      not provide an interface for this.
>>>   3. You can resume the VM now
>>>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>>   5. Pause the VM again
>>>   6. 'delvm' in the qemu monitor
>>>   7. Resume the VM

Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
And one more thought, by the way.

There's a cool new feature - asynchronous backup (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot). It allows one to create a snapshot at one moment and back it up at another. It would be amazing if it also exposed the snapshot deletion procedure (I mean deletion from the primary storage) as a separate operation. Then I could check that I/O activity is low before _deleting_ a snapshot from the primary storage, not only before _creating_ one; that could be a nice workaround (a rough sketch of such an I/O gate follows below).
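
Just to illustrate the idea, here is a sketch of what such an I/O gate might look like on a KVM host. It is only a sketch, not anything ACS does today; the domain name, device name, sampling window and threshold are all made-up examples:

-----------------------
#!/bin/bash
# Wait until the guest's write activity is low before triggering the
# (hypothetical) separate snapshot-deletion operation.
VM="i-2-345-VM"                    # illustrative libvirt domain name
DEV="vda"                          # illustrative disk device
THRESHOLD=$((10 * 1024 * 1024))    # under 10 MiB written per window = "quiet"

while true; do
    w1=$(virsh domblkstat "$VM" "$DEV" | awk '$2 == "wr_bytes" {print $3}')
    sleep 30                       # sampling window
    w2=$(virsh domblkstat "$VM" "$DEV" | awk '$2 == "wr_bytes" {print $3}')
    if [ $((w2 - w1)) -lt "$THRESHOLD" ]; then
        break                      # quiet enough to proceed
    fi
done
# ...only now ask ACS to delete the snapshot from the primary storage
-----------------------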

Dear colleagues, what do you think, is it doable?

Thank you!

Best regards,
a big CloudStack fan :)

Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
By the way, Red Hat also recommends suspending a VM before deleting a snapshot: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:

> 1. Pause the VM
>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>      of the running VM, not with an external qemu-img process. virsh may or may
>      not provide an interface for this.
>   3. You can resume the VM now
>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>   5. Pause the VM again
>   6. 'delvm' in the qemu monitor
>   7. Resume the VM
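
For what it's worth, here is a rough translation of those steps into shell for a KVM host. It is only a sketch: the domain name, image path and snapshot tag are illustrative, and `virsh qemu-monitor-command --hmp` is an unsupported passthrough to the QEMU monitor, so treat it as an experiment rather than a recipe:

-----------------------
#!/bin/bash
VM="i-2-345-VM"                        # illustrative libvirt domain name
DISK="/mnt/primary/volume-uuid.qcow2"  # illustrative path to the qcow2 image
SNAP="snap-$(date +%F)"                # illustrative snapshot tag

virsh suspend "$VM"                                    # 1. pause the VM
virsh qemu-monitor-command --hmp "$VM" "savevm $SNAP"  # 2. internal snapshot
virsh resume "$VM"                                     # 3. resume the VM
qemu-img convert -f qcow2 -O qcow2 -s "$SNAP" \
  "$DISK" "$DISK-snapshot"                             # 4. extract the snapshot
virsh suspend "$VM"                                    # 5. pause again
virsh qemu-monitor-command --hmp "$VM" "delvm $SNAP"   # 6. drop the snapshot
virsh resume "$VM"                                     # 7. resume the VM
-----------------------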

Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
I'd also like to add another detail, if no one minds.

Sometimes one can run into this issue without shutting down a VM. The disaster might occur right after a snapshot is copied to secondary storage and deleted from the VM's image on the primary storage. I saw it happen a couple of times to VMs that were being monitored: the monitoring suite showed they were working fine right up until the final phase (apart from a short pause during the snapshot-creation stage).

I also noticed that a VM is always suspended while a snapshot is being created - `virsh list` shows it in the "paused" state - but while a snapshot is being deleted from the image the same command always shows the "running" state, although the VM doesn't respond to anything during the deletion phase.

It seems to be a bug in KVM/QEMU itself, I think. Proxmox users face the same issue (see https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/, https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ and other similar threads), but it would also be great to implement a workaround in ACS. Maybe, just as you proposed, it would be wise to suspend the VM before snapshot deletion and resume it afterwards (see the sketch below). It would give ACS a serious advantage over other orchestration systems. :-)
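
A minimal sketch of that workaround, assuming the libvirt domain name, image path and snapshot tag are known (all the names below are made up):

-----------------------
#!/bin/bash
VM="i-2-345-VM"                             # illustrative domain name
DISK="/mnt/primary/volume-uuid.qcow2"       # illustrative image path
TAG="b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd"  # illustrative snapshot tag

virsh suspend "$VM"                  # stop vCPUs so no new I/O hits the image
qemu-img snapshot -d "$TAG" "$DISK"  # delete the internal snapshot
virsh resume "$VM"                   # let the guest continue
-----------------------

Even with the VM paused, qemu-img is still a second process opening an image that QEMU holds open, so this would only narrow the window, not close it completely.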

Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
Just like that cat in a box. The observer needs to open the box to learn if the cat is alive. :-)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <ku...@bw-sw.com> wrote:

> Yes, the image only turns out to be corrupted after the VM is shut down.

Re: Snapshots on KVM corrupting disk images

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Yes, it is only after the VM is shut down that the image turns out to be corrupted.

Fri, 1 Feb 2019, 15:01 Sean Lair slair@ippathways.com:

> Hello,
>
> We are using NFS storage.  It is actually native NFS mounts on a NetApp
> storage system.  We haven't seen those log entries, but we also don't
> always know when a VM gets corrupted...  When we finally get a call that a
> VM is having issues, we've found that it was corrupted a while ago.

RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Hello,

We are using NFS storage.  It is actually native NFS mounts on a NetApp storage system.  We haven't seen those log entries, but we also don't always know when a VM gets corrupted...  When we finally get a call that a VM is having issues, we've found that it was corrupted a while ago.
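
One way to catch this sooner is a periodic integrity sweep over the pool with qemu-img check. Below is a minimal sketch; the mount point and log path are placeholders rather than CloudStack defaults, and checking an image that a running VM has open can raise false alarms, so a non-zero result is a prompt to investigate rather than proof of corruption:

----------------------
#!/bin/bash
# Sweep (sketch): per qemu-img(1), check exits 0 = clean, 1 = check could
# not run, 2 = corruption found, 3 = leaked clusters but not corrupted.
STORE=/mnt/primary            # placeholder for the primary-storage NFS mount
LOG=/var/log/qcow2-sweep.log  # placeholder log destination
for img in "$STORE"/*; do
    qemu-img check "$img" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$(date) rc=$rc: possible problem in $img" >> "$LOG"
    fi
done
----------------------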


-----Original Message-----
From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID] 
Sent: Sunday, January 27, 2019 1:45 PM
To: users@cloudstack.apache.org
Cc: dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Hello Sean,

It seems that you've encountered the same issue that I've been facing during the last 5-6 years of using ACS with KVM hosts (see this thread, if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).

I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.

I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as a filesystem? Did you see something like this in your log-files?
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
Did you see any unusual messages in your log-file when the disaster happened?
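
If you want to look for them after the fact, something like this should work (the log path differs between distros and is an assumption here):

dmesg | grep -i 'possible memory allocation deadlock'
grep -i 'kmem_realloc' /var/log/messages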

I hope, things will be well. Wish you good luck and all the best!



Re: Snapshots on KVM corrupting disk images

Posted by Vladimir Melnik <v....@uplink.ua>.
Dear colleagues,

Yes, that was my PR.

Now I would be very grateful for your help.

Please, be so kind as to describe your cases here: https://github.com/apache/cloudstack/pull/3194

Thank you so much!

On Fri, Mar 01, 2019 at 02:00:05PM -0500, Ivan Kudryavtsev wrote:
> Hi, Sean,
> I saw the PR https://github.com/apache/cloudstack/pull/3194
> which seems to cover one of the bugs. I haven't had enough time to dive into
> the code to review the snapshot-related workflows, but it looks like this
> PR does the right thing. Hope it will be added to 4.11.3.

-- 
V.Melnik

Re: Snapshots on KVM corrupting disk images

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Hi, Sean,
I saw the PR https://github.com/apache/cloudstack/pull/3194
which seems to cover one of the bugs. I haven't had enough time to dive into
the code to review the snapshot-related workflows, but it looks like this
PR does the right thing. Hope it will be added to 4.11.3.

Thu, 28 Feb 2019 at 17:02, Sean Lair <sl...@ippathways.com>:

> Hi Ivan, I wanted to respond here and see if you published a PR yet on
> this.
>
> This is a very scary issue for us as customer can snapshot their volumes
> and end up causing corruption - and they blame us.  It's already happened -
> luckily we had Storage Array level snapshots in place as a safety net...
>
> Thanks!!
> Sean
>
> -----Original Message-----
> From: Ivan Kudryavtsev [mailto:kudryavtsev_ia@bw-sw.com]
> Sent: Sunday, January 27, 2019 7:29 PM
> To: users <us...@cloudstack.apache.org>; cloudstack-fan <
> cloudstack-fan@protonmail.com>
> Cc: dev <de...@cloudstack.apache.org>
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Well, guys. I dived into the CS agent scripts which make volume snapshots and
> found there is no code for suspend/resume and no code to call fsfreeze/fsthaw
> through the qemu guest agent. I don't see any blockers to adding that code and
> will try to add it in the next few days. If tests go well, I'll publish the PR,
> which I suppose could be integrated into 4.11.3.
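
For illustration, that quiesce-then-snapshot sequence looks roughly like this when driven by hand with virsh. The VM name and disk path are placeholders, domfsfreeze/domfsthaw require the qemu guest agent running inside the guest, and virsh suspend/resume is the fallback when no agent is available:

----------------------
VM=myvm                                                  # placeholder domain name
DISK=/mnt/primary/184aa458-9d4b-4c1b-a3c6-23d28ea28e80   # placeholder path
virsh domfsfreeze "$VM" || virsh suspend "$VM"       # quiesce guest I/O first
qemu-img snapshot -c "snap-$(date +%Y%m%d)" "$DISK"  # internal qcow2 snapshot
virsh domfsthaw "$VM" || virsh resume "$VM"          # unfreeze right away
----------------------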


-- 
With best regards, Ivan Kudryavtsev
Bitworks LLC
Cell RU: +7-923-414-1515
Cell USA: +1-201-257-1512
WWW: http://bitworks.software/ <http://bw-sw.com/>

Re: Snapshots on KVM corrupting disk images

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Hi, Sean,
I saw the PR https://github.com/apache/cloudstack/pull/3194
which seems covers one of the bugs. Haven't had enough time to dive into
the code to do a review for snapshot-related workflows, but looks like this
PR does the right thing. Hope it will be added to 4.11.3.

чт, 28 февр. 2019 г. в 17:02, Sean Lair <sl...@ippathways.com>:

> Hi Ivan, I wanted to respond here and see if you published a PR yet on
> this.
>
> This is a very scary issue for us as customer can snapshot their volumes
> and end up causing corruption - and they blame us.  It's already happened -
> luckily we had Storage Array level snapshots in place as a safety net...
>
> Thanks!!
> Sean
>
> -----Original Message-----
> From: Ivan Kudryavtsev [mailto:kudryavtsev_ia@bw-sw.com]
> Sent: Sunday, January 27, 2019 7:29 PM
> To: users <us...@cloudstack.apache.org>; cloudstack-fan <
> cloudstack-fan@protonmail.com>
> Cc: dev <de...@cloudstack.apache.org>
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Well, guys. I dived into CS agent scripts, which make volume snapshots and
> found there are no code for suspend/resume and also no code for qemu-agent
> call fsfreeze/fsthaw. I don't see any blockers adding that code yet and try
> to add it in nearest days. If tests go well, I'll publish the PR, which I
> suppose could be integrated into 4.11.3.
>
> пн, 28 янв. 2019 г., 2:45 cloudstack-fan
> cloudstack-fan@protonmail.com.invalid:
>
> > Hello Sean,
> >
> > It seems that you've encountered the same issue that I've been facing
> > during the last 5-6 years of using ACS with KVM hosts (see this
> > thread, if you're interested in additional details:
> > https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox
> > /browser
> > ).
> >
> > I'd like to state that creating snapshots of a running virtual machine
> > is a bit risky. I've implemented some workarounds in my environment,
> > but I'm still not sure that they are 100% effective.
> >
> > I have a couple of questions, if you don't mind. What kind of storage
> > do you use, if it's not a secret? Does you storage use XFS as a
> filesystem?
> > Did you see something like this in your log-files?
> > [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> > 65552 in kmem_realloc (mode:0x250)
> > [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> > 65552 in kmem_realloc (mode:0x250)
> > [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> > 65552 in kmem_realloc (mode:0x250)
> > Did you see any unusual messages in your log-file when the disaster
> > happened?
> >
> > I hope, things will be well. Wish you good luck and all the best!
> >
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com>
> wrote:
> >
> > > Hi all,
> > >
> > > We had some instances where VM disks are becoming corrupted when using
> > > KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> > >
> > > The first time was when someone mass-enabled scheduled snapshots on a
> > > large number of VMs and secondary storage filled up. We had to restore
> > > all those VM disks... But we believed it was just our fault for letting
> > > secondary storage fill up.
> > >
> > > Today we had an instance where a snapshot failed and now the disk image
> > > is corrupted and the VM can't boot. Here is the output of some commands:
> > >
> > >
> > > ------------------------------------------------------------------------
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > ------------------------------------------------------------------------
> > >
> > > We tried restoring to before the snapshot failure, but still have strange errors:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > file format: qcow2
> > > virtual size: 50G (53687091200 bytes)
> > > disk size: 73G
> > > cluster_size: 65536
> > > Snapshot list:
> > > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > > Format specific information:
> > >     compat: 1.1
> > >     lazy refcounts: false
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > > No errors were found on the image.
> > >
> > > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > > Snapshot list:
> > > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > > ------------------------------------------------------------------------
> > >
> > > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > > deleting the snapshots in the restored disk image, but it errors out...
> > >
> > > Does anyone else have issues with KVM snapshots? We are considering just
> > > disabling this functionality now...
> > >
> > > Thanks
> > > Sean
> >
> >
> >
>


-- 
With best regards, Ivan Kudryavtsev
Bitworks LLC
Cell RU: +7-923-414-1515
Cell USA: +1-201-257-1512
WWW: http://bitworks.software/ <http://bw-sw.com/>

RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Hi Ivan, I wanted to respond here and see if you published a PR yet on this.

This is a very scary issue for us, as customers can snapshot their volumes and end up causing corruption - and they blame us.  It has already happened - luckily we had storage-array-level snapshots in place as a safety net...

Thanks!!
Sean

-----Original Message-----
From: Ivan Kudryavtsev [mailto:kudryavtsev_ia@bw-sw.com] 
Sent: Sunday, January 27, 2019 7:29 PM
To: users <us...@cloudstack.apache.org>; cloudstack-fan <cl...@protonmail.com>
Cc: dev <de...@cloudstack.apache.org>
Subject: Re: Snapshots on KVM corrupting disk images

Well, guys. I dived into the CS agent scripts which take volume snapshots and found there is no code for suspend/resume and also no code for the qemu-agent fsfreeze/fsthaw calls. I don't see any blockers to adding that code, and I will try to add it in the coming days. If tests go well, I'll publish the PR, which I suppose could be integrated into 4.11.3.

Mon, 28 Jan 2019, 2:45 cloudstack-fan
cloudstack-fan@protonmail.com.invalid:

> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing 
> during the last 5-6 years of using ACS with KVM hosts (see this 
> thread, if you're interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine 
> is a bit risky. I've implemented some workarounds in my environment, 
> but I'm still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage 
> do you use, if it's not a secret? Does your storage use XFS as a filesystem?
> Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using
> > KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a
> > large number of VMs and secondary storage filled up. We had to restore
> > all those VM disks... But we believed it was just our fault for letting
> > secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image
> > is corrupted and the VM can't boot. Here is the output of some commands:
> >
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > ------------------------------------------------------------------------
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > ------------------------------------------------------------------------
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> >
> > Does anyone else have issues with KVM snapshots? We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean
>
>
>

RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Sounds good, I think something needs to be done.  It's very scary that users can corrupt their VMs just by taking volume snapshots.


-----Original Message-----
From: Ivan Kudryavtsev [mailto:kudryavtsev_ia@bw-sw.com] 
Sent: Sunday, January 27, 2019 7:29 PM
To: users <us...@cloudstack.apache.org>; cloudstack-fan <cl...@protonmail.com>
Cc: dev <de...@cloudstack.apache.org>
Subject: Re: Snapshots on KVM corrupting disk images

Well, guys. I dived into the CS agent scripts which take volume snapshots and found there is no code for suspend/resume and also no code for the qemu-agent fsfreeze/fsthaw calls. I don't see any blockers to adding that code, and I will try to add it in the coming days. If tests go well, I'll publish the PR, which I suppose could be integrated into 4.11.3.

Mon, 28 Jan 2019, 2:45 cloudstack-fan
cloudstack-fan@protonmail.com.invalid:

> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing 
> during the last 5-6 years of using ACS with KVM hosts (see this 
> thread, if you're interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine 
> is a bit risky. I've implemented some workarounds in my environment, 
> but I'm still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage 
> do you use, if it's not a secret? Does your storage use XFS as a filesystem?
> Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using
> > KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a
> > large number of VMs and secondary storage filled up. We had to restore
> > all those VM disks... But we believed it was just our fault for letting
> > secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image
> > is corrupted and the VM can't boot. Here is the output of some commands:
> >
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > ------------------------------------------------------------------------
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > ------------------------------------------------------------------------
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> >
> > Does anyone else have issues with KVM snapshots? We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean
>
>
>

Re: Snapshots on KVM corrupting disk images

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Well, guys. I dived into the CS agent scripts which take volume snapshots and
found there is no code for suspend/resume and also no code for the qemu-agent
fsfreeze/fsthaw calls. I don't see any blockers to adding that code, and I will
try to add it in the coming days. If tests go well, I'll publish the PR, which I
suppose could be integrated into 4.11.3.
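
Roughly, the missing step would be something like this - an illustrative sketch
with placeholder names, not the final patch:

  DOMAIN=i-2-345-VM
  # prefer a guest-agent freeze; fall back to pausing the vCPUs when no agent responds
  if virsh qemu-agent-command "$DOMAIN" '{"execute":"guest-ping"}' >/dev/null 2>&1; then
      virsh domfsfreeze "$DOMAIN"
      trap 'virsh domfsthaw "$DOMAIN"' EXIT
  else
      virsh suspend "$DOMAIN"
      trap 'virsh resume "$DOMAIN"' EXIT
  fi
  # ... take the volume snapshot here ...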

Mon, 28 Jan 2019, 2:45 cloudstack-fan
cloudstack-fan@protonmail.com.invalid:

> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing
> during the last 5-6 years of using ACS with KVM hosts (see this thread, if
> you're interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is
> a bit risky. I've implemented some workarounds in my environment, but I'm
> still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do
> you use, if it's not a secret? Does your storage use XFS as a filesystem?
> Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using
> > KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a
> > large number of VMs and secondary storage filled up. We had to restore
> > all those VM disks... But we believed it was just our fault for letting
> > secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image
> > is corrupted and the VM can't boot. Here is the output of some commands:
> >
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > ------------------------------------------------------------------------
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > ------------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID        TAG                                    VM SIZE  DATE                 VM CLOCK
> > 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G     2018-12-23 11:01:43  3099:35:55.242
> > 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G     2019-01-06 11:03:16  3431:52:23.942
> > ------------------------------------------------------------------------
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> >
> > Does anyone else have issues with KVM snapshots? We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean
>
>
>

RE: Snapshots on KVM corrupting disk images

Posted by Sean Lair <sl...@ippathways.com>.
Hello,

We are using NFS storage.  It is actually native NFS mounts on a NetApp storage system.  We haven't seen those log entries, but we also don't always know when a VM gets corrupted...  When we finally get a call that a VM is having issues, we've found that it was corrupted a while ago.
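
A periodic integrity sweep of the qcow2 files on the primary store might at least catch this earlier - something along these lines (illustrative only; the mount point is a placeholder, and qemu-img should only be pointed at images that are not attached to a running VM):

  # flag any qcow2 file on the primary store that qemu-img can no longer read
  find /mnt/primary -type f | while read -r img; do
      file "$img" | grep -q 'QEMU QCOW' || continue   # skip non-qcow2 files
      qemu-img check "$img" >/dev/null 2>&1 || echo "suspect image: $img"
  done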


-----Original Message-----
From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID] 
Sent: Sunday, January 27, 2019 1:45 PM
To: users@cloudstack.apache.org
Cc: dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Hello Sean,

It seems that you've encountered the same issue that I've been facing during the last 5-6 years of using ACS with KVM hosts (see this thread, if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).

I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.

I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as a filesystem? Did you see something like this in your log-files?
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
Did you see any unusual messages in your log-file when the disaster happened?
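
You can look for that signature on the host with something like this (log locations vary by distribution):

  dmesg | grep 'possible memory allocation deadlock'
  grep -h 'kmem_realloc' /var/log/messages*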

I hope, things will be well. Wish you good luck and all the best!


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a lot of large number VMs and secondary storage filled up. We had to restore all those VM disks... But believed it was just our fault with letting secondary storage fill up.
>
> Today we had an instance where a snapshot failed, and now the disk image is corrupted and the VM can't boot. Here is the output of some commands:
>
> -----------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> -----------------------
>
> We tried restoring to before the snapshot failure, but still have strange errors:
>
> -----------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> No errors were found on the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
>
> -----------------------
>
> Everyone is now extremely hesitant to use snapshots in KVM.... We tried deleting the snapshots in the restored disk image, but it errors out...
>
> Does anyone else have issues with KVM snapshots? We are considering just disabling this functionality now...
>
> Thanks
> Sean



Re: Snapshots on KVM corrupting disk images

Posted by cloudstack-fan <cl...@protonmail.com.INVALID>.
Hello Sean,

It seems that you've encountered the same issue that I've been facing during the last 5-6 years of using ACS with KVM hosts (see this thread, if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).

I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.
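For example, one mitigation of this kind (just a sketch of the general idea, not necessarily what I run; it assumes qemu-guest-agent is installed and running inside the guest, and the instance name is a placeholder) is to flush and freeze the guest filesystems for the duration of the snapshot:

-----------------------
VM=i-2-345-VM            # placeholder libvirt domain name
virsh domfsfreeze "$VM"  # flush and freeze guest filesystems (needs qemu-guest-agent)
# ... trigger the volume snapshot here, e.g. through the CloudStack API ...
virsh domfsthaw "$VM"    # thaw again as soon as the snapshot completes
-----------------------

Keeping the freeze window as short as possible matters, since the guest's I/O stalls while it is frozen.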

I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as its filesystem? Did you see something like this in your log files?
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
Did you see any unusual messages in your log files when the disaster happened?
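If you want to check retroactively, these messages end up in the kernel ring buffer and in /var/log/messages on CentOS 7, so something like this on each KVM host should find them:

-----------------------
dmesg | grep -i 'possible memory allocation deadlock'
grep -i 'possible memory allocation deadlock' /var/log/messages*
-----------------------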

I hope things will be well. Wish you good luck and all the best!
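
P.S. Strictly as a sketch: the "Could not read snapshots: File too large" error and that absurd ~1.5 TB tcmalloc allocation look like a damaged snapshot table in the qcow2 header. The relevant header fields can be eyeballed with dd (offsets are per the qcow2 version 2/3 specification; the file name below is taken from your output):

-----------------------
IMG=./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

# nb_snapshots: 32-bit big-endian integer at byte offset 60 of the header
dd if="$IMG" bs=1 skip=60 count=4 2>/dev/null | od -An -tx1

# snapshots_offset: 64-bit big-endian integer at byte offset 64
dd if="$IMG" bs=1 skip=64 count=8 2>/dev/null | od -An -tx1

# If those values look sane, deleting the internal snapshots one at a
# time (with the VM stopped) may be worth another try:
qemu-img snapshot -d a8fdf99f-8219-4032-a9c8-87a6e09e7f95 "$IMG"
-----------------------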


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:

> Hi all,
>
> We have had some instances where VM disks became corrupted when using KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a large number of VMs and secondary storage filled up. We had to restore all those VM disks... but we believed it was just our fault for letting secondary storage fill up.
>
> Today we had an instance where a snapshot failed, and now the disk image is corrupted and the VM can't boot. Here is the output of some commands:
>
> -----------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> -----------------------
>
> We tried restoring to before the snapshot failure, but still have strange errors:
>
> -----------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> No errors were found on the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> Snapshot list:
> ID        TAG                 VM SIZE                DATE       VM CLOCK
> 1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43 3099:35:55.242
> 2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16 3431:52:23.942
>
> -----------------------
>
> Everyone is now extremely hesitant to use snapshots in KVM.... We tried deleting the snapshots in the restored disk image, but it errors out...
>
> Does anyone else have issues with KVM snapshots? We are considering just disabling this functionality now...
>
> Thanks
> Sean