Posted to users@cloudstack.apache.org by cloudstack-fan <cl...@protonmail.com.INVALID> on 2018/08/18 11:06:08 UTC

Re: qemu2 images are being corrupted

Dear colleagues,

You might find it interesting:
https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/

It seems that qemu-kvm really could corrupt a QCOW2 image. :-(

What do you think, is it possible to avoid that? Maybe there's an option to use the RAW format instead of QCOW2?
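
For illustration, converting a volume to RAW offline with qemu-img could look like this (just a sketch with example paths; as far as I know the volume format on a KVM primary storage is chosen by ACS and the storage type, so this only shows the conversion itself):

    # offline conversion of a copy of the volume (example paths, VM stopped)
    qemu-img convert -f qcow2 -O raw /var/lib/libvirt/images/<volume-uuid>.qcow2 /var/lib/libvirt/images/<volume-uuid>.raw

    # sanity-check the result before pointing the domain definition at the new file
    qemu-img info /var/lib/libvirt/images/<volume-uuid>.raw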

Thanks!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On 2 July 2018 12:21 PM, cloudstack-fan <cl...@protonmail.com> wrote:

> Dear colleagues,
>
> I'm posting as an anonymous user because there's a thing that concerns me a little, and I'd like to share my experience with you, as some of you may have run into the same issue. ACS is amazing, it has been solving my tasks for 6 years, and I'm running a few ACS-backed clouds that contain hundreds and hundreds of VMs. I'm enjoying ACS very much, but there's one thing that scares me sometimes.
>
> It happens pretty seldom, but the more VMs you have, the higher the chances of running into this glitch. It usually happens on the sly: you don't get any error messages in the log-files of your cloudstack-management server or cloudstack-agent, so you don't even know that something has happened until you see that a virtual machine is having major problems. If you're lucky, you notice it on the same day it happens; if you aren't, you won't suspect anything unusual for a week, until at some moment you realize that the filesystem has become a mess and you can't do anything to restore it. You try to restore it from a snapshot, but unless you have a snapshot that was created before the incident, your snapshots won't help. :-(
>
> I have experienced it about 5-7 times during the last 5-6 years, and there are a few conditions that are always present:
>  * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS 7) with qcow2 images (both the 0.10 and 1.1 versions);
>  * it happens on primary storages running different filesystems (I experienced it with local XFS and network-based GFS2 and NFS);
>  * it happens when a volume snapshot is being made, according to the log-files inside the VM (the guest operating system's kernel starts complaining about filesystem errors);
>  * at the same time, as I wrote before, there are NO error messages in the log-files outside of the VM whose disk image is corrupted;
>  * but when you run `qemu-img check ...` against the image, you may see a lot of leaked clusters (that's why I'd strongly advise checking each and every image on each and every primary storage at least once per hour with a script run by your monitoring system; an expanded sketch follows this list, but it's something like `for imagefile in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do { /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi; } done`);
>  * when it happens, you can also find a record in the snapshot_store_ref table that refers to the snapshot on a primary storage (see an example here: https://pastebin.com/BuxCXVSq) - this record should have been removed when the snapshot's state changed from "BackingUp" to "BackedUp", but in this case it isn't removed. At the same time, this snapshot isn't listed in the output of `qemu-img snapshot -l ...`, which is why I suppose that the image gets corrupted when ACS deletes the snapshot that has been backed up (it tries to delete the snapshot, something goes wrong and the image gets corrupted, but ACS thinks that everything's fine and changes the status to "BackedUp" without a qualm);
>  * if you try to restore this VM's image from the same snapshot that caused the damage, or from any snapshot made after it, you'll find the same corrupted filesystem inside, yet the snapshot image stored on your secondary storage doesn't show anything wrong when you run `qemu-img check ...` (so you can restore your image only if you have a snapshot that was created AND stored before the incident).
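>
> An expanded version of that hourly check could look something like this (just a sketch; the path and the reporting part are placeholders to adapt to your monitoring system, and note that running `qemu-img check` against an image a VM is actively writing to can report spurious problems):
>
>     #!/bin/bash
>     # hypothetical hourly check of all disk images on a primary storage
>     IMAGE_DIR="/var/lib/libvirt/images"
>
>     find "${IMAGE_DIR}" -maxdepth 1 -type f | while read -r imagefile; do
>         /usr/bin/qemu-img check "${imagefile}" > /dev/null 2>&1
>         rc=${?}
>         if [[ ${rc} -ne 0 ]]; then
>             # as far as I remember, rc=3 means leaked clusters only, rc=2 means actual corruption
>             echo "qemu-img check failed (rc=${rc}): ${imagefile}"
>         fi
>     done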
>
> As I wrote, I have seen this several times in different environments and with different versions of ACS. I'm pretty sure I'm not the only one who has had the luck to experience this glitch, so let's share our stories. Maybe together we'll find out why it happens and how to prevent it in the future.
>
> Thanks in advance,
> An Anonymous ACS Fan

RE: qemu2 images are being corrupted

Posted by Nicolas Bouige <n....@dimsi.fr>.
Hi All,

Maybe this is not related, but it seems to be a known issue that qemu can corrupt .qcow2 images with internal snapshots:

https://www.linux-kvm.org/images/6/65/02x08B-Max_Reitz-Backups_with_QEMU.pdf (slide 13/15)
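
If internal snapshots really are the trigger, it might be worth checking whether any image on a primary storage still carries one, for example (just a sketch, the path is an example; no output means no internal snapshots are present):

    qemu-img snapshot -l /var/lib/libvirt/images/<volume-uuid>.qcow2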

Nicolas Bouige
DIMSI
cloud.dimsi.fr (http://www.cloud.dimsi.fr)
4, avenue Laurent Cely
Tour d’Asnière – 92600 Asnière sur Seine
T/ +33 (0)6 28 98 53 40

