Posted to users@cloudstack.apache.org by Jayanth Reddy <ja...@gmail.com> on 2023/10/31 05:07:40 UTC

Strange issue with CloudStack and Ceph

Hello Users,

We have the environment as below:
CloudStack version: v4.16.1.0
Management and hypervisors: Ubuntu Server 20.04 LTS
Hypervisor: KVM
Primary Storage: Ceph RBD (cluster scoped)

One of our clusters, consisting of 8 hosts, appears to be affected. We run
HCI on these 8 hosts, with 700+ VMs running. Strangely enough, we see log
entries like the following on the hosts.

```
Oct 25 13:38:11 hv-01 libvirtd[9464]: failed to open the RBD image
'087bb114-448a-41d2-9f5d-6865b62eed15': No such file or directory
Oct 25 20:35:22 hv-01 libvirtd[9464]: failed to open the RBD image
'ccc1168a-5ffa-4b6d-a953-8e0ac788ebc5': No such file or directory
Oct 26 09:48:33 hv-01 libvirtd[9464]: failed to open the RBD image
'a3fe82f8-afc9-4604-b55e-91b676514a18': No such file or directory
Oct 26 10:38:17 hv-01 libvirtd[9464]: End of file while reading data:
Input/output error
```
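
In case it helps with diagnosis: a minimal sketch using the python-rados
and python-rbd bindings to check whether one of the named images is
actually present. The conffile path, the 'admin' cephx user, and the
'cloudstack' pool name are assumptions here; substitute whatever your
hypervisors use.

```python
import rados
import rbd

# connect using the hypervisor's ceph.conf and cephx user (both assumed)
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='admin')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('cloudstack')  # pool name is an assumption
    try:
        # opening the image raises ImageNotFound if it is gone
        with rbd.Image(ioctx, '087bb114-448a-41d2-9f5d-6865b62eed15') as img:
            print('image exists, size in bytes:', img.size())
    except rbd.ImageNotFound:
        print('image is missing from the pool')
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```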

We've got DNS servers with an `A` record resolving to the IPv4 addresses
of all 8 monitors, and there have not been any issues with DNS resolution.
But the "failed to open the RBD image
'ccc1168a-5ffa-4b6d-a953-8e0ac788ebc5': No such file or directory" error
gets even weirder, because the VM making use of the referenced RBD image,
say "087bb114-448a-41d2-9f5d-6865b62eed15", is running on an altogether
different host, such as "hv-06". On further inspection of that specific
virtual machine, it has been running on host "hv-06" for more than 4
months (going by the "Last updated" field). Fortunately, the virtual
machine has no issues and has been running fine since then.
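
To double-check which domain (if any) on a given host still references one
of these images, here is a small sketch against the libvirt Python
bindings, run locally on each hypervisor; the image UUID is just the one
from the logs above.

```python
import libvirt

IMAGE = '087bb114-448a-41d2-9f5d-6865b62eed15'

# connect to the local KVM hypervisor and scan every defined domain
conn = libvirt.open('qemu:///system')
for dom in conn.listAllDomains():
    # an RBD-backed disk carries the image name in its <source> element
    if IMAGE in dom.XMLDesc(0):
        print(dom.name(), 'references', IMAGE)
conn.close()
```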

We're noticing the same "failed to open the RBD image" errors on all the
hosts in that cluster. No network issues or host-level problems have been
observed. I was thinking of putting this host into maintenance, but the
same errors appear on every host in the cluster. I haven't had a chance to
look at the management server logs yet, but I was wondering if someone has
run into the same issue.
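
When I do get to the management server, I'd probably start with something
like the sketch below to pull every line mentioning the affected image
UUIDs; the log path is the default for packaged installs and may differ on
your setup.

```python
# scan the CloudStack management server log for the affected image UUIDs
uuids = {
    '087bb114-448a-41d2-9f5d-6865b62eed15',
    'ccc1168a-5ffa-4b6d-a953-8e0ac788ebc5',
}
with open('/var/log/cloudstack/management/management-server.log') as log:
    for line in log:
        if any(u in line for u in uuids):
            print(line.rstrip())
```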

Thanks,
Jayanth