Posted to dev@cloudstack.apache.org by Aaron Hurt <ah...@ena.com> on 2016/06/30 16:29:55 UTC

Ceph RBD related host agent segfault

In preparation to roll out a new platform built on 4.8 with a Ceph storage backend, we’ve been encountering segfaults that appear to be related to snapshot operations performed via rados-java (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:

lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java

                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
                        image.snapUnprotect(snap.name);
                    } else {
                        s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
                    }
                    s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
                    image.snapRemove(snap.name);
                }

Should we be checking whether the unprotect actually succeeded before attempting to remove the snapshot?
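
For illustration only (this is not a proposed patch, and the skip-and-warn behaviour is just one possible way to handle it, assuming snapUnprotect signals failure by throwing an RbdException as rados-java does), the kind of guard we have in mind would look roughly like this:

    for (RbdSnapInfo snap : snaps) {
        boolean unprotected = true;
        if (image.snapIsProtected(snap.name)) {
            s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
            try {
                image.snapUnprotect(snap.name);
            } catch (RbdException e) {
                // Unprotect failed (e.g. permissions, or a clone still attached to the
                // snapshot); skip the remove instead of letting it fail further down.
                unprotected = false;
                s_logger.warn("Failed to unprotect snapshot " + snap.name + ": " + e.getMessage());
            }
        }
        if (unprotected) {
            s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
            image.snapRemove(snap.name);
        }
    }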

Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality, and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine from running simultaneously.
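
To make the concurrency concern concrete, here is a purely hypothetical sketch (none of these names exist in the tree) of the kind of per-image guard that could serialize deletePhysicalDisk and the cleanup path:

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical helper, not code from the CloudStack tree: one lock object per RBD
    // image name so deletePhysicalDisk and the cleanup routine never overlap on an image.
    final class RbdImageLocks {
        private static final ConcurrentHashMap<String, Object> LOCKS = new ConcurrentHashMap<>();

        static void withImageLock(String imageName, Runnable action) {
            Object lock = LOCKS.computeIfAbsent(imageName, k -> new Object());
            synchronized (lock) {
                action.run(); // only one caller operates on this image at a time
            }
        }
    }

Both deletePhysicalDisk and the snapshot cleanup loop would then wrap their librbd calls in withImageLock(uuid, ...), so the two paths could no longer race on the same image.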


To Reproduce (with ceph/rbd primary storage)

1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval, e.g. 10 seconds
3.  Restart management server
4.  Deploy several VMs from templates
5.  Destroy+expunge the VMs after they are running
6.  Observe segfaults of the host agent


Workaround

We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.

Segfault Logs

https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d

https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56

We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.

— Aaron

Re: Ceph RBD related host agent segfault

Posted by Aaron Hurt <ah...@ena.com>.
> On Jul 8, 2016, at 2:53 AM, Wido den Hollander <wi...@widodh.nl> wrote:
> 
> 
>> On 7 July 2016 at 6:35, Aaron Hurt <ah...@ena.com> wrote:
>> 
>> 
>> 
>>> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wi...@widodh.nl> wrote:
>>> 
>>>> 
>>>> On 6 July 2016 at 16:18, Aaron Hurt <ahurt@ena.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl> wrote:
>>>>> 
>>>>>> 
>>>>>> On 30 June 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> In preparation to roll out a new platform built on 4.8 with a Ceph storage backend, we’ve been encountering segfaults that appear to be related to snapshot operations performed via rados-java (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
>>>>>> 
>>>>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>>>>>> 
>>>>>>              for (RbdSnapInfo snap : snaps) {
>>>>>>                  if (image.snapIsProtected(snap.name)) {
>>>>>>                      s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                      image.snapUnprotect(snap.name);
>>>>>>                  } else {
>>>>>>                      s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>>>                  }
>>>>>>                  s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                  image.snapRemove(snap.name);
>>>>>>              }
>>>>>> 
>>>>>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
>>>>>> 
>>>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality, and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine from running simultaneously.
>>>>>> 
>>>>>> 
>>>>>> To Reproduce (with ceph/rbd primary storage)
>>>>>> 
>>>>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
>>>>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
>>>>>> 3.  Restart management server
>>>>>> 4.  Deploy several VMs from templates
>>>>>> 5.  Destroy+expunge the VMs after they are running
>>>>>> 6.  Observe segfaults in management server
>>>>>> 
>>>>>> 
>>>>>> Workaround
>>>>>> 
>>>>>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
>>>>>> 
>>>>>> Segfault Logs
>>>>>> 
>>>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>>>> 
>>>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>>>> 
>>>>>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
>>>>> 
>>>>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think that's the case here, however.
>>>>> 
>>>>> It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
>>>>> 
>>>>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
>>>>> 
>>>>> The assert shows this also happens when Java is talking to libvirt. I guess it's a librados bug, but I'm not completely sure.
>>>>> 
>>>>> Wido
>>>> 
>>>> 
>>>> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
>>>> 
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>>>> 
>>>> Is there something obvious I’m missing here?
>>>> 
>>> 
>>> I am confused. In this path you are not even touching Ceph, but still it crashes on librados.
>>> 
>>> Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314
>>> 
>>> Isn't there a package version mismatch in your Ceph cluster?
>>> 
>>> Wido
>>> 
>> 
>> I agree it’s very confusing and I’m running out of ideas as to what the cause may be.
>> 
>> Here are the package versions on all our related boxes in the lab.
>> 
>> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
>> 
>> I’ve also collected the most recent instances of our two segfaults/aborts below.
>> 
>> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
>> 
>> http://sprunge.us/LcYA
>> 
>> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
>> 
>> http://sprunge.us/SiCf
>> 
> 
> Looking at this I come to the conclusion that this is a librados bug and not rados-java nor CloudStack.
> 
> The crashes are happening with exactly the same backtrace. I have a few clusters running with Hammer 0.94.5 and they all clean up their snapshots just fine, no crashes.
> 
>> I also went back to look at the two places in the code where snapshot cleanup is taking place in our tree:
>> 
>> The place where the failed to unprotect exceptions seem to be triggered:
>> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840
>> 
>> The cleanup code for rbd snapshots:
>> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289
>> 
>> I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block, because I thought there might be a problem with those not being closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java is slightly different from the mainline 4.8 branch.  Here is the pr/diff in our fork that shows the changes.
>> 
>> https://github.com/myENA/cloudstack/pull/11/files
>> 
>> Does this make sense?  Is this even possibly related to the issue I’m seeing?
> 
> The patch looks sane, you can catch more exceptions, but still, such a hard crash of Java/librados can't easily be triggered from Java. I truly think this is a librados bug.
> 
> Wido
> 
>> 
>> Once again, all the advice and attention is definitely appreciated.
>> 
>> — Aaron
>> 

We finally solved the issue.  The exceptions were being thrown due to a malformed/typoed ACL (“reb_children” instead of “rbd_children”) that caused librbd to return “Operation not permitted” when attempting to unprotect the base snapshot.  That exception, combined with a failure to clean up contexts when exceptions occurred within the loop in deletePhysicalDisk, eventually caused the host agent to crash.  We fixed the access list and issued a PR to ensure contexts are cleaned up even when exceptions occur.  In addition to the context cleanup we also added extra logging to help troubleshoot similar issues if/when they happen again in the future.
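
For the archives, this is roughly the shape of the change (a simplified sketch, not the literal diff from the PR; error handling and the rest of deletePhysicalDisk are trimmed): keep the rados/rbd handles outside the try and release them in a finally, so the agent no longer leaks contexts when unprotect or remove throws.

    Rados r = new Rados(pool.getAuthUserName());
    r.confSet("mon_host", pool.getSourceHost() + ":" + pool.getSourcePort());
    r.confSet("key", pool.getAuthSecret());
    r.connect();

    IoCTX io = null;
    Rbd rbd = null;
    RbdImage image = null;
    try {
        io = r.ioCtxCreate(pool.getSourceDir());
        rbd = new Rbd(io);
        image = rbd.open(uuid);
        for (RbdSnapInfo snap : image.snapList()) {
            if (image.snapIsProtected(snap.name)) {
                image.snapUnprotect(snap.name);   // with the bad ACL this threw "Operation not permitted"
            }
            image.snapRemove(snap.name);
        }
        rbd.remove(uuid);
    } finally {
        // Context/image cleanup now runs even when an RbdException is thrown above.
        if (rbd != null && image != null) {
            rbd.close(image);
        }
        if (io != null) {
            r.ioCtxDestroy(io);
        }
    }

Exceptions still propagate to the caller as before; the point is only that the image handle and ioctx are always released.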

PR for the fix:

https://github.com/apache/cloudstack/pull/1608

Thank you again for all the support.

— Aaron


Re: Ceph RBD related host agent segfault

Posted by Wido den Hollander <wi...@widodh.nl>.
> On 7 July 2016 at 6:35, Aaron Hurt <ah...@ena.com> wrote:
> 
> 
> 
> > On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> > 
> >> 
> >> On 6 July 2016 at 16:18, Aaron Hurt <ahurt@ena.com> wrote:
> >> 
> >> 
> >> 
> >>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl> wrote:
> >>> 
> >>>> 
> >>>> On 30 June 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
> >>>> 
> >>>> 
> >>>> In preparation to roll out a new platform built on 4.8 with a Ceph storage backend, we’ve been encountering segfaults that appear to be related to snapshot operations performed via rados-java (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
> >>>> 
> >>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> >>>> 
> >>>>               for (RbdSnapInfo snap : snaps) {
> >>>>                   if (image.snapIsProtected(snap.name)) {
> >>>>                       s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>                       image.snapUnprotect(snap.name);
> >>>>                   } else {
> >>>>                       s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
> >>>>                   }
> >>>>                   s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>                   image.snapRemove(snap.name);
> >>>>               }
> >>>> 
> >>>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
> >>>> 
> >>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality, and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine from running simultaneously.
> >>>> 
> >>>> 
> >>>> To Reproduce (with ceph/rbd primary storage)
> >>>> 
> >>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
> >>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> >>>> 3.  Restart management server
> >>>> 4.  Deploy several VMs from templates
> >>>> 5.  Destroy+expunge the VMs after they are running
> >>>> 6.  Observe segfaults in management server
> >>>> 
> >>>> 
> >>>> Workaround
> >>>> 
> >>>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
> >>>> 
> >>>> Segfault Logs
> >>>> 
> >>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> >>>> 
> >>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> >>>> 
> >>>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
> >>> 
> >>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think that's the case here, however.
> >>> 
> >>> It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
> >>> 
> >>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
> >>> 
> >>> The assert shows this also happens when Java is talking to libvirt. I guess it's a librados bug, but I'm not completely sure.
> >>> 
> >>> Wido
> >> 
> >> 
> >> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
> >> 
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
> >> 
> >> Is there something obvious I’m missing here?
> >> 
> > 
> > I am confused. In this path you are not even touching Ceph, but still it crashes on librados.
> > 
> > Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314 <http://tracker.ceph.com/issues/14314>
> > 
> > Isn't there a package version mismatch in your Ceph cluster?
> > 
> > Wido
> > 
> 
> I agree it’s very confusing and I’m running out of ideas as to what the cause may be.
> 
> Here are the package versions on all our related boxes in the lab.
> 
> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
> 
> I’ve also collected the most recent instances of our two segfaults/aborts below.
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
> 
> http://sprunge.us/LcYA
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
> 
> http://sprunge.us/SiCf
> 

Looking at this I come to the conclusion that this is a librados bug and not rados-java nor CloudStack.

The crashes are happening with exactly the same backtrace. I have a few clusters running with Hammer 0.94.5 and they all clean up their snapshots just fine, no crashes.

> I also went back to look at the two places in the code where snapshot cleanup is taking place in our tree:
> 
> The place where the ‘failed to unprotect’ exceptions seem to be triggered:
> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840 <https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840>
> 
> The cleanup code for rbd snapshots:
> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289 <https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289>
> 
> I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block because I thought there may be a problem with those not being closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java is slightly different than the mainline 4.8 branch.  Here is the pr/diff in our fork that shows the changes.
> 
> https://github.com/myENA/cloudstack/pull/11/files
> 
> Does this make sense?  Is this even possibly related to the issue I’m seeing?

The patch looks sane, you can catch more exceptions, but still, such a true crash of Java/librados can't be triggered easily from Java. I truly think this is a librados bug.

Wido

> 
> Once again, all the advice and attention is definitely appreciated.
> 
> — Aaron
>

Re: Ceph RBD related host agent segfault

Posted by Wido den Hollander <wi...@widodh.nl>.
> Op 7 juli 2016 om 6:35 schreef Aaron Hurt <ah...@ena.com>:
> 
> 
> 
> > On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> > 
> >> 
> >> Op 6 juli 2016 om 16:18 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com>>:
> >> 
> >> 
> >> 
> >>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl <ma...@widodh.nl>> wrote:
> >>> 
> >>>> 
> >>>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com> <mailto:ahurt@ena.com <ma...@ena.com>>>:
> >>>> 
> >>>> 
> >>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
> >>>> 
> >>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> >>>> 
> >>>>               for (RbdSnapInfo snap : snaps) {
> >>>>                   if (image.snapIsProtected(snap.name)) {
> >>>>                       s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>                       image.snapUnprotect(snap.name);
> >>>>                   } else {
> >>>>                       s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
> >>>>                   }
> >>>>                   s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>                   image.snapRemove(snap.name);
> >>>>               }
> >>>> 
> >>>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
> >>>> 
> >>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
> >>>> 
> >>>> 
> >>>> To Reproduce (with ceph/rbd primary storage)
> >>>> 
> >>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
> >>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> >>>> 3.  Restart management server
> >>>> 4.  Deploy several VMs from templates
> >>>> 5.  Destroy+expunge the VMs after they are running
> >>>> 6.  Observe segfaults in management server
> >>>> 
> >>>> 
> >>>> Workaround
> >>>> 
> >>>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
> >>>> 
> >>>> Segfault Logs
> >>>> 
> >>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> >>>> 
> >>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> >>>> 
> >>>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
> >>> 
> >>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think it's the case however.
> >>> 
> >>> It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
> >>> 
> >>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
> >>> 
> >>> The assert shows this also happens when Java is talking to libvirt. I guess a librados bug, but not completely sure.
> >>> 
> >>> Wido
> >> 
> >> 
> >> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
> >> 
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
> >> 
> >> Is there something obvious I’m missing here?
> >> 
> > 
> > I am confused. In this path you are not even touching Ceph, but still it crashes on librados.
> > 
> > Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314 <http://tracker.ceph.com/issues/14314>
> > 
> > Isn't there a package version mismatch in your Ceph cluster?
> > 
> > Wido
> > 
> 
> I agree it’s very confusing and I’m running out of ideas as to what the cause may be.
> 
> Here are the package versions on all our related boxes in the lab.
> 
> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
> 
> I’ve also collected the most recent instances of our two segfaults/aborts below.
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
> 
> http://sprunge.us/LcYA
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
> 
> http://sprunge.us/SiCf
> 

Looking at this I come to the conclusion that this is a librados bug and not rados-java nor CloudStack.

The crashes are happening with exactly the same backtrace. I have a few clusters running with Hammer 0.94.5 and they all clean up their snapshots just fine, no crashes.

> I also went back to look at the two places in the code where snapshot cleanup is taking place in our tree:
> 
> The place where the ‘failed to unprotect’ exceptions seem to be triggered:
> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840 <https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840>
> 
> The cleanup code for rbd snapshots:
> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289 <https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289>
> 
> I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block because I thought there may be a problem with those not being closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java is slightly different than the mainline 4.8 branch.  Here is the pr/diff in our fork that shows the changes.
> 
> https://github.com/myENA/cloudstack/pull/11/files
> 
> Does this make sense?  Is this even possibly related to the issue I’m seeing?

The patch looks sane, you can catch more exceptions, but still, such a true crash of Java/librados can't be triggered easily from Java. I truly think this is a librados bug.
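
As an illustration of catching more exceptions (a sketch only, not committed CloudStack code), the per-snapshot calls could be guarded so a failed unprotect is logged and skipped instead of falling through to snapRemove(); the image, snap and s_logger names are assumed from the loop in LibvirtStorageAdaptor.java quoted earlier in the thread:

    for (RbdSnapInfo snap : image.snapList()) {
        try {
            if (image.snapIsProtected(snap.name)) {
                image.snapUnprotect(snap.name);
            }
            image.snapRemove(snap.name);
        } catch (RbdException e) {
            // Log and move on to the next snapshot rather than letting the
            // exception propagate out of the cleanup loop.
            s_logger.warn("Failed to clean up snapshot " + snap.name + ": " + e.getMessage());
        }
    }

That only changes what Java does with the error, though; it would not explain the assert firing inside librados itself.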

Wido

> 
> Once again, all the advice and attention is definitely appreciated.
> 
> — Aaron
>

Re: Ceph RBD related host agent segfault

Posted by Aaron Hurt <ah...@ena.com>.
> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> 
>> 
>> Op 6 juli 2016 om 16:18 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com>>:
>> 
>> 
>> 
>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl <ma...@widodh.nl>> wrote:
>>> 
>>>> 
>>>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com> <mailto:ahurt@ena.com <ma...@ena.com>>>:
>>>> 
>>>> 
>>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
>>>> 
>>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>>>> 
>>>>               for (RbdSnapInfo snap : snaps) {
>>>>                   if (image.snapIsProtected(snap.name)) {
>>>>                       s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                       image.snapUnprotect(snap.name);
>>>>                   } else {
>>>>                       s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>                   }
>>>>                   s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                   image.snapRemove(snap.name);
>>>>               }
>>>> 
>>>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
>>>> 
>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
>>>> 
>>>> 
>>>> To Reproduce (with ceph/rbd primary storage)
>>>> 
>>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
>>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
>>>> 3.  Restart management server
>>>> 4.  Deploy several VMs from templates
>>>> 5.  Destroy+expunge the VMs after they are running
>>>> 6.  Observe segfaults in management server
>>>> 
>>>> 
>>>> Workaround
>>>> 
>>>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
>>>> 
>>>> Segfault Logs
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>> 
>>>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
>>> 
>>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think it's the case however.
>>> 
>>> It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
>>> 
>>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
>>> 
>>> The assert shows this also happens when Java is talking to libvirt. I guess a librados bug, but not completely sure.
>>> 
>>> Wido
>> 
>> 
>> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
>> 
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>> 
>> Is there something obvious I’m missing here?
>> 
> 
> I am confused. In this path you are not even touching Ceph, but still it crashes on librados.
> 
> Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314 <http://tracker.ceph.com/issues/14314>
> 
> Isn't there a package version mismatch in your Ceph cluster?
> 
> Wido
> 

I agree it’s very confusing and I’m running out of ideas as to what the cause may be.

Here are the package versions on all our related boxes in the lab.

http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g

I’ve also collected the most recent instances of our two segfaults/aborts below.

journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted

http://sprunge.us/LcYA

journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException

http://sprunge.us/SiCf

I also went back to look at the two places in the code where snapshot cleanup is taking place in our tree:

The place where the ‘failed to unprotect’ exceptions seem to be triggered:
https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840 <https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840>

The cleanup code for rbd snapshots:
https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289 <https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289>

I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block because I thought there may be a problem with those not being closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java is slightly different than the mainline 4.8 branch.  Here is the pr/diff in our fork that shows the changes.

https://github.com/myENA/cloudstack/pull/11/files
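
To make the intent concrete, a rough sketch of the pattern the patch is after is below.  This is a hand-written illustration against the rados-java classes already used by LibvirtStorageAdaptor (Rados, IoCTX, Rbd, RbdImage), not the literal diff from the PR above; the pool and uuid variables are assumed from the surrounding deletePhysicalDisk() code, and checked-exception handling is omitted for brevity.

    Rados r = new Rados(pool.getAuthUserName());
    r.confSet("mon_host", pool.getSourceHost() + ":" + pool.getSourcePort());
    r.confSet("key", pool.getAuthSecret());
    r.connect();

    IoCTX io = r.ioCtxCreate(pool.getSourceDir());
    Rbd rbd = new Rbd(io);
    RbdImage image = rbd.open(uuid);
    try {
        for (RbdSnapInfo snap : image.snapList()) {
            if (image.snapIsProtected(snap.name)) {
                image.snapUnprotect(snap.name);
            }
            image.snapRemove(snap.name);
        }
        rbd.remove(uuid);
    } finally {
        // Release the image handle and the ioctx even when unprotect/remove
        // throws, so native librbd/librados state isn't left dangling in the
        // agent JVM.
        rbd.close(image);
        r.ioCtxDestroy(io);
    }

Whether releasing the handles on every path actually avoids the librados assert is unproven; the finally block only guarantees the cleanup calls are reached.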

Does this make sense?  Is this even possibly related to the issue I’m seeing?

Once again, all the advice and attention is definitely appreciated.

— Aaron


Re: Ceph RBD related host agent segfault

Posted by Aaron Hurt <ah...@ena.com>.
> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> 
>> 
>> Op 6 juli 2016 om 16:18 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com>>:
>> 
>> 
>> 
>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl <ma...@widodh.nl>> wrote:
>>> 
>>>> 
>>>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ahurt@ena.com <ma...@ena.com> <mailto:ahurt@ena.com <ma...@ena.com>>>:
>>>> 
>>>> 
>>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
>>>> 
>>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>>>> 
>>>>               for (RbdSnapInfo snap : snaps) {
>>>>                   if (image.snapIsProtected(snap.name)) {
>>>>                       s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                       image.snapUnprotect(snap.name);
>>>>                   } else {
>>>>                       s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>                   }
>>>>                   s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                   image.snapRemove(snap.name);
>>>>               }
>>>> 
>>>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
>>>> 
>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230 <https://github.com/apache/cloudstack/pull/1230> <https://github.com/apache/cloudstack/pull/1230 <https://github.com/apache/cloudstack/pull/1230>> <https://github.com/apache/cloudstack/pull/1230 <https://github.com/apache/cloudstack/pull/1230> <https://github.com/apache/cloudstack/pull/1230 <https://github.com/apache/cloudstack/pull/1230>>>) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
>>>> 
>>>> 
>>>> To Reproduce (with ceph/rbd primary storage)
>>>> 
>>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
>>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
>>>> 3.  Restart management server
>>>> 4.  Deploy several VMs from templates
>>>> 5.  Destroy+expunge the VMs after they are running
>>>> 6.  Observe segfaults in management server
>>>> 
>>>> 
>>>> Workaround
>>>> 
>>>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
>>>> 
>>>> Segfault Logs
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d>> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d>>>
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56>> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56>>>
>>>> 
>>>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
>>> 
>>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think it's the case however.
>>> 
>>> It could still be that it tries to remove the master/golden image from the template while it still has childs attached to that snapshot.
>>> 
>>> I'm not sure if this is due to rados-java or a bug in librados. The Java could should just throw a exception and not completely crash the JVM. This happens lower in the code and not in Java.
>>> 
>>> The assert shows this also happens when Java is talking to libvirt. I guess a librados bug, but now completely sure.
>>> 
>>> Wido
>> 
>> 
>> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
>> 
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>> 
>> Is there something obvious I’m missing here?
>> 
> 
> I am confused. In this path you are not even touching Ceph, but still it crashes on librados.
> 
> Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314
> 
> Isn't there a package version mismatch in your Ceph cluster?
> 
> Wido
> 

I agree it’s very confusing and I’m running out of ideas as to what the cause may be.

Here are the package versions on all our related boxes in the lab.

http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g

I’ve also collected the most recent instances of our two segfaults/aborts below.

journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted

http://sprunge.us/LcYA

journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException

http://sprunge.us/SiCf

I also went back to look at the two places in the code where snapshot cleanup is taking place in our tree:

The place where the failed to unprotect exceptions seem to be triggered:
https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840

The cleanup code for rbd snapshots:
https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289

I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block because I thought there may be a problem with those not being closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java is slightly different from the mainline 4.8 branch.  Here is the pr/diff in our fork that shows the changes.

https://github.com/myENA/cloudstack/pull/11/files
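
To make the intent of that change clearer, here is roughly the shape of it (a minimal sketch only, reusing the rados-java calls already present in deletePhysicalDisk; the Rados connection r, pool and uuid are assumed to be set up as in the existing method, and the actual diff is in the PR above):

    IoCTX ioctx = null;
    Rbd rbd = null;
    RbdImage image = null;
    try {
        ioctx = r.ioCtxCreate(pool.getSourceDir());
        rbd = new Rbd(ioctx);
        image = rbd.open(uuid);
        for (RbdSnapInfo snap : image.snapList()) {
            try {
                if (image.snapIsProtected(snap.name)) {
                    image.snapUnprotect(snap.name);
                }
                // only reached when snapUnprotect did not throw
                image.snapRemove(snap.name);
            } catch (RbdException e) {
                // skip the remove when unprotect fails and continue with the next snapshot
                s_logger.warn("Skipping snapshot " + snap.name + ": " + e.getMessage());
            }
        }
        rbd.remove(uuid);
    } finally {
        // always release librbd/librados resources, even when an exception was thrown above
        try {
            if (rbd != null && image != null) {
                rbd.close(image);
            }
        } catch (RbdException e) {
            s_logger.warn("Failed to close RBD image " + uuid + ": " + e.getMessage());
        }
        if (ioctx != null) {
            r.ioCtxDestroy(ioctx);
        }
    }

The idea is simply that a failed snapUnprotect no longer lets snapRemove run against a still-protected snapshot, and that the image and ioctx are always released even when librbd throws.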

Does this make sense?  Is this even possibly related to the issue I’m seeing?

Once again, all the advice and attention is definitely appreciated.

— Aaron


Re: Ceph RBD related host agent segfault

Posted by Wido den Hollander <wi...@widodh.nl>.
> On 6 July 2016 at 16:18, Aaron Hurt <ah...@ena.com> wrote:
> 
> 
> 
> > On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wi...@widodh.nl> wrote:
> > 
> >> 
> >> On 30 June 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
> >> 
> >> 
> >> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
> >> 
> >> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> >> 
> >>                for (RbdSnapInfo snap : snaps) {
> >>                    if (image.snapIsProtected(snap.name)) {
> >>                        s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>                        image.snapUnprotect(snap.name);
> >>                    } else {
> >>                        s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
> >>                    }
> >>                    s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>                    image.snapRemove(snap.name);
> >>                }
> >> 
> >> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
> >> 
> >> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
> >> 
> >> 
> >> To Reproduce (with ceph/rbd primary storage)
> >> 
> >> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
> >> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> >> 3.  Restart management server
> >> 4.  Deploy several VMs from templates
> >> 5.  Destroy+expunge the VMs after they are running
> >> 6.  Observe segfaults in management server
> >> 
> >> 
> >> Workaround
> >> 
> >> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
> >> 
> >> Segfault Logs
> >> 
> >> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> >> 
> >> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> >> 
> >> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
> > 
> > What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think that's the case here, however.
> > 
> > It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
> > 
> > I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
> > 
> > The assert shows this also happens when Java is talking to libvirt. I guess a librados bug, but not completely sure.
> > 
> > Wido
> 
> 
> We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.
> 
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
> 
> Is there something obvious I’m missing here?
> 

I am confused. In this path you are not even touching Ceph, but still it crashes on librados.

Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314

Isn't there a package version mismatch in your Ceph cluster?

Wido

> — Aaron
>


Re: Ceph RBD related host agent segfault

Posted by Aaron Hurt <ah...@ena.com>.
> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wi...@widodh.nl> wrote:
> 
>> 
>> On 30 June 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
>> 
>> 
>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
>> 
>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>> 
>>                for (RbdSnapInfo snap : snaps) {
>>                    if (image.snapIsProtected(snap.name)) {
>>                        s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>                        image.snapUnprotect(snap.name);
>>                    } else {
>>                        s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>                    }
>>                    s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>                    image.snapRemove(snap.name);
>>                }
>> 
>> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
>> 
>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
>> 
>> 
>> To Reproduce (with ceph/rbd primary storage)
>> 
>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
>> 3.  Restart management server
>> 4.  Deploy several VMs from templates
>> 5.  Destroy+expunge the VMs after they are running
>> 6.  Observe segfaults in management server
>> 
>> 
>> Workaround
>> 
>> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
>> 
>> Segfault Logs
>> 
>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>> 
>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>> 
>> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.
> 
> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don't think that's the case here, however.
> 
> It could still be that it tries to remove the master/golden image from the template while it still has children attached to that snapshot.
> 
> I'm not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the code and not in Java.
> 
> The assert shows this also happens when Java is talking to libvirt. I guess a librados bug, but not completely sure.
> 
> Wido


We’re seeing this happen around other issues and it does seem to be related to java-rados and the JNA wrappings around librbd.  This is an exception that just occurred this morning when performing a load balancer update.

Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0 warning
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…

Is there something obvious I’m missing here?

— Aaron


Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…

Is there something obvious I’m missing here?

— Aaron


Re: Ceph RBD related host agent segfault

Posted by Wido den Hollander <wi...@widodh.nl>.
> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ah...@ena.com>:
> 
> 
> In preparation to roll a new platform built on 4.8 with a Ceph storage backend we’ve been encountering segfaults that appear to be related to snapshot operations via java-jados (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the code:
> 
> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> 
>                 for (RbdSnapInfo snap : snaps) {
>                     if (image.snapIsProtected(snap.name)) {
>                         s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                         image.snapUnprotect(snap.name);
>                     } else {
>                         s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>                     }
>                     s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                     image.snapRemove(snap.name);
>                 }
> 
> Should we be checking if the unprotect actually failed/succeeded before attempting to remove the snapshot?
> 
> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230 <https://github.com/apache/cloudstack/pull/1230>) duplicates some of this functionality and there doesn’t seem to be any protection preventing deletePhysicalDisk and the cleanup routine being run simultaneously.
> 
> 
> To Reproduce (with ceph/rbd primary storage)
> 
> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> 3.  Restart management server
> 4.  Deploy several VMs from templates
> 5.  Destroy+expunge the VMs after they are running
> 6.  Observe segfaults in management server
> 
> 
> Workaround
> 
> We’ve been able to eliminate the segfaults of the host agent in our testing by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll intervals.
> 
> Segfault Logs
> 
> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d>
> 
> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56>
> 
> We would really appreciate any feedback and/or confirmation from the community around the above issues.  I’d also be happy to provide any additional information needed to get this addressed.

What seems to be happening is that it failed to unprotect the snapshot of the volume. There could be various reasons for this, for example a child image still attached to the snapshot. I don't think that's the case here, however.

It could still be that it tries to remove the master/golden image of the template while there are still children attached to that snapshot.
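
Just for illustration, that can be checked from the Ceph side with the rbd CLI — the pool, image and snapshot names below are placeholders, not the actual ones from this setup:

    # list clones still attached to the template's base snapshot
    rbd children <pool>/<template-image>@<base-snapshot>
    # if any children are listed, the snapshot cannot be unprotected or removed yet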

I'm not sure whether this is due to rados-java or a bug in librados. The Java code should just throw an exception and not crash the JVM entirely; the failure happens lower in the stack, not in Java.

The assert shows this also happens while Java is talking to libvirt. My guess is a librados bug, but I'm not completely sure.
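
As a minimal sketch of the defensive handling asked about above (not a tested patch; it assumes rados-java's RbdImage methods throw RbdException on failure), the loop could skip snapRemove when unprotect fails instead of letting the agent die:

    // Hypothetical rework of the loop from LibvirtStorageAdaptor.java:
    // only remove a snapshot once it is known to be unprotected, and log
    // instead of proceeding when librbd reports an error.
    // (RbdException here is com.ceph.rbd.RbdException from rados-java)
    for (RbdSnapInfo snap : snaps) {
        try {
            if (image.snapIsProtected(snap.name)) {
                s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
                image.snapUnprotect(snap.name);
            }
            s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
            image.snapRemove(snap.name);
        } catch (RbdException e) {
            s_logger.warn("Failed to unprotect/remove snapshot " + pool.getSourceDir() + "/" + uuid
                    + "@" + snap.name + ", skipping: " + e.getMessage());
        }
    }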

Wido

> 
> — Aaron
