Posted to commits@cloudstack.apache.org by GitBox <gi...@apache.org> on 2021/06/29 07:24:32 UTC

[GitHub] [cloudstack] GabrielBrascher opened a new pull request #4978: KVM High Availability regardless of storage

GabrielBrascher opened a new pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978


   ### Description
   
   Currently, the KVM HA implementation works only if the cluster has at least one primary storage pool served via NFS, because it relies on the NFS heartbeat script to check whether the host is healthy. This PR adds health checks that work regardless of the storage pool, using a Java client that checks the Agent status via a web server.
   
   The additional web server exposes a simple JSON API that returns the list of virtual machines running on that host according to Libvirt. This way, KVM HA can verify VM status via Libvirt with an HTTP call to this web server and determine whether the host is actually down or only the Java Agent has crashed.
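
As a rough illustration of the decision described above, here is a minimal Java sketch. The class, enum, and method names are hypothetical; the only detail borrowed from the PR is the `count` field name, which appears as a constant in `KvmHaAgentClient`. The rest of the response shape is an assumption, and a regex stands in for the Gson parsing the real client uses, so that the example needs nothing beyond the JDK.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class, enum, and method names are illustrative, not the PR's code.
class HaHelperCheckSketch {

    enum Verdict { HOST_HEALTHY, AGENT_DOWN_HOST_ALIVE, HOST_POSSIBLY_DOWN }

    // Assumed response shape: a JSON object with a numeric "count" field, e.g. {"count": 3}.
    private static final Pattern COUNT_FIELD = Pattern.compile("\"count\"\\s*:\\s*(\\d{1,9})");

    // Extracts the VM count from the helper's response body; empty if there is no usable body.
    static OptionalInt parseVmCount(String body) {
        if (body == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = COUNT_FIELD.matcher(body);
        return matcher.find() ? OptionalInt.of(Integer.parseInt(matcher.group(1))) : OptionalInt.empty();
    }

    // Combines the two signals: whether the Java Agent answered, and what the helper returned
    // (null when the helper could not be reached).
    static Verdict decide(boolean agentResponds, String helperBody) {
        if (agentResponds) {
            return Verdict.HOST_HEALTHY;
        }
        return parseVmCount(helperBody).isPresent() ? Verdict.AGENT_DOWN_HOST_ALIVE : Verdict.HOST_POSSIBLY_DOWN;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, "{\"count\": 3}"));
        System.out.println(decide(false, null));
    }
}
```

The point of the sketch is only that a reachable helper separates "agent crashed" from "host down"; how the real HA provider weighs that result is decided by the HA state machine, not by a single call like this.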
   
   #### New KVM HA Helper component
   The following image shows how the new KVM HA Helper web service is integrated. The current NFS heartbeat execution flow is still used, alongside the new HA Helper.
   
   <p align="center">
     <img width="460" height="300" src="https://user-images.githubusercontent.com/5025148/122809301-522bbe00-d2a4-11eb-9ebd-548d4f74b5fe.png">
   </p>
   
   #### High Availability Workflow
   
   Proposed workflow where the HA Check takes into account both **NFS Heartbeat** and the **KVM HA Helper** checks.
   
   **Note that**, to simplify the diagram, the whole [HA state machine](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA) is omitted. If both the NFS and HA Helper checks fail, the host is not necessarily recovered or fenced right away: depending on the HA configuration, the checks are repeated until a threshold of accepted failures is reached.
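
The threshold behaviour in the note can be sketched roughly as follows. This is a simplified, hypothetical Java model, not the HA state machine itself: the class name, the rule that a round fails only when both checks fail, and the reset on any successful round are all assumptions made for illustration.

```java
// Hypothetical sketch of "re-check until a threshold of accepted failures is reached".
class FailureThresholdSketch {

    private final int maxConsecutiveFailures;
    private int consecutiveFailures;

    FailureThresholdSketch(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Records one check round and returns true only once the threshold is reached.
    // Assumption: a round counts as failed only if both the NFS heartbeat and the HA Helper fail.
    boolean recordCheck(boolean nfsHeartbeatOk, boolean haHelperOk) {
        if (nfsHeartbeatOk || haHelperOk) {
            consecutiveFailures = 0; // assumption: any healthy round resets the counter
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    public static void main(String[] args) {
        FailureThresholdSketch tracker = new FailureThresholdSketch(3);
        for (int round = 1; round <= 3; round++) {
            System.out.println("round " + round + " -> act: " + tracker.recordCheck(false, false));
        }
    }
}
```

With a threshold of 3, the first two failed rounds only increment the counter; recovery or fencing would be considered from the third consecutive failure onwards.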
   
   <p align="center">
     <img width="400" height="500" src="https://user-images.githubusercontent.com/5025148/122818822-0a129880-d2b0-11eb-8085-226eb900a2f1.png">
   </p>
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing functionality to change)
   - [x] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ### Feature/Enhancement Scale or Bug Severity
   
   #### Feature/Enhancement Scale
   
   - [x] Major
   - [ ] Minor
   
   <!-- ### How Has This Been Tested? -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899832773


   @blueorangutan package





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655691011



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GabrielBrascher indeed, I see no way for it to throw an exception here








[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-889588714


   <b>Trillian test result (tid-1406)</b>
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 60669 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4978-t1406-kvm-centos7.zip
   Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
   Intermittent failure detected: /marvin/tests/smoke/test_network.py
   Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
   Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
   Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
   Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
   Smoke tests completed. 84 look OK, 5 have error(s)
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_01_invalid_upgrade_kubernetes_cluster | `Failure` | 3609.09 | test_kubernetes_clusters.py
   test_02_deploy_and_upgrade_kubernetes_cluster | `Failure` | 3611.77 | test_kubernetes_clusters.py
   test_03_deploy_and_scale_kubernetes_cluster | `Failure` | 0.05 | test_kubernetes_clusters.py
   test_04_basic_lifecycle_kubernetes_cluster | `Failure` | 0.05 | test_kubernetes_clusters.py
   test_05_delete_kubernetes_cluster | `Failure` | 0.04 | test_kubernetes_clusters.py
   test_07_deploy_kubernetes_ha_cluster | `Failure` | 0.04 | test_kubernetes_clusters.py
   test_08_deploy_and_upgrade_kubernetes_ha_cluster | `Failure` | 0.05 | test_kubernetes_clusters.py
   test_09_delete_kubernetes_ha_cluster | `Failure` | 0.04 | test_kubernetes_clusters.py
   ContextSuite context=TestKubernetesCluster>:teardown | `Error` | 44.49 | test_kubernetes_clusters.py
   test_02_isolate_network_FW_PF_default_routes_egress_false | `Failure` | 96.70 | test_routers_network_ops.py
   test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | `Failure` | 354.37 | test_routers_network_ops.py
   test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | `Failure` | 582.87 | test_vpc_redundant.py
   test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | `Failure` | 498.28 | test_vpc_redundant.py
   test_01_redundant_vpc_site2site_vpn | `Failure` | 407.94 | test_vpc_vpn.py
   test_01_vpc_site2site_vpn_multiple_options | `Failure` | 300.57 | test_vpc_vpn.py
   test_01_vpc_site2site_vpn | `Failure` | 247.50 | test_vpc_vpn.py
   test_hostha_enable_ha_when_host_disabled | `Error` | 1.52 | test_hostha_kvm.py
   test_hostha_enable_ha_when_host_in_maintenance | `Error` | 303.75 | test_hostha_kvm.py
   





[GitHub] [cloudstack] GabrielBrascher removed a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher removed a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-888274981


   @blueorangutan package





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-875997162


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher edited a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher edited a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899835219


   For reference, the tests that failed [here](https://github.com/apache/cloudstack/pull/4978#issuecomment-899504564) are:
   
   1. test_hostha_enable_ha_when_host_disabled
   2. test_hostha_enable_ha_when_host_in_maintenance
   
   I am still checking why they are failing and if it is related to this specific PR.
   
   Logs:
   ```
   testcase classname="tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_configure_default_driver" time="0.719"
   "tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_disabled" time="1.115" "marvin.cloudstackException.CloudstackAPIException" message="Execute cmd: updatehost failed, due to: errorCode: 530, errorText:Failed to update host:2,No next resource state found for current state = Maintenance event = Disable":
     File "/usr/lib64/python3.6/unittest/case.py", line 60, in testPartExecutor
       yield
     File "/usr/lib64/python3.6/unittest/case.py", line 622, in run
       testMethod()
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 259, in test_hostha_enable_ha_when_host_disabled
       self.disableHost(self.host.id)
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 609, in disableHost
       response = self.apiclient.updateHost(cmd)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackAPI/cloudstackAPIClient.py", line 915, in updateHost
       response = self.connection.marvinRequest(command, response_type=response, method=method)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 381, in marvinRequest
       raise e
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 376, in marvinRequest
       raise self.__lastError
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 310, in __parseAndGetResponse
       response_cls)
     File "/usr/local/lib/python3.6/site-packages/marvin/jsonHelper.py", line 155, in getResultObj
       raise cloudstackException.CloudstackAPIException(respname, errMsg)
   marvin.cloudstackException.CloudstackAPIException: Execute cmd: updatehost failed, due to: errorCode: 530, errorText:Failed to update host:2,No next resource state found for current state = Maintenance event = Disable
   === TestName: test_hostha_enable_ha_when_host_disabled | Status : EXCEPTION ===
   
   ------------------------
   
   "tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_disconected" time="14.522"
   "checkForState:: expected=Ineligible, actual={haenable : True, hastate : 'Ineligible', haprovider : 'kvmhaprovider'}
   ]]></system-out></testcase><testcase classname="tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_in_maintenance" time="303.923" message="Job failed: {accountid : '5e5de944-fe40-11eb-9d50-1e003b000428', userid : '5e5edcd4-fe40-11eb-9d50-1e003b000428', cmd : 'org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd', jobstatus : 2, jobprocstatus : 0, jobresultcode : 530, jobresulttype : 'object', jobresult : {errorcode : 530, errortext : 'Failed to prepare host for maintenance due to: Host is already in state Maintenance. Cannot recall for maintenance until resolved.'}, jobinstancetype : 'Host', jobinstanceid : '23cb0154-392a-4af3-80e2-3bf0d24bc341', created : '2021-08-16T13:19:01+0000', completed : '2021-08-16T13:19:01+0000', jobid : '946cb1c8-37cb-4110-b5ea-529da895d05e':
     File "/usr/lib64/python3.6/unittest/case.py", line 60, in testPartExecutor
       yield
     File "/usr/lib64/python3.6/unittest/case.py", line 622, in run
       testMethod()
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 285, in test_hostha_enable_ha_when_host_in_maintenance
       self.setHostToMaintanance(self.host.id)
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 623, in setHostToMaintanance
       response = self.apiclient.prepareHostForMaintenance(cmd)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackAPI/cloudstackAPIClient.py", line 2435, in prepareHostForMaintenance
       response = self.connection.marvinRequest(command, response_type=response, method=method)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 381, in marvinRequest
       raise e
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 376, in marvinRequest
       raise self.__lastError
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 105, in __poll
       % async_response)
   Exception: Job failed: {accountid : '5e5de944-fe40-11eb-9d50-1e003b000428', userid : '5e5edcd4-fe40-11eb-9d50-1e003b000428', cmd : 'org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd', jobstatus : 2, jobprocstatus : 0, jobresultcode : 530, jobresulttype : 'object', jobresult : {errorcode : 530, errortext : 'Failed to prepare host for maintenance due to: Host is already in state Maintenance. Cannot recall for maintenance until resolved.'}, jobinstancetype : 'Host', jobinstanceid : '23cb0154-392a-4af3-80e2-3bf0d24bc341', created : '2021-08-16T13:19:01+0000', completed : '2021-08-16T13:19:01+0000', jobid : '946cb1c8-37cb-4110-b5ea-529da895d05e'}
   === TestName: test_hostha_enable_ha_when_host_in_maintenance | Status : EXCEPTION ===
   ```
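
One possible reading of the first error ("No next resource state found for current state = Maintenance event = Disable") is a lookup miss in a state-transition table, which would happen if the host was still in Maintenance when the test tried to disable it. The Java sketch below shows that failure mode only in the abstract: the states, events, and transitions are hypothetical and are not CloudStack's actual resource state machine.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical transition table; it only illustrates how a missing (state, event) pair surfaces.
class ResourceStateSketch {

    enum State { ENABLED, DISABLED, MAINTENANCE }
    enum Event { ENABLE, DISABLE, ENTER_MAINTENANCE, CANCEL_MAINTENANCE }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        add(State.ENABLED, Event.DISABLE, State.DISABLED);
        add(State.ENABLED, Event.ENTER_MAINTENANCE, State.MAINTENANCE);
        add(State.DISABLED, Event.ENABLE, State.ENABLED);
        add(State.MAINTENANCE, Event.CANCEL_MAINTENANCE, State.ENABLED);
        // Deliberately no (MAINTENANCE, DISABLE) entry in this sketch.
    }

    private static void add(State from, Event event, State to) {
        TRANSITIONS.computeIfAbsent(from, key -> new EnumMap<>(Event.class)).put(event, to);
    }

    // Returns the next state, or empty when the table has no entry for this pair.
    static Optional<State> next(State current, Event event) {
        return Optional.ofNullable(TRANSITIONS.getOrDefault(current, Map.of()).get(event));
    }

    public static void main(String[] args) {
        System.out.println(next(State.ENABLED, Event.DISABLE));
        System.out.println(next(State.MAINTENANCE, Event.DISABLE));
    }
}
```

If that reading is right, both failures would point at the host's state before the tests ran rather than at the new HA check, but the logs above do not confirm this on their own.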





[GitHub] [cloudstack] rhtyd edited a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd edited a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-934173009


   > You are right, this is something to be careful about.
   > We've configured the service in a way that it always starts on boot and if the process/job is killed for any reason it gets restarted as well. The only way of stopping it is via systemd (e.g. systemctl stop cloudstack-hahelper.service)
   
   Could you maybe explore systemd itself? There are ways to use dependencies and `targets` to ensure the agent is always up unless explicitly stopped by the admin. For example, there is also a restart-on-failure option (https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=).
   
   > We did not implement such a way of telling that the agent has been "intentionally stopped". This would rely on Admins disabling it on the CloudStack side.
   > I will need to add some information in the documentation about how to handle the cluster with this agent.
   
   See above: most admins may not remember this feature, and I wonder whether stopping an agent to do maintenance work could cause side effects. Maybe look at my suggestion above on exploiting systemd features. Docs +1
   
   > I can look into a way of adding CA certificates and validate the communications. For now, it has no such validation; however, it binds only with the node IP in the management network (which in theory is an isolated/secure network).
   
   I think that if adding a new service for this feature is unavoidable, we should absolutely (a) have the service use certificates issued by the CA framework so that it is served over TLS/SSL (i.e. on https), (b) provide a default-off option (which you've confirmed exists via a cluster-scope global setting), and (c) either have the firewall configured when the agent starts (or when the service/process starts?) or document how to use this service (i.e. enable port 8080). (It is probably not a good idea to expose the whole of libvirtd over the network, but one option may be to expose libvirtd over tls/ssh only to neighbouring hosts, see https://libvirt.org/remote.html, which would not require any additional services.)





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849638463


   @blueorangutan package





[GitHub] [cloudstack] nvazquez commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r830156484



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAConfig.java
##########
@@ -53,4 +53,32 @@
     public static final ConfigKey<Long> KvmHAFenceTimeout = new ConfigKey<>("Advanced", Long.class, "kvm.ha.fence.timeout", "60",
             "The maximum length of time, in seconds, expected for a fence operation to complete.", true, ConfigKey.Scope.Cluster);
 
+    public static final ConfigKey<Integer> KvmHaWebservicePort = new ConfigKey<Integer>("Advanced", Integer.class, "kvm.ha.webservice.port", "8443",
+            "It sets the port used to communicate with the KVM HA Agent Microservice that is running on KVM nodes. Default value is 8443.",
+            true, ConfigKey.Scope.Cluster);
+
+    public static final ConfigKey<Boolean> IsKvmHaWebserviceEnabled = new ConfigKey<Boolean>("Advanced", Boolean.class, "kvm.ha.webservice.enabled", "false",

Review comment:
       This setting is a bit misleading: one can disable it, but the HA web server will continue running on each host. I think it could be removed, since when it is used the check through the agent is also performed.

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,346 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.http.impl.client.HttpClients;
+import org.apache.http.ssl.SSLContexts;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import javax.net.ssl.SSLContext;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.security.KeyManagementException;
+import java.security.KeyStoreException;
+import java.security.NoSuchAlgorithmException;
+import java.util.Base64;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+    static final String HTTP_PROTOCOL = "http";
+    static final String HTTPS_PROTOCOL = "https";
+    private final static String APPLICATION_JSON = "application/json";
+    private final static String ACCEPT = "accept";
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;

Review comment:
       I think some of the logic could be moved to another class (maybe KvmHaHelper), along with the DB access, keeping the client class solely for interacting with the HA agent.

##########
File path: debian/control
##########
@@ -56,3 +56,10 @@ Package: cloudstack-integration-tests
 Architecture: all
 Depends: ${misc:Depends}, cloudstack-marvin (= ${source:Version})
 Description: The CloudStack Marvin integration tests
+
+Package: cloudstack-agent-ha-helper

Review comment:
       Ping @GabrielBrascher 

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,346 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.http.impl.client.HttpClients;
+import org.apache.http.ssl.SSLContexts;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import javax.net.ssl.SSLContext;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.security.KeyManagementException;
+import java.security.KeyStoreException;
+import java.security.NoSuchAlgorithmException;
+import java.util.Base64;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+    static final String HTTP_PROTOCOL = "http";
+    static final String HTTPS_PROTOCOL = "https";
+    private final static String APPLICATION_JSON = "application/json";
+    private final static String ACCEPT = "accept";
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    public int countRunningVmsOnAgent(Host host) {
+        String protocol = getProtocolString();
+        String url = String.format("%s://%s:%d", protocol, host.getPrivateIpAddress(), getKvmHaMicroservicePortValue(host));
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     * Returns the HTTP protocol. It can be 'HTTP' or 'HTTPS' depending on configuration 'kvm.ha.webservice.ssl.enabled'
+     */
+    protected String getProtocolString() {
+        boolean KvmHaWebserviceSslEnabled = KVMHAConfig.KvmHaWebserviceSslEnabled.value();
+        String protocol = HTTP_PROTOCOL;
+        if (KvmHaWebserviceSslEnabled) {
+            protocol = HTTPS_PROTOCOL;
+        }
+        return protocol;
+    }
+
+    /**
+     * Returns the port from the KVM HA Helper according to the configuration 'kvm.ha.webservice.port'
+     */
+    protected int getKvmHaMicroservicePortValue(Host host) {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), host.getClusterId(), host));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet on the host, therefore this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    public List<VMInstanceVO> listVmsOnHost(Host host) {

Review comment:
       Maybe this method could live outside the client, together with the DAO, while the client keeps `countRunningVmsOnAgent`?
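
To make the suggested separation concrete, here is a minimal Java sketch under stated assumptions. `AgentClient`, `VmLister`, `Helper`, and `agentMatchesDatabase` are hypothetical names, plain strings stand in for `Host` and `VMInstanceVO`, and comparing the two counts is an illustrative use of the split rather than something taken from the PR.

```java
import java.util.List;

// Hypothetical sketch of the split: the client only talks to the HA agent,
// while a helper owns the database-backed listing and any logic combining both.
class ClientHelperSplitSketch {

    // Client side: returns the VM count reported by the agent, or -1 on error
    // (mirroring the ERROR_CODE convention visible in the diff).
    interface AgentClient {
        int countRunningVmsOnAgent(String hostAddress);
    }

    // Stand-in for the DAO-backed lookup of VMs expected on the host.
    interface VmLister {
        List<String> listVmsOnHost(String hostAddress);
    }

    static final class Helper {
        private final AgentClient client;
        private final VmLister lister;

        Helper(AgentClient client, VmLister lister) {
            this.client = client;
            this.lister = lister;
        }

        // Assumed example of helper-level logic: does the agent's count match the database view?
        boolean agentMatchesDatabase(String hostAddress) {
            int onAgent = client.countRunningVmsOnAgent(hostAddress);
            return onAgent >= 0 && onAgent == lister.listVmsOnHost(hostAddress).size();
        }
    }

    public static void main(String[] args) {
        Helper helper = new Helper(host -> 2, host -> List.of("vm-a", "vm-b"));
        System.out.println(helper.agentMatchesDatabase("10.0.0.1"));
    }
}
```

One side effect of this layout is that the client can be tested with stubbed HTTP responses and the helper with a stubbed client, without a database in either test.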

##########
File path: packaging/systemd/cloudstack-agent-ha-helper.service
##########
@@ -0,0 +1,36 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Do not modify this file as your changes will be lost in the next CSM update.
+# If you need to add specific dependencies to this service unit do it in the
+# /etc/systemd/system/cloudstack-management.service.d/ directory
+
+[Unit]
+Description=CloudStack Agent HA Helper
+Documentation=http://www.cloudstack.org/
+Requires=libvirtd.service
+After=libvirtd.service
+
+[Service]
+Type=simple
+EnvironmentFile=/etc/default/cloudstack-agent-ha-helper

Review comment:
       Hi @GabrielBrascher, can you check this one as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r830862417



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,346 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.http.impl.client.HttpClients;
+import org.apache.http.ssl.SSLContexts;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import javax.net.ssl.SSLContext;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.security.KeyManagementException;
+import java.security.KeyStoreException;
+import java.security.NoSuchAlgorithmException;
+import java.util.Base64;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+    static final String HTTP_PROTOCOL = "http";
+    static final String HTTPS_PROTOCOL = "https";
+    private final static String APPLICATION_JSON = "application/json";
+    private final static String ACCEPT = "accept";
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    public int countRunningVmsOnAgent(Host host) {
+        String protocol = getProtocolString();
+        String url = String.format("%s://%s:%d", protocol, host.getPrivateIpAddress(), getKvmHaMicroservicePortValue(host));
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     * Returns the HTTP protocol. It can be 'HTTP' or 'HTTPS' depending on configuration 'kvm.ha.webservice.ssl.enabled'
+     */
+    protected String getProtocolString() {
+        boolean KvmHaWebserviceSslEnabled = KVMHAConfig.KvmHaWebserviceSslEnabled.value();
+        String protocol = HTTP_PROTOCOL;
+        if (KvmHaWebserviceSslEnabled) {
+            protocol = HTTPS_PROTOCOL;
+        }
+        return protocol;
+    }
+
+    /**
+     * Returns the port from the KVM HA Helper according to the configuration 'kvm.ha.webservice.port'
+     */
+    protected int getKvmHaMicroservicePortValue(Host host) {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), host.getClusterId(), host));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs on state 'Starting' are not common to be at the host, therefore this method does not list them.
+     * However, there is still a probability of a VM in 'Starting' state be already listed on the KVM via '$virsh list',
+     * but that's not likely and thus it is not relevant for this very context.
+     */
+    public List<VMInstanceVO> listVmsOnHost(Host host) {

Review comment:
       That makes sense, I will re-check this part of the implementation.
   Thanks for the review, @nvazquez!







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908501001


   @DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063577366


   <b>Trillian test result (tid-3548)</b>
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 26878 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4978-t3548-kvm-centos7.zip
   Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r665702938



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final double PROBLEMATIC_HOSTS_RATIO_ACCEPTED = 0.3;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ration of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.value());
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor);
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+
+        if (!isKvmHaAgentHealthy) {
+            if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
+                return true;
+            }
+        }
+
+        return isKvmHaAgentHealthy;

Review comment:
       Personally, I prefer _ifs_ to a _ternary_, at least in cases where the ternary turns into one huge line. But that might just be me being used to "old-fashioned" _ifs_ ... :thinking:.
   
   I have nothing against ternaries, though; we could definitely make that change.
   
   1. From:
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
   
       if (!isKvmHaAgentHealthy) {
           if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
               return true;
           }
       }
       
       return isKvmHaAgentHealthy;
   }
   ```
   
   2. To:
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
       return !isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) ? true : isKvmHaAgentHealthy;
   }
   ```
   
   OR
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
       return !isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))
       ? true
       : isKvmHaAgentHealthy;
   }
   ```
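
Not part of the PR: a minimal sketch, assuming the three checks are side-effect-free booleans, showing that the nested `if`, the ternary, and a plain `||` chain all return the same value. The class name `HealthCheckForms` and the `BooleanSupplier` parameters are hypothetical stand-ins for `isKvmHaAgentHealthy`, `isClusteProblematic`, and `isHostAgentReachableByNeighbour`.

```java
import java.util.function.BooleanSupplier;

class HealthCheckForms {

    // Nested-if form, mirroring option 1 above.
    static boolean nestedIf(boolean agentHealthy, BooleanSupplier clusterProblematic, BooleanSupplier reachableByNeighbour) {
        if (!agentHealthy) {
            if (clusterProblematic.getAsBoolean() || reachableByNeighbour.getAsBoolean()) {
                return true;
            }
        }
        return agentHealthy;
    }

    // Ternary form, mirroring option 2 above.
    static boolean ternary(boolean agentHealthy, BooleanSupplier clusterProblematic, BooleanSupplier reachableByNeighbour) {
        return !agentHealthy && (clusterProblematic.getAsBoolean() || reachableByNeighbour.getAsBoolean()) ? true : agentHealthy;
    }

    // Plain short-circuit form (an alternative not proposed in the thread).
    static boolean plainOr(boolean agentHealthy, BooleanSupplier clusterProblematic, BooleanSupplier reachableByNeighbour) {
        return agentHealthy || clusterProblematic.getAsBoolean() || reachableByNeighbour.getAsBoolean();
    }

    public static void main(String[] args) {
        // Exhaustively compare the three forms over all 8 input combinations.
        for (int mask = 0; mask < 8; mask++) {
            final boolean healthy = (mask & 1) != 0;
            final boolean problematic = (mask & 2) != 0;
            final boolean reachable = (mask & 4) != 0;
            boolean expected = nestedIf(healthy, () -> problematic, () -> reachable);
            if (ternary(healthy, () -> problematic, () -> reachable) != expected
                    || plainOr(healthy, () -> problematic, () -> reachable) != expected) {
                throw new IllegalStateException("Forms disagree for mask " + mask);
            }
        }
        System.out.println("All three forms agree.");
    }
}
```

Because `||` short-circuits, the plain form also skips the cluster and neighbour checks when the agent is already healthy, just as the nested `if` does.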







[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063033056


   @rohityadavcloud @PaulAngus I've addressed your security concerns.
   
   By default, the HA helper service is served over HTTP + SSL, with Basic Auth. Plain HTTP is still possible via the "insecure" mode of the script.
   The service is deployed with a default configuration, which can be changed with the following arguments:
   ```
       Optional arguments:
         -h, --help                Show this help message and exit
         -i, --insecure            Allows to run the HTTP server without SSL
         -p, --port PORT           Port to be used by the agent-ha-helper server
         -u, --username USERNAME   Sets the user for server authentication
         -k, --password PASSWORD   Keyword/password for server authentication
   ```
   
   This requires that both ends (management and KVM agents) are configured properly. The default configuration is set to SSL + Authentication with a default username + password (which admins can easily change).
   
   With SSL + Authentication, only the management node and configured KVM hosts are able to serve and consume this API.
   
   It is important to note that this API **DOES NOT** allow running Libvirt commands. It only lists the running VMs and reports whether the host is reachable (`Up` vs `Down`). An attacker who gained access to this API would only be able to collect the number of running VMs.
   
   Also, note that the "secure mode" works only when the KVM nodes have certificates provided (via `provisionCertificate`); otherwise the service will fail.
   
   If this implementation gets merged, I will work on the documentation needed to guide users.
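
For illustration only, a sketch of how a client might build the request URL and the Basic Auth header for such a service, using only the JDK. The `%s://%s:%d` URL shape follows `countRunningVmsOnAgent` in the PR; the class and method names, the host address, and the credentials below are hypothetical and are not the PR's defaults.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeaderSketch {

    // Builds the value of an HTTP "Authorization" header for Basic Auth:
    // "Basic " followed by base64("username:password").
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Picks "https" or "http" depending on whether SSL is enabled, then
    // formats the base URL of the helper service on a given host.
    static String helperUrl(boolean sslEnabled, String hostIp, int port) {
        String protocol = sslEnabled ? "https" : "http";
        return String.format("%s://%s:%d", protocol, hostIp, port);
    }

    public static void main(String[] args) {
        // Placeholder values, chosen only for the example.
        System.out.println(helperUrl(true, "10.0.0.5", 8443));
        System.out.println(basicAuthHeader("user", "pass"));
    }
}
```

Note that Basic Auth only encodes the credentials; it does not encrypt them, which is why pairing it with SSL matters here.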





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r830861414



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAConfig.java
##########
@@ -53,4 +53,32 @@
     public static final ConfigKey<Long> KvmHAFenceTimeout = new ConfigKey<>("Advanced", Long.class, "kvm.ha.fence.timeout", "60",
             "The maximum length of time, in seconds, expected for a fence operation to complete.", true, ConfigKey.Scope.Cluster);
 
+    public static final ConfigKey<Integer> KvmHaWebservicePort = new ConfigKey<Integer>("Advanced", Integer.class, "kvm.ha.webservice.port", "8443",
+            "It sets the port used to communicate with the KVM HA Agent Microservice that is running on KVM nodes. Default value is 8443.",
+            true, ConfigKey.Scope.Cluster);
+
+    public static final ConfigKey<Boolean> IsKvmHaWebserviceEnabled = new ConfigKey<Boolean>("Advanced", Boolean.class, "kvm.ha.webservice.enabled", "false",

Review comment:
       @nvazquez maybe I need to rename it or change the description to make it clearer.
   
   This setting enables or disables the KVM HA helper. It is an "on/off" switch that controls whether CloudStack checks the added webserver on KVM nodes.
   
   1. When it is set to `false` (**default**), CloudStack does not validate via the KVM HA Helper Client and keeps the normal behavior of the CloudStack KVM HA flow.
   2. When it is set to `true`, CloudStack checks the webserver's health.
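
A minimal sketch of the switch semantics described above, not the PR's actual code. `Status` here is a simplified, hypothetical stand-in for `com.cloud.host.Status`, and `helperHealthy` stands in for the HTTP check against the KVM HA helper.

```java
import java.util.function.BooleanSupplier;

class HaHelperSwitchSketch {

    // Simplified stand-in for CloudStack's host status values.
    enum Status { Up, Disconnected, Down }

    // With the switch off, the status from the normal HA flow is returned untouched
    // and the helper is never queried. With the switch on, a healthy helper
    // promotes the host to Up; otherwise the provided status is kept.
    static Status resolveStatus(boolean webserviceEnabled, Status agentStatus, BooleanSupplier helperHealthy) {
        if (!webserviceEnabled) {
            return agentStatus;
        }
        return helperHealthy.getAsBoolean() ? Status.Up : agentStatus;
    }

    public static void main(String[] args) {
        System.out.println(resolveStatus(false, Status.Disconnected, () -> true)); // switch off
        System.out.println(resolveStatus(true, Status.Disconnected, () -> true));  // switch on, helper healthy
        System.out.println(resolveStatus(true, Status.Down, () -> false));         // switch on, helper unhealthy
    }
}
```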







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-881633478


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian. SL-JID 569





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908872068


   <b>Trillian test result (tid-1839)</b>
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 37435 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4978-t1839-kvm-centos7.zip
   Intermittent failure detected: /marvin/tests/smoke/test_network.py
   Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
   Smoke tests completed. 88 look OK, 1 have error(s)
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_hostha_kvm_host_degraded | `Failure` | 768.71 | test_hostha_kvm.py
   





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866300957


   @blueorangutan package





[GitHub] [cloudstack] nvazquez commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-912180722


   Thanks for updating, @GabrielBrascher. After your investigation, please advise whether the failure is related to the PR.





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r665719771



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final double PROBLEMATIC_HOSTS_RATIO_ACCEPTED = 0.3;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ration of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.value());
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor);
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+
+        if (!isKvmHaAgentHealthy) {
+            if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
+                return true;
+            }
+        }
+
+        return isKvmHaAgentHealthy;

Review comment:
       @GabrielBrascher I see no problem in keeping the `if` statement. In this case, I would only suggest merging the two `if`s to avoid nesting:
   
   ```suggestion
           boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
   
           if (!isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))) {
               return true;
           }
   
           return isKvmHaAgentHealthy;
   ```
   
   Merging them is what gave me the idea of a ternary...
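
A minimal sketch of where this simplification can lead, assuming the three checks are reduced to plain booleans (the class and method names below are hypothetical and not part of the PR). The nested form, the merged `if`, and a single `||` expression all yield the same result, and `||` short-circuits in the same order, so the cluster and neighbour checks would still run only when the agent check fails. This is one possible reading of the suggestion, not necessarily the form the reviewer had in mind.

```java
final class HealthCheckSketch {

    // Nested form, mirroring the structure of the original diff.
    static boolean nested(boolean agentHealthy, boolean clusterProblematic, boolean reachableByNeighbour) {
        if (!agentHealthy) {
            if (clusterProblematic || reachableByNeighbour) {
                return true;
            }
        }
        return agentHealthy;
    }

    // Single-expression form: true when any of the three conditions holds.
    static boolean flattened(boolean agentHealthy, boolean clusterProblematic, boolean reachableByNeighbour) {
        return agentHealthy || clusterProblematic || reachableByNeighbour;
    }

    public static void main(String[] args) {
        // Compare both forms over all eight input combinations.
        for (int mask = 0; mask < 8; mask++) {
            boolean a = (mask & 1) != 0;
            boolean b = (mask & 2) != 0;
            boolean c = (mask & 4) != 0;
            if (nested(a, b, c) != flattened(a, b, c)) {
                throw new AssertionError("Forms differ for mask " + mask);
            }
        }
        System.out.println("Both forms agree on all inputs.");
    }
}
```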




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] DaanHoogland commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
DaanHoogland commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908500543


   @blueorangutan test





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908413093









[GitHub] [cloudstack] blueorangutan removed a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan removed a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876010424









[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655677278



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GutoVeronezi I decided to remove this catch.
   Looking at `easySend`, it already has enough catch blocks. If it does not catch the exception, I don't know what would:
   
   ```
   public Answer easySend(final Long hostId, final Command cmd) {
           try {
                   ...
                   ...
                   ...
           } catch (final AgentUnavailableException e) {
               s_logger.warn(e.getMessage());
               return null;
           } catch (final OperationTimedoutException e) {
               s_logger.warn("Operation timed out: " + e.getMessage());
               return null;
           } catch (final Exception e) {
               s_logger.warn("Exception while sending", e);
               return null;
           }
   ```
   
   For reference: [AgentManagerImpl.java#L938](https://github.com/apache/cloudstack/blob/4f6851f4c057a9524231e75285ba2f5257ff640b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java#L938)
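
A minimal sketch of the point being made, using a hypothetical stand-in for `easySend` (the names, the `String` answer type, and the returned status labels are assumptions for illustration, not CloudStack code). When the callee already converts every `Exception` into a `null` return, an outer `catch (Exception e)` around the call has nothing left to catch, and the caller only needs to handle `null`. Note that `catch (Exception e)` does not cover `Error` subclasses, which would still propagate.

```java
final class EasySendSketch {

    // Hypothetical stand-in: swallows every Exception and signals failure with null.
    static String easySend(boolean failInside) {
        try {
            if (failInside) {
                throw new IllegalStateException("agent unavailable");
            }
            return "answer";
        } catch (Exception e) {
            return null;
        }
    }

    // The caller needs a null check instead of its own try/catch.
    static String investigate(boolean failInside) {
        String answer = easySend(failInside);
        if (answer == null) {
            return "UNKNOWN"; // hypothetical label for "no answer received"
        }
        return "ANSWERED";
    }

    public static void main(String[] args) {
        System.out.println(investigate(false));
        System.out.println(investigate(true));
    }
}
```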







[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-933484526


   @rhtyd thanks for the review. I hope that I can address all your comments here:
   
   > we should explore options to implement this without introducing a new service (my main concern is from security and upgrade point of view, a lot of people don't like non-essential services running on hypervisor)
   
   I understand that we should avoid adding new services to the hypervisor, but I see HA as an essential part, and having it decoupled from the CloudStack agent helps avoid problems specific to the Java process.
   
   Additionally, this PR adds a cluster-scoped global setting, `kvm.ha.webservice.enabled`. It defaults to false and can easily be toggled; when it is disabled, the CloudStack HA workflow skips the KVM HA Helper checks.
   
   > for example, (1) what if the admin wants to do some maintenance etc. that requires stopping the agent - in that case could your changes cause any side-effect, (2) systemd can be configured (probably already is?) to have this service always start on boot and on-crash/on-error
   
   You are right, this is something to be careful about.
   We've configured the service so that it always starts on boot and is restarted if the process/job is killed for any reason. The only way to stop it is via systemd (e.g. `systemctl stop cloudstack-hahelper.service`).
   
   > agent has a stop command answer it can tell mgmt server why it is stopping - that can be used intelligently to not cause HA led migrations (I haven't checked, probably already-is?)
   
   We did not implement a way for the agent to report that it has been "intentionally stopped". For now, this relies on admins disabling the check on the CloudStack side.
   I will need to add some information to the documentation about how to handle a cluster that runs this agent.
   
   > if this new service is essential, can it be secured using CA-framework generated certificates so at least the communication is validated (the simplest being server certificate was signed/created against the root CA cert)
   
   I can look into adding CA certificates to validate the communication. For now, there is no such validation; however, the service binds only to the node's IP on the management network (which in theory is an isolated/secure network).
   
   > and a global setting/kill-switch for users who don't want/need this additional feature/service (for ex. NFS users?) and have it disabled by default
   
   Perfect, this is important indeed. We've added it via `kvm.ha.webservice.enabled`. It can be set per cluster, so admins control exactly which clusters have the helper enabled or disabled.
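
A minimal sketch of how such a kill-switch can gate the extra check, loosely following the `isHealthy` flow in the `KVMHostActivityChecker` diff quoted in this thread. The class name, parameters, and the use of `BooleanSupplier` in place of the real NFS and Libvirt checks are assumptions for illustration. Unlike the diff, this sketch consults the helper only when the NFS result is unhealthy.

```java
import java.util.function.BooleanSupplier;

final class KillSwitchSketch {

    // Hypothetical reduction of the health decision to booleans and lazy checks.
    static boolean isHealthy(boolean servedByNfs, BooleanSupplier nfsCheck,
                             boolean webserviceEnabled, BooleanSupplier libvirtCheck) {
        // Without NFS storage there is no heartbeat result, so start from "healthy".
        boolean healthy = true;
        if (servedByNfs) {
            healthy = nfsCheck.getAsBoolean();
        }
        // Setting disabled: behave exactly as the NFS-only check.
        if (!webserviceEnabled) {
            return healthy;
        }
        // Setting enabled: a healthy helper can override a failed NFS heartbeat.
        return healthy || libvirtCheck.getAsBoolean();
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(true, () -> false, false, () -> true));
        System.out.println(isHealthy(true, () -> false, true, () -> true));
    }
}
```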





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870344180


   @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-888582908


   @blueorangutan package





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876137136


   @blueorangutan package





[GitHub] [cloudstack] wido commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
wido commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r627372618



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -85,12 +89,35 @@ public Status isAgentAlive(Host agent) {
                 break;
             }
         }
-        if (!hasNfs) {
-            s_logger.warn(
-                    "Agent investigation was requested on host " + agent + ", but host does not support investigation because it has no NFS storage. Skipping investigation.");
-            return Status.Disconnected;
+        Status agentStatus = Status.Disconnected;
+        if (hasNfs) {
+            agentStatus = checkAgentStatusViaNfs(agent);
+            s_logger.debug(String.format("Agent investigation was requested on host %s. Agent status via NFS heartbeat is %s.", agent, agentStatus));
+        } else {
+            s_logger.debug(String.format("Agent investigation was requested on host %s, but host has no NFS storage. Skipping investigation via NFS.", agent));
         }
 
+        agentStatus = checkAgentStatusViaKvmHaAgent(agent, agentStatus);
+
+        return agentStatus;
+    }
+
+    /**
+     * It checks the KVM node healthy via KVM HA Agent. If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    private Status checkAgentStatusViaKvmHaAgent(Host agent, Status agentStatus) {
+        KvmHaAgentClient kvmHaAgentClient = new KvmHaAgentClient(agent);
+        boolean isVmsCountOnKvmMatchingWithDatabase = kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+        if(isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            s_logger.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected."));

Review comment:
       Shouldn't we pass an argument for '%s'?
   
   As written, the message will not be logged correctly.

##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -85,12 +89,35 @@ public Status isAgentAlive(Host agent) {
                 break;
             }
         }
-        if (!hasNfs) {
-            s_logger.warn(
-                    "Agent investigation was requested on host " + agent + ", but host does not support investigation because it has no NFS storage. Skipping investigation.");
-            return Status.Disconnected;
+        Status agentStatus = Status.Disconnected;
+        if (hasNfs) {
+            agentStatus = checkAgentStatusViaNfs(agent);
+            s_logger.debug(String.format("Agent investigation was requested on host %s. Agent status via NFS heartbeat is %s.", agent, agentStatus));
+        } else {
+            s_logger.debug(String.format("Agent investigation was requested on host %s, but host has no NFS storage. Skipping investigation via NFS.", agent));
         }
 
+        agentStatus = checkAgentStatusViaKvmHaAgent(agent, agentStatus);
+
+        return agentStatus;
+    }
+
+    /**
+     * It checks the KVM node healthy via KVM HA Agent. If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    private Status checkAgentStatusViaKvmHaAgent(Host agent, Status agentStatus) {
+        KvmHaAgentClient kvmHaAgentClient = new KvmHaAgentClient(agent);
+        boolean isVmsCountOnKvmMatchingWithDatabase = kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+        if(isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            s_logger.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected."));
+        } else {
+            s_logger.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent"));

Review comment:
       Shouldn't we pass an argument for '%s'?
   
   As written, the message will not be logged correctly.
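
A minimal, self-contained sketch of the issue raised in these two comments. In standard Java, `String.format` with a `%s` specifier and no matching argument compiles but throws `java.util.MissingFormatArgumentException` at runtime, so the log call would fail rather than print a placeholder. The class and method names are hypothetical, and passing the host as the argument is an assumption about the intended fix.

```java
import java.util.MissingFormatArgumentException;

final class FormatArgumentSketch {

    // Intended usage: one argument for the single %s specifier.
    static String describe(String agent) {
        return String.format("Checking agent %s status; KVM HA Agent is running as expected.", agent);
    }

    // Reproduces the problem: %s without an argument throws at runtime.
    static boolean throwsWithoutArgument() {
        try {
            String.format("Checking agent %s status.");
            return false;
        } catch (MissingFormatArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("host-1"));
        System.out.println("Missing argument throws: " + throwsWithoutArgument());
    }
}
```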







[GitHub] [cloudstack] NuxRo commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
NuxRo commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-926120891


   Ok, a couple of questions:
   
   1 - as Rohit asked, why can't we do the checks over SSH instead of running a separate service which might have security and other implications? Hypervisors are already expected to accept incoming SSH from the management server(s).
   2 - It's not 100% clear, people will still need to rely on the old NFS HA method, right? Your work merely adds additional checks. Any way we can get rid of NFS at this point?





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-919741762


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 1254





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r665765431



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -59,29 +68,63 @@
     @Inject
     private AgentManager agentMgr;
     @Inject
-    private PrimaryDataStoreDao storagePool;
-    @Inject
     private StorageManager storageManager;
     @Inject
+    private PrimaryDataStoreDao storagePool;
+    @Inject
     private ResourceManager resourceManager;
+    @Inject
+    private StoragePoolHostDao storagePoolHostDao;
+    @Inject
+    private KvmHaHelper kvmHaHelper;
+
+    private static final Set<Storage.StoragePoolType> NFS_POOL_TYPE = new HashSet<>(Arrays.asList(Storage.StoragePoolType.NetworkFilesystem, Storage.StoragePoolType.ManagedNFS));
+    private static final Set<Hypervisor.HypervisorType> KVM_OR_LXC = new HashSet<>(Arrays.asList(Hypervisor.HypervisorType.KVM, Hypervisor.HypervisorType.LXC));
 
     @Override
-    public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException {
+    public boolean isActive(Host host, DateTime suspectTime) throws HACheckerException {
         try {
-            return isVMActivtyOnHost(r, suspectTime);
+            return isVMActivtyOnHost(host, suspectTime);
         } catch (HACheckerException e) {
             //Re-throwing the exception to avoid poluting the 'HACheckerException' already thrown
             throw e;
-        } catch (Exception e){
-            String message = String.format("Operation timed out, probably the %s is not reachable.", r.toString());
+        } catch (Exception e) {
+            String message = String.format("Operation timed out, probably the %s is not reachable.", host.toString());
             LOG.warn(message, e);
             throw new HACheckerException(message, e);
         }
     }
 
     @Override
-    public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+    public boolean isHealthy(Host host) {
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(host);
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(host);
+
+        if (isHostServedByNfsPool) {
+            isHealthy = isHealthViaNfs(host);
+        }
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        if (kvmHaHelper.isKvmHealthyCheckViaLibvirt(host) && !isHealthy) {
+            return true;
+        }
+
+        return isHealthy;
+    }
+
+    private boolean isHealthViaNfs(Host r) {
+        boolean isHealthy = true;
+        if (isHostServedByNfsPool(r)) {
+            isHealthy = isAgentActive(r);
+            if (!isHealthy) {
+                LOG.warn(String.format("NFS storage health check failed for %s. It seems that a storage does not have activity.", r.toString()));
+            }
+        }

Review comment:
       Done, thanks!







[GitHub] [cloudstack] GabrielBrascher removed a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher removed a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-875997033


   @blueorangutan package





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655691011



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GabrielBrascher indeed, I see no way for it to throw an exception here.







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063209940


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 2814





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005727937


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 2096





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r629568870



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -213,6 +278,18 @@ protected boolean verifyActivityOfStorageOnHost(HashMap<StoragePool, List<Volume
         return poolVolMap;
     }
 
+    private boolean isHostServedByNfsPool(Host agent) {
+        List<StoragePoolHostVO> storagesOnHost = storagePoolHostDao.listByHostId(agent.getId());
+        for (StoragePoolHostVO storagePoolHostRef : storagesOnHost) {
+            StoragePoolVO storagePool = this.storagePool.findById(storagePoolHostRef.getPoolId());
+            if(Storage.StoragePoolType.NetworkFilesystem == storagePool.getPoolType()
+                    || Storage.StoragePoolType.ManagedNFS == storagePool.getPoolType()) {

Review comment:
       The collection cited in https://github.com/apache/cloudstack/pull/4978/files#r629568449 could be used here too.
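
A minimal sketch of the suggested refactoring, assuming a local stand-in enum in place of `Storage.StoragePoolType` (the enum, class, and method names here are hypothetical). Keeping the NFS pool types in one constant set turns the chained `==` comparisons into a single `contains` call, and the same constant can be reused wherever the check is needed.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

final class NfsPoolTypeSketch {

    // Local stand-in for the real pool type enum; values chosen for illustration.
    enum PoolType { NetworkFilesystem, ManagedNFS, RBD, CLVM }

    // Single source of truth for which pool types count as NFS.
    static final Set<PoolType> NFS_POOL_TYPES =
            Collections.unmodifiableSet(EnumSet.of(PoolType.NetworkFilesystem, PoolType.ManagedNFS));

    static boolean isNfs(PoolType type) {
        return NFS_POOL_TYPES.contains(type);
    }

    public static void main(String[] args) {
        for (PoolType type : PoolType.values()) {
            System.out.println(type + " is NFS: " + isNfs(type));
        }
    }
}
```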







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849618656


   @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876149233


   Packaging result: :heavy_multiplication_x: el7 :heavy_multiplication_x: el8 :heavy_check_mark: debian. SL-JID 496





[GitHub] [cloudstack] DaanHoogland commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
DaanHoogland commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908412813









[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r689785659



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or only a small fraction of them.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Alert, Disconnected, Down or Error states; in such a case HA should not recover/fence.<br>
+     * Instead, Admins should be warned and investigate, as the cause could be a networking problem and the few healthy hosts in the cluster might not even have enough resource capacity.
+     * <br><br>
+     * Admins can change the accepted ratio of problematic hosts via global settings by updating the configuration "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        double acceptedProblematicHostsRatio = KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.valueIn(host.getClusterId());
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * acceptedProblematicHostsRatio);
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    LOGGER.debug(String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor));
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+        if (!isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))) {
+            return true;
+        }
+        return isKvmHaAgentHealthy;
+    }
+
+    /**
+     * Checks if the KVM HA webservice is enabled. One can enable or disable it via global settings 'kvm.ha.webservice.enabled'.
+     */
+    public boolean isKvmHaWebserviceEnabled(Host host) {
+        boolean isKvmHaWebserviceEnabled = KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+        if (!isKvmHaWebserviceEnabled) {
+            LOGGER.debug(String.format("Skipping KVM HA web-service verification for %s due to 'kvm.ha.webservice.enabled' not enabled.", host));
+            return false;
+        }
+        return true;

Review comment:
       @GutoVeronezi these slipped past me.
   Thanks a lot for bringing them up. I will update the code soon.
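
For readers following the diff above, the cluster-level guard can be sketched in isolation. This is a minimal, hypothetical sketch and not the PR's code: `Status` is a local stand-in enum rather than CloudStack's `com.cloud.host.Status`, the class and method names are made up for illustration, and the ratio is passed in directly instead of being read from `kvm.ha.accepted.problematic.hosts.ratio`. It mirrors how the accepted threshold is truncated to an `int` and how a failed agent check is overridden when the cluster is problematic or a neighbour can still reach the host.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class KvmHaGuardSketch {

    // Local stand-in for com.cloud.host.Status (assumption for this sketch).
    enum Status { Up, Alert, Disconnected, Down, Error }

    private static final Set<Status> PROBLEMATIC =
            EnumSet.of(Status.Alert, Status.Disconnected, Status.Down, Status.Error);

    // True when the number of problematic hosts exceeds the accepted share of the cluster.
    // The cast truncates toward zero, so small clusters can end up with a threshold of 0.
    static boolean isClusterProblematic(List<Status> hostStatuses, double acceptedRatio) {
        long problematicHosts = hostStatuses.stream().filter(PROBLEMATIC::contains).count();
        int acceptedProblematicHosts = (int) (hostStatuses.size() * acceptedRatio);
        return problematicHosts > acceptedProblematicHosts;
    }

    // Same truth table as the quoted isKvmHealthyCheckViaLibvirt: a failed agent check
    // is treated as healthy (no fence/recover) if the cluster is problematic or a
    // neighbour can still reach the host.
    static boolean isHealthy(boolean agentHealthy, boolean clusterProblematic, boolean reachableByNeighbour) {
        return agentHealthy || clusterProblematic || reachableByNeighbour;
    }
}
```

With three hosts and a ratio of 0.25 the accepted count truncates to 0, so in this sketch a single problematic host already marks the cluster as problematic.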




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-888583614


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866908726


   @nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-888597505


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian. SL-JID 675





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634353753



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet present on the host, therefore this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }

Review comment:
       As the method `listByHostAndState` receives a varargs of states as its parameter, we could simplify this method by joining the first three requests and filtering them if needed:
   
   ```java
   List<VMInstanceVO> listByHostAndStates = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running, VirtualMachine.State.Stopping, VirtualMachine.State.Migrating);
   
   if (LOGGER.isTraceEnabled()) {
       List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
       int startingVMs = listByHostAndStateStarting.size();
       int runningVMs = listByHostAndStates.stream().filter...;
       int stoppingVms = listByHostAndStates.stream().filter...;
       int migratingVms = listByHostAndStates.stream().filter...;
       int countRunningVmsOnAgent = countRunningVmsOnAgent();
       LOGGER.trace(
               String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
                       stoppingVms, migratingVms, listByHostAndStates.size(), countRunningVmsOnAgent));
   }
   
   return listByHostAndStates;
   ``` 
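
One possible way to fill in the elided `filter...` calls is to count each state from the single combined list. The sketch below is hypothetical and self-contained: `Vm` and `State` are local stand-ins for `VMInstanceVO` and `VirtualMachine.State`, `countByState` is an illustrative helper name, and it assumes the real class exposes a `getState()` accessor.

```java
import java.util.List;

final class VmStateCountSketch {

    // Local stand-in for VirtualMachine.State (assumption for this sketch).
    enum State { Starting, Running, Stopping, Migrating }

    // Local stand-in for VMInstanceVO, reduced to the one accessor used here.
    static final class Vm {
        private final State state;

        Vm(State state) {
            this.state = state;
        }

        State getState() {
            return state;
        }
    }

    // Counts the VMs of the combined list that are in the given state.
    static int countByState(List<Vm> vms, State state) {
        return (int) vms.stream().filter(vm -> vm.getState() == state).count();
    }
}
```

In the suggested refactoring, each of the three counters would then be a single call such as `countByState(listByHostAndStates, State.Running)`, at the cost of iterating the combined list once per state when trace logging is enabled.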

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -81,7 +98,63 @@ public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException
 
     @Override
     public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(r);
+        boolean isKvmHaWebserviceEnabled = isKvmHaWebserviceEnabled(r);
+
+        isHealthy = isHealthViaNfs(r);
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        //TODO

Review comment:
       What does this comment mean?

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -81,7 +98,63 @@ public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException
 
     @Override
     public boolean isHealthy(Host r) {

Review comment:
       We can rename the parameter to a more intuitive name.

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet present on the host, therefore this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  IF: <br>
+     *  (i) KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as it could not find VMs running but expected at least
+     *    2 VMs running; fencing/recovering the host would avoid downtime to VMs in this case.<br>
+     *  (ii) KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true, logs WARN messages and avoids triggering HA recovery/fencing,
+     *    as it could be an inconsistency while migrating a VM.<br>
+     *  (iii) the number of listed VMs is different than expected: returns true and prints WARN messages so Admins can monitor and react accordingly
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName(), numberOfVmsOnHostAccordingToDb));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false as it could not find VMs running but expected more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST VMs running; fencing/recovering the host would avoid downtime to VMs in this case.
+            // The cautious margin in the conditional avoids fencing/recovering hosts when there is one VM migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.

Review comment:
       The Javadoc already explains a little of the context; could we remove this comment and improve the Javadoc instead?
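
To make the cases being documented easier to compare, the decision can be reduced to a function of two integers. This is a hypothetical sketch reconstructed from the Javadoc cases (i) to (iii) and the visible conditionals only; the rest of the method body is not shown in the quoted diff, so the final `return true` is an assumption, as are the class and parameter names.

```java
final class VmCountHealthSketch {

    static final int ERROR_CODE = -1;
    static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;

    /**
     * Returns false when the agent could not be queried (negative count), or when the
     * agent reports zero VMs while the database expects more than the cautious margin.
     * Every other combination, including a plain mismatch, is treated as healthy.
     */
    static boolean isHealthy(int vmsAccordingToDb, int vmsOnAgent) {
        if (vmsOnAgent < 0) {
            return false; // agent unreachable or Libvirt validation failed
        }
        if (vmsOnAgent == 0 && vmsAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
            return false; // expected at least 2 VMs but found none
        }
        return true; // counts match, or the mismatch is only worth a warning
    }
}
```

Under these assumptions, 3 expected and 1 found is still healthy, while 3 expected and 0 found is not; 1 expected and 0 found stays healthy because of the margin.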

##########
File path: plugins/hypervisors/kvm/src/test/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClientTest.java
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.http.HttpEntity;
+import org.apache.http.HttpResponse;
+import org.apache.http.HttpStatus;
+import org.apache.http.ProtocolVersion;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.entity.InputStreamEntity;
+import org.apache.http.message.BasicStatusLine;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.mockito.Mock;
+import org.mockito.Mockito;
+import org.mockito.junit.MockitoJUnitRunner;
+
+import com.cloud.host.HostVO;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.dao.VMInstanceDaoImpl;
+import com.google.gson.JsonArray;
+import com.google.gson.JsonElement;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+
+@RunWith(MockitoJUnitRunner.class)
+public class KvmHaAgentClientTest {
+
+    private static final int ERROR_CODE = -1;
+    private HostVO agent = Mockito.mock(HostVO.class);
+    private KvmHaAgentClient kvmHaAgentClient = Mockito.spy(new KvmHaAgentClient(agent));
+    private static final int DEFAULT_PORT = 8080;
+    private static final String PRIVATE_IP_ADDRESS = "1.2.3.4";
+    private static final String JSON_STRING_EXAMPLE_3VMs = "{\"count\":3,\"virtualmachines\":[\"r-123-VM\",\"v-134-VM\",\"s-111-VM\"]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_3VMs = 3;
+    private static final String JSON_STRING_EXAMPLE_0VMs = "{\"count\":0,\"virtualmachines\":[]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_0VMs = 0;
+    private static final String EXPECTED_URL = String.format("http://%s:%d", PRIVATE_IP_ADDRESS, DEFAULT_PORT);
+    private static final HttpRequestBase HTTP_REQUEST_BASE = new HttpGet(EXPECTED_URL);
+    private static final String VMS_COUNT = "count";
+    private static final String VIRTUAL_MACHINES = "virtualmachines";
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int KVM_HA_WEBSERVICE_PORT = 8080;
+
+    @Mock
+    HttpClient client;
+
+    @Mock
+    VMInstanceDaoImpl vmInstanceDao;
+
+    @Test
+    public void isKvmHaAgentHealthyTestAllGood() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestVMsDoNotMatchButDoNotReturnFalse() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 1);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestExpectedRunningVmsButNoneListed() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 0);
+        Assert.assertFalse(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestReceivedErrorCode() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, ERROR_CODE);
+        Assert.assertFalse(result);
+    }
+
+    private boolean isKvmHaAgentHealthyTests(int expectedNumberOfVms, int vmsRunningOnAgent) {
+        List<VMInstanceVO> vmsOnHostList = new ArrayList<>();
+        for (int i = 0; i < expectedNumberOfVms; i++) {
+            VMInstanceVO vmInstance = Mockito.mock(VMInstanceVO.class);
+            vmsOnHostList.add(vmInstance);
+        }
+
+        Mockito.doReturn(vmsOnHostList).when(kvmHaAgentClient).listVmsOnHost(Mockito.any(), Mockito.any());
+        Mockito.doReturn(vmsRunningOnAgent).when(kvmHaAgentClient).countRunningVmsOnAgent();
+
+        return kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestNull() {
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(null);
+        Assert.assertNull(responseJson);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTest() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_3VMs, 3l);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestOtherJsonExample() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_0VMs, 0l);
+    }
+
+    private void prepareAndTestProcessHttpResponseIntoJson(String jsonString, long expectedVmsCount) throws IOException {
+        CloseableHttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, jsonString);
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(mockedResponse);
+
+        Assert.assertNotNull(responseJson);
+        JsonElement jsonElementVmsCount = responseJson.get(VMS_COUNT);
+        JsonElement jsonElementVmsArray = responseJson.get(VIRTUAL_MACHINES);
+        JsonArray jsonArray = jsonElementVmsArray.getAsJsonArray();
+
+        Assert.assertEquals(expectedVmsCount, jsonArray.size());
+        Assert.assertEquals(expectedVmsCount, jsonElementVmsCount.getAsLong());
+        Assert.assertEquals(jsonString, responseJson.toString());
+    }
+
+    private CloseableHttpResponse mockResponse(int httpStatusCode, String jsonString) throws IOException {
+        BasicStatusLine basicStatusLine = new BasicStatusLine(new ProtocolVersion("HTTP", 1000, 123), httpStatusCode, "Status");
+        CloseableHttpResponse response = Mockito.mock(CloseableHttpResponse.class);
+        InputStream in = IOUtils.toInputStream(jsonString, StandardCharsets.UTF_8);
+        Mockito.when(response.getStatusLine()).thenReturn(basicStatusLine);
+        HttpEntity httpEntity = new InputStreamEntity(in);
+        Mockito.when(response.getEntity()).thenReturn(httpEntity);
+        return response;
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTest() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTestBlankNoVmsListed() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_0VMs, EXPECTED_RUNNING_VMS_EXAMPLE_0VMs);
+    }
+
+    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, jsonStringExample)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+
+        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+
+        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+        Assert.assertEquals(expectedListedVms, result);
+    }
+
+    @Test
+    public void retryHttpRequestTest() throws IOException {
+        kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(1)).execute(Mockito.any());
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+    }
+
+    @Test
+    public void retryHttpRequestTestNullResponse() throws IOException {
+        Mockito.doReturn(null).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Assert.assertNull(response);
+    }
+
+    @Test
+    public void retryHttpRequestTestForbidden() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_FORBIDDEN, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestMultipleChoices() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_MULTIPLE_CHOICES, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestProcessing() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_PROCESSING, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestTimeout() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_GATEWAY_TIMEOUT, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestVersionNotSupported() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_HTTP_VERSION_NOT_SUPPORTED, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestOk() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_OK, false);
+    }
+
+    private void prepareAndRunRetryHttpRequestTest(int httpStatusCode, boolean expectNull) throws IOException {
+        HttpResponse mockedResponse = mockResponse(httpStatusCode, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        if (expectNull) {
+            Assert.assertNull(response);
+        } else {
+            Assert.assertEquals(mockedResponse, response);
+        }
+    }
+
+    @Test
+    public void retryHttpRequestTestHttpOk() throws IOException {
+        HttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse result = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        Assert.assertEquals(mockedResponse, result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestOneIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs));
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNotNull(result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestTwoIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenThrow(IOException.class);
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNull(result);
+    }
+
+    @Test
+    public void isKvmHaWebserviceEnabledTestDefault() {
+        Assert.assertFalse(kvmHaAgentClient.isKvmHaWebserviceEnabled());
+    }
+
+    @Test
+    public void getKvmHaMicroservicePortValueTestDefault() {
+        Assert.assertEquals(KVM_HA_WEBSERVICE_PORT, kvmHaAgentClient.getKvmHaMicroservicePortValue());
+    }
+
+//    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+//        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+//        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+//
+//        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+//        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+//
+//        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+//        Assert.assertEquals(expectedListedVms, result);
+//    }
+//TODO
+//    @Test
+//    public void isTargetHostReachableTest() {
+//        kvmHaAgentClient.isTargetHostReachable(PRIVATE_IP_ADDRESS);
+//    }

Review comment:
       Do we need commented code?

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet present on the host; therefore, this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  Otherwise: <br>
+     *  (i) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as it expected at least
+     *    2 VMs running; fencing/recovering the host would avoid downtime for the VMs in this case.<br>
+     *  (ii) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true, logs WARN messages, and avoids triggering
+     *    HA recovery/fencing, as it could be an inconsistency while migrating a VM.<br>
+     *  (iii) the number of listed VMs differs from the expected number: returns true and logs WARN messages so that Admins can monitor and react accordingly.
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false: no VMs were found running although more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST VMs were expected; fencing/recovering the host would avoid downtime for the VMs in this case.
+            // The cautious margin in the conditional avoids fencing/recovering a host when there is one VM migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        // In order to have a less "aggressive" health-check, the KvmHaAgentClient will not return false; fencing/recovering could bring downtime to existing VMs
+        // Additionally, the inconsistency can also be due to jobs in progress to migrate/stop/start VMs
+        // Either way, WARN messages should be presented to Admins so they can look closely to what is happening on the host
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, it was expected %d VMs.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {

Review comment:
       ```suggestion
           for (int attempt = 1; attempt <= MAX_REQUEST_RETRIES; attempt++) {
   ```
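
A minimal, self-contained sketch of the retry loop with the suggested bound is shown below. It is not the code from this PR: `RetryLoopSketch`, `retryUntilNonNull`, and the `Supplier`-based request are hypothetical stand-ins for `retryUntilGetsHttpResponse` and `client.execute`, and the sleep between attempts is left out. It only illustrates that `attempt <= maxRetries` performs the same number of attempts as `attempt < maxRetries + 1`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RetryLoopSketch {

    // Hypothetical stand-in for retryUntilGetsHttpResponse: calls the request up to
    // maxRetries times and returns the first non-null result, or null if every attempt fails.
    public static <T> T retryUntilNonNull(Supplier<T> request, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                T result = request.get();
                if (result != null) {
                    return result;
                }
            } catch (RuntimeException e) {
                // Assumption: a failed attempt is only reported and the loop moves on.
                System.out.println("Attempt " + attempt + " of " + maxRetries + " failed: " + e.getMessage());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The simulated request fails on the first call and succeeds on the second.
        String result = retryUntilNonNull(() -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated I/O failure");
            }
            return "ok";
        }, 2);
        System.out.println(result + " after " + calls.get() + " calls");
    }
}
```

With a limit of 2, the simulated request is called exactly twice, which matches what `retryUntilGetsHttpResponseTestOneIOException` verifies with `Mockito.times(MAX_REQUEST_RETRIES)`.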

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet present on the host; therefore, this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  Otherwise: <br>
+     *  (i) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as it expected at least
+     *    2 VMs running; fencing/recovering the host would avoid downtime for the VMs in this case.<br>
+     *  (ii) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true, logs WARN messages, and avoids triggering
+     *    HA recovery/fencing, as it could be an inconsistency while migrating a VM.<br>
+     *  (iii) the number of listed VMs differs from the expected number: returns true and logs WARN messages so that Admins can monitor and react accordingly.
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false: no VMs were found running although more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST VMs were expected; fencing/recovering the host would avoid downtime for the VMs in this case.
+            // The cautious margin in the conditional avoids fencing/recovering a host when there is one VM migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        // In order to have a less "aggressive" health-check, the KvmHaAgentClient will not return false; fencing/recovering could bring downtime to existing VMs
+        // Additionally, the inconsistency can also be due to jobs in progress to migrate/stop/start VMs
+        // Either way, WARN messages should be presented to Admins so they can look closely to what is happening on the host
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, it was expected %d VMs.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);
+            }
+        }
+        return null;
+    }
+
+    /**
+     * Processes the response of the GET request as a JSON object.<br>
+     * JSON example: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}<br><br>
+     *
+     * Note: this method returns null if the given HttpResponse is null.
+     */
+    protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
+        InputStream in;
+        String jsonString;
+        if (response == null) {
+            return null;
+        }
+        try {
+            in = response.getEntity().getContent();
+            BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
+            jsonString = streamReader.readLine();
+        } catch (UnsupportedOperationException | IOException e) {
+            throw new CloudRuntimeException("Failed to process response", e);
+        }
+
+        return new JsonParser().parse(jsonString).getAsJsonObject();

Review comment:
       If needed, could we improve the exception with some `response` context?
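
One possible shape for that, as a sketch only. It assumes the status code and reason phrase are read from the response's status line before the stream is consumed; `describeFailure` is a hypothetical helper name, and plain `RuntimeException` stands in for `CloudRuntimeException` so the snippet compiles with the JDK alone.

```java
import java.io.IOException;

final class ResponseContextSketch {

    // Hypothetical helper: puts the HTTP status of the response into the error message,
    // so the log shows what the webservice answered when the body could not be read.
    static String describeFailure(int statusCode, String reasonPhrase) {
        return String.format("Failed to process response [HTTP status: %d %s]", statusCode, reasonPhrase);
    }

    public static void main(String[] args) {
        // Stand-in values; in the client they would come from response.getStatusLine().
        IOException cause = new IOException("stream closed");
        RuntimeException e = new RuntimeException(describeFailure(200, "OK"), cause);
        System.out.println(e.getMessage());
    }
}
```

The original cause is still passed along, so the stack trace is not lost; only the message gains context.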

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a web client that checks, via a webserver running on the KVM host, which VMs are running according to Libvirt.
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes a ping command from the host running the KVM HA Agent webservice to a target IP address.
+     *  The webserver serves a JSON object such as {"status": "Up"} if the IP address is reachable, or {"status": "Down"} if it could not ping the IP.
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on the host according to the vm_instance DB table. The states considered for this listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in state 'Starting' are usually not yet present on the host; therefore, this method does not list them.
+     * There is still a chance that a VM in 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  Otherwise: <br>
+     *  (i) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as no running VM was found although at
+     *    least 2 were expected; fencing/recovering the host would avoid downtime to VMs in this case.<br>
+     *  (ii) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true and logs WARN messages; HA recovery/fencing is not
+     *    triggered because this could be an inconsistency caused by a VM migration.<br>
+     *  (iii) the number of listed VMs differs from the expected one: returns true and logs WARN messages so Admins can monitor and react accordingly.
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false: no running VM was found although more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST VMs were expected; fencing/recovering the host would avoid downtime to VMs in this case.
+            // A cautious margin is added to the conditional. This avoids fencing/recovering a host when a single VM is migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        // To keep the health check less "aggressive", the KvmHaAgentClient does not return false here; fencing/recovering could bring downtime to existing VMs.
+        // Additionally, the inconsistency can be due to in-progress jobs that migrate/stop/start VMs.
+        // Either way, WARN messages should be presented to Admins so they can look closely at what is happening on the host.
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, %d VMs were expected.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);
+            }
+        }
+        return null;
+    }
+
+    /**
+     * Processes the response of the GET request as a JSON object.<br>
+     * JSON example: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}<br><br>
+     *
+     * Note: this method returns null if the given HttpResponse is null.
+     */
+    protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
+        InputStream in;
+        String jsonString;
+        if (response == null) {
+            return null;
+        }
+        try {
+            in = response.getEntity().getContent();
+            BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
+            jsonString = streamReader.readLine();
+        } catch (UnsupportedOperationException | IOException e) {
+            throw new CloudRuntimeException("Failed to process response", e);
+        }
+
+        return new JsonParser().parse(jsonString).getAsJsonObject();

Review comment:
       ```suggestion
           if (response == null) {
               return null;
           }
   
           try {
               InputStream in = response.getEntity().getContent();
               BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
               String jsonString = streamReader.readLine();
               return new JsonParser().parse(jsonString).getAsJsonObject();
           } catch (UnsupportedOperationException | IOException e) {
               throw new CloudRuntimeException("Failed to process response", e);
           }
   ```
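
A further option on top of this suggestion, offered as a sketch rather than a required change: `readLine()` returns only the first line of the body (and `null` for an empty body), and the reader is never closed. Assuming the webservice could at some point answer with multi-line JSON, the body could be read in full inside a try-with-resources block. `readBody` is a hypothetical helper name, and `UncheckedIOException` stands in for `CloudRuntimeException` so the snippet needs only the JDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class ReadBodySketch {

    // Hypothetical helper: reads the whole stream as UTF-8 text and closes it afterwards.
    static String readBody(InputStream in) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Join all lines so that multi-line JSON is not truncated after the first line.
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read response body", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for response.getEntity().getContent().
        String json = "{\"count\": 3,\n \"virtualmachines\": [\"r-123-VM\", \"v-134-VM\", \"s-111-VM\"]}";
        InputStream in = new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(in));
    }
}
```

The returned string would then be handed to the JSON parser exactly as `jsonString` is today.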

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a web client that checks, via a webserver running on the KVM host, which VMs are running according to Libvirt.
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes a ping command from the host running the KVM HA Agent webservice to a target IP address.
+     *  The webserver serves a JSON object such as {"status": "Up"} if the IP address is reachable, or {"status": "Down"} if it could not ping the IP.
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on the host according to the vm_instance DB table. The states considered for this listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in state 'Starting' are usually not yet present on the host; therefore, this method does not list them.
+     * There is still a chance that a VM in 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  Otherwise: <br>
+     *  (i) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as no running VM was found although at
+     *    least 2 were expected; fencing/recovering the host would avoid downtime to VMs in this case.<br>
+     *  (ii) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true and logs WARN messages; HA recovery/fencing is not
+     *    triggered because this could be an inconsistency caused by a VM migration.<br>
+     *  (iii) the number of listed VMs differs from the expected one: returns true and logs WARN messages so Admins can monitor and react accordingly.
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false: no running VM was found although more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST VMs were expected; fencing/recovering the host would avoid downtime to VMs in this case.
+            // A cautious margin is added to the conditional. This avoids fencing/recovering a host when a single VM is migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        // To keep the health check less "aggressive", the KvmHaAgentClient does not return false here; fencing/recovering could bring downtime to existing VMs.
+        // Additionally, the inconsistency can be due to in-progress jobs that migrate/stop/start VMs.
+        // Either way, WARN messages should be presented to Admins so they can look closely at what is happening on the host.
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, %d VMs were expected.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);
+            }
+        }
+        return null;
+    }
+
+    /**
+     * Processes the response of request GET System ID as a JSON object.<br>
+     * Json example: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}<br><br>
+     *
+     * Note: this method can return NULL JsonObject in case HttpResponse is NULL.
+     */
+    protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
+        InputStream in;
+        String jsonString;
+        if (response == null) {
+            return null;
+        }
+        try {
+            in = response.getEntity().getContent();
+            BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
+            jsonString = streamReader.readLine();
+        } catch (UnsupportedOperationException | IOException e) {
+            throw new CloudRuntimeException("Failed to process response", e);
+        }
+
+        return new JsonParser().parse(jsonString).getAsJsonObject();

Review comment:
       It seems this method will be called several times. Should we extract this `new JsonParser()` to a constant?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849658265


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 103





[GitHub] [cloudstack] div8cn commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
div8cn commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-836211846


   Hi, this PR looks very good.
   
   I have a question: what happens if the management network link is interrupted (for example, by a switch failure) while the KVM storage network keeps running normally?
   
   Will this cause HA to be triggered for the VMs?
   
   I am worried that the image could then be "double written" (written by two hosts at once), which would damage the image.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866954522


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 340





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r666435432



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -77,57 +77,84 @@ public Status isAgentAlive(Host agent) {
             return haManager.getHostStatus(agent);
         }
 
-        List<StoragePoolVO> clusterPools = _storagePoolDao.listPoolsByCluster(agent.getClusterId());
-        boolean hasNfs = false;
-        for (StoragePoolVO pool : clusterPools) {
-            if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
-                hasNfs = true;
-                break;
-            }
+        Status agentStatus = Status.Disconnected;
+        boolean hasNfs = isHostServedByNfsPool(agent);
+        if (hasNfs) {
+            agentStatus = checkAgentStatusViaNfs(agent);
+            s_logger.debug(String.format("Agent investigation was requested on host %s. Agent status via NFS heartbeat is %s.", agent, agentStatus));
+        } else {
+            s_logger.debug(String.format("Agent investigation was requested on host %s, but host has no NFS storage. Skipping investigation via NFS.", agent));
+        }
+
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(agent);
+        if (isKvmHaWebserviceEnabled) {
+            agentStatus = kvmHaHelper.checkAgentStatusViaKvmHaAgent(agent, agentStatus);
         }
+
+        return agentStatus;
+    }
+
+    private boolean isHostServedByNfsPool(Host agent) {
+        boolean hasNfs = hasNfsPoolClusterWideForHost(agent);
         if (!hasNfs) {
-            List<StoragePoolVO> zonePools = _storagePoolDao.findZoneWideStoragePoolsByHypervisor(agent.getDataCenterId(), agent.getHypervisorType());
-            for (StoragePoolVO pool : zonePools) {
-                if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
-                    hasNfs = true;
-                    break;
-                }
+            hasNfs = hasNfsPoolZoneWideForHost(agent);
+        }
+        return hasNfs;
+    }
+
+    private boolean hasNfsPoolZoneWideForHost(Host agent) {
+        List<StoragePoolVO> zonePools = _storagePoolDao.findZoneWideStoragePoolsByHypervisor(agent.getDataCenterId(), agent.getHypervisorType());
+        for (StoragePoolVO pool : zonePools) {
+            if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
+                return true;
             }
         }
-        if (!hasNfs) {
-            s_logger.warn(
-                    "Agent investigation was requested on host " + agent + ", but host does not support investigation because it has no NFS storage. Skipping investigation.");
-            return Status.Disconnected;
+        return false;

Review comment:
       We could use `stream` methods to simplify this implementation:
   
   ```java
   return zonePools.stream().anyMatch(pool -> pool.getPoolType() == StoragePoolType.NetworkFilesystem);
   ``` 
   
   Also, this code is repeated right below in `hasNfsPoolClusterWideForHost`; we could extract it to a method and add unit tests.
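
A minimal sketch of the suggestion above, assuming the cluster-wide and zone-wide lookups can share one helper. `PoolType` and `Pool` are hypothetical stand-ins for `StoragePoolType` and `StoragePoolVO`, and `hasNfsPool` is an illustrative name that does not appear in the PR:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NfsPoolCheckSketch {

    // Hypothetical stand-in for CloudStack's StoragePoolType enum.
    enum PoolType { NetworkFilesystem, RBD, SharedMountPoint }

    // Hypothetical stand-in for StoragePoolVO; only the pool type matters here.
    static final class Pool {
        private final PoolType type;

        Pool(PoolType type) {
            this.type = type;
        }

        PoolType getPoolType() {
            return type;
        }
    }

    // Shared helper: true if at least one pool in the list is served via NFS.
    static boolean hasNfsPool(List<Pool> pools) {
        return pools.stream().anyMatch(pool -> pool.getPoolType() == PoolType.NetworkFilesystem);
    }

    public static void main(String[] args) {
        List<Pool> clusterPools = Arrays.asList(new Pool(PoolType.RBD), new Pool(PoolType.NetworkFilesystem));
        List<Pool> zonePools = Collections.singletonList(new Pool(PoolType.RBD));

        System.out.println(hasNfsPool(clusterPools)); // one NFS pool in the list
        System.out.println(hasNfsPool(zonePools));    // no NFS pool in the list
    }
}
```

Because the helper takes a plain list, it could be unit tested without mocking the DAO; the two callers would only differ in which DAO query feeds it.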

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -213,6 +270,17 @@ protected boolean verifyActivityOfStorageOnHost(HashMap<StoragePool, List<Volume
         return poolVolMap;
     }
 
+    private boolean isHostServedByNfsPool(Host agent) {
+        List<StoragePoolHostVO> storagesOnHost = storagePoolHostDao.listByHostId(agent.getId());
+        for (StoragePoolHostVO storagePoolHostRef : storagesOnHost) {
+            StoragePoolVO storagePool = this.storagePool.findById(storagePoolHostRef.getPoolId());
+            if (NFS_POOL_TYPE.contains(storagePool.getPoolType())) {
+                return true;
+            }
+        }
+        return false;

Review comment:
       We could use `stream().anyMatch` here.
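
One possible shape for this, sketched under assumptions: the DAO lookup `findById` is replaced by a plain `Map` so the example is self-contained, `PoolType` is a hypothetical stand-in for `StoragePoolType`, and the `Objects::nonNull` filter is an addition that the original loop does not have:

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class HostNfsLookupSketch {

    // Hypothetical stand-in for CloudStack's StoragePoolType enum.
    enum PoolType { NetworkFilesystem, RBD }

    // Mirrors the NFS_POOL_TYPE collection used in the diff; its real contents are an assumption.
    static final Set<PoolType> NFS_POOL_TYPE = EnumSet.of(PoolType.NetworkFilesystem);

    // poolTypeById stands in for the DAO lookup (findById followed by getPoolType).
    static boolean isHostServedByNfsPool(List<Long> poolIdsOnHost, Map<Long, PoolType> poolTypeById) {
        return poolIdsOnHost.stream()
                .map(poolTypeById::get)          // resolve each pool reference to its type
                .filter(Objects::nonNull)        // skip references whose pool no longer exists
                .anyMatch(NFS_POOL_TYPE::contains);
    }

    public static void main(String[] args) {
        Map<Long, PoolType> poolTypeById = new HashMap<>();
        poolTypeById.put(1L, PoolType.RBD);
        poolTypeById.put(2L, PoolType.NetworkFilesystem);

        System.out.println(isHostServedByNfsPool(Arrays.asList(1L, 2L), poolTypeById));
        System.out.println(isHostServedByNfsPool(Arrays.asList(1L, 99L), poolTypeById));
    }
}
```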




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] NuxRo commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
NuxRo commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-926583559


   @wido Thanks for the explanation. OK, so I'll do some more testing with non-NFS storage and see how it goes.
   
   BTW, are URLs such as "http://10.0.33.2:8080/check-neighbour/10.0.34.165:8080;" supposed to always return "Down" even though the hosts are up? 
   Both of my HVs above do this, but e.g. http://10.0.33.2:8080 will happily report the running VMs, and querying them from each other with curl also yields the expected results. I'm on Ubuntu 20.04.
   





[GitHub] [cloudstack] wido commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
wido commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-926556682


   > Ok, a couple of questions:
   > 
   > 1 - as Rohit asked, why can't we do the checks over SSH instead of running a separate service which might have security and other implications? Hypervisors are already expected to accept incoming SSH from the management server(s).
   
   No, we can't expect management servers to always be able to SSH into the hypervisors. We, for example, have disabled this, so the mgmt server can't SSH into the HV.
   
   This helper is there for the case where the CloudStack Agent crashes (for whatever reason) and the host is disconnected. Via this helper agent we have a second way of checking whether the Host is still alive and has running VMs.
   
   If this helper does not respond either, we can then fence off the host via OOB (IPMI or Redfish).
   
   Once the Host is Fenced we can safely start the VMs on a different host.
   
   > 2 - It's not 100% clear, people will still need to rely on the old NFS HA method, right? Your work merely adds additional checks. Any way we can get rid of NFS at this point?
   
   This additional method allows us to also provide HA when NOT using NFS. For example when using Ceph/RBD only.
   
   It is very, very, very much recommended to always use IPMI/Redfish with HA so that Hosts can be fenced properly.
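
The sequence described in this comment can be sketched roughly as follows. This is only an illustration: the enum values and the `decide` method are hypothetical names, and the sketch deliberately ignores the NFS heartbeat and the retry thresholds of the HA state machine:

```java
final class HostHaDecisionSketch {

    // Hypothetical outcomes; these names are not taken from the PR.
    enum Decision { HOST_HEALTHY, HOST_ALIVE_AGENT_DOWN, FENCE_THEN_START_VMS_ELSEWHERE }

    /**
     * agentConnected: the CloudStack Agent on the host is connected.
     * helperResponds: the KVM HA helper web service on the host answers.
     */
    static Decision decide(boolean agentConnected, boolean helperResponds) {
        if (agentConnected) {
            return Decision.HOST_HEALTHY;
        }
        if (helperResponds) {
            // Only the agent is down; the host is still alive, so it is not fenced.
            return Decision.HOST_ALIVE_AGENT_DOWN;
        }
        // Neither check succeeded: fence via OOB (IPMI or Redfish) first,
        // and only then start the VMs on a different host.
        return Decision.FENCE_THEN_START_VMS_ELSEWHERE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(false, true));
        System.out.println(decide(false, false));
    }
}
```

The ordering in the last branch is the point of the comment above: VMs are only restarted elsewhere once the host has been fenced.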
   





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-919710746









[GitHub] [cloudstack] sureshanaparti commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
sureshanaparti commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634364392



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -81,7 +98,63 @@ public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException
 
     @Override
     public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(r);
+        boolean isKvmHaWebserviceEnabled = isKvmHaWebserviceEnabled(r);
+
+        isHealthy = isHealthViaNfs(r);
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        //TODO
+
+
+        if (isVmActivtyOnHostViaKvmHaWebservice(r) && !isHealthy) {
+            isHealthy = true;
+        }
+
+        return isHealthy;
+    }
+
+    /**
+     * Checks the host health via a web-service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    private boolean isVmActivtyOnHostViaKvmHaWebservice(Host host) {
+        KvmHaAgentClient kvmHaAgentClient = new KvmHaAgentClient(host);
+        return kvmHaAgentClient.isKvmHaAgentHealthy(host, vmInstanceDao);
+    }
+
+    //TODO

Review comment:
       empty TODO comment ^^^







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005151790


   @sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-888274981


   @blueorangutan package





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849617919


   @blueorangutan package 





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634407103



##########
File path: plugins/hypervisors/kvm/src/test/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClientTest.java
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.http.HttpEntity;
+import org.apache.http.HttpResponse;
+import org.apache.http.HttpStatus;
+import org.apache.http.ProtocolVersion;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.entity.InputStreamEntity;
+import org.apache.http.message.BasicStatusLine;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.mockito.Mock;
+import org.mockito.Mockito;
+import org.mockito.junit.MockitoJUnitRunner;
+
+import com.cloud.host.HostVO;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.dao.VMInstanceDaoImpl;
+import com.google.gson.JsonArray;
+import com.google.gson.JsonElement;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+
+@RunWith(MockitoJUnitRunner.class)
+public class KvmHaAgentClientTest {
+
+    private static final int ERROR_CODE = -1;
+    private HostVO agent = Mockito.mock(HostVO.class);
+    private KvmHaAgentClient kvmHaAgentClient = Mockito.spy(new KvmHaAgentClient(agent));
+    private static final int DEFAULT_PORT = 8080;
+    private static final String PRIVATE_IP_ADDRESS = "1.2.3.4";
+    private static final String JSON_STRING_EXAMPLE_3VMs = "{\"count\":3,\"virtualmachines\":[\"r-123-VM\",\"v-134-VM\",\"s-111-VM\"]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_3VMs = 3;
+    private static final String JSON_STRING_EXAMPLE_0VMs = "{\"count\":0,\"virtualmachines\":[]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_0VMs = 0;
+    private static final String EXPECTED_URL = String.format("http://%s:%d", PRIVATE_IP_ADDRESS, DEFAULT_PORT);
+    private static final HttpRequestBase HTTP_REQUEST_BASE = new HttpGet(EXPECTED_URL);
+    private static final String VMS_COUNT = "count";
+    private static final String VIRTUAL_MACHINES = "virtualmachines";
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int KVM_HA_WEBSERVICE_PORT = 8080;
+
+    @Mock
+    HttpClient client;
+
+    @Mock
+    VMInstanceDaoImpl vmInstanceDao;
+
+    @Test
+    public void isKvmHaAgentHealthyTestAllGood() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestVMsDoNotMatchButDoNotReturnFalse() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 1);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestExpectedRunningVmsButNoneListed() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 0);
+        Assert.assertFalse(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestReceivedErrorCode() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, ERROR_CODE);
+        Assert.assertFalse(result);
+    }
+
+    private boolean isKvmHaAgentHealthyTests(int expectedNumberOfVms, int vmsRunningOnAgent) {
+        List<VMInstanceVO> vmsOnHostList = new ArrayList<>();
+        for (int i = 0; i < expectedNumberOfVms; i++) {
+            VMInstanceVO vmInstance = Mockito.mock(VMInstanceVO.class);
+            vmsOnHostList.add(vmInstance);
+        }
+
+        Mockito.doReturn(vmsOnHostList).when(kvmHaAgentClient).listVmsOnHost(Mockito.any(), Mockito.any());
+        Mockito.doReturn(vmsRunningOnAgent).when(kvmHaAgentClient).countRunningVmsOnAgent();
+
+        return kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestNull() {
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(null);
+        Assert.assertNull(responseJson);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTest() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_3VMs, 3l);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestOtherJsonExample() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_0VMs, 0l);
+    }
+
+    private void prepareAndTestProcessHttpResponseIntoJson(String jsonString, long expectedVmsCount) throws IOException {
+        CloseableHttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, jsonString);
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(mockedResponse);
+
+        Assert.assertNotNull(responseJson);
+        JsonElement jsonElementVmsCount = responseJson.get(VMS_COUNT);
+        JsonElement jsonElementVmsArray = responseJson.get(VIRTUAL_MACHINES);
+        JsonArray jsonArray = jsonElementVmsArray.getAsJsonArray();
+
+        Assert.assertEquals(expectedVmsCount, jsonArray.size());
+        Assert.assertEquals(expectedVmsCount, jsonElementVmsCount.getAsLong());
+        Assert.assertEquals(jsonString, responseJson.toString());
+    }
+
+    private CloseableHttpResponse mockResponse(int httpStatusCode, String jsonString) throws IOException {
+        BasicStatusLine basicStatusLine = new BasicStatusLine(new ProtocolVersion("HTTP", 1000, 123), httpStatusCode, "Status");
+        CloseableHttpResponse response = Mockito.mock(CloseableHttpResponse.class);
+        InputStream in = IOUtils.toInputStream(jsonString, StandardCharsets.UTF_8);
+        Mockito.when(response.getStatusLine()).thenReturn(basicStatusLine);
+        HttpEntity httpEntity = new InputStreamEntity(in);
+        Mockito.when(response.getEntity()).thenReturn(httpEntity);
+        return response;
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTest() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTestBlankNoVmsListed() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_0VMs, EXPECTED_RUNNING_VMS_EXAMPLE_0VMs);
+    }
+
+    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+
+        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+
+        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+        Assert.assertEquals(expectedListedVms, result);
+    }
+
+    @Test
+    public void retryHttpRequestTest() throws IOException {
+        kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(1)).execute(Mockito.any());
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+    }
+
+    @Test
+    public void retryHttpRequestTestNullResponse() throws IOException {
+        Mockito.doReturn(null).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Assert.assertNull(response);
+    }
+
+    @Test
+    public void retryHttpRequestTestForbidden() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_FORBIDDEN, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestMultipleChoices() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_MULTIPLE_CHOICES, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestProcessing() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_PROCESSING, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestTimeout() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_GATEWAY_TIMEOUT, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestVersionNotSupported() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_HTTP_VERSION_NOT_SUPPORTED, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestOk() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_OK, false);
+    }
+
+    private void prepareAndRunRetryHttpRequestTest(int scMultipleChoices, boolean expectNull) throws IOException {
+        HttpResponse mockedResponse = mockResponse(scMultipleChoices, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        if (expectNull) {
+            Assert.assertNull(response);
+        } else {
+            Assert.assertEquals(mockedResponse, response);
+        }
+    }
+
+    @Test
+    public void retryHttpRequestTestHttpOk() throws IOException {
+        HttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse result = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        Assert.assertEquals(mockedResponse, result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestOneIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs));
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNotNull(result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestTwoIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenThrow(IOException.class);
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNull(result);
+    }
+
+    @Test
+    public void isKvmHaWebserviceEnabledTestDefault() {
+        Assert.assertFalse(kvmHaAgentClient.isKvmHaWebserviceEnabled());
+    }
+
+    @Test
+    public void getKvmHaMicroservicePortValueTestDefault() {
+        Assert.assertEquals(KVM_HA_WEBSERVICE_PORT, kvmHaAgentClient.getKvmHaMicroservicePortValue());
+    }
+
+//    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+//        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+//        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+//
+//        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+//        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+//
+//        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+//        Assert.assertEquals(expectedListedVms, result);
+//    }
+//TODO
+//    @Test
+//    public void isTargetHostReachableTest() {
+//        kvmHaAgentClient.isTargetHostReachable(PRIVATE_IP_ADDRESS);
+//    }

Review comment:
       @GutoVeronezi No, we don't. Sorry, I forgot to convert this PR to a draft. I am adding some checks to enhance the HA validation.
   
   The idea here is to verify whether the Suspect Host is reachable by other hosts in the cluster, thereby ruling out network issues between the mgmt server and the KVM agents and avoiding fencing / rebooting in such cases.
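
A rough sketch of that idea, under stated assumptions: host names are plain strings, `canReach` is a hypothetical probe (in practice it might be a call to the helper web service on each neighbour), and `isReachableByAnyNeighbour` is an illustrative name rather than the commented-out `isTargetHostReachable` method from the test:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

final class NeighbourCheckSketch {

    /**
     * Returns true if at least one other host in the cluster can reach the suspect host.
     * canReach.test(neighbour, suspect) is a hypothetical probe supplied by the caller.
     */
    static boolean isReachableByAnyNeighbour(String suspectHost, List<String> clusterHosts,
            BiPredicate<String, String> canReach) {
        return clusterHosts.stream()
                .filter(host -> !host.equals(suspectHost)) // a host is not its own neighbour
                .anyMatch(neighbour -> canReach.test(neighbour, suspectHost));
    }

    public static void main(String[] args) {
        List<String> clusterHosts = Arrays.asList("host-a", "host-b", "host-c");

        // Simulated probe: only host-b can still reach host-c.
        BiPredicate<String, String> probe =
                (neighbour, suspect) -> neighbour.equals("host-b") && suspect.equals("host-c");

        System.out.println(isReachableByAnyNeighbour("host-c", clusterHosts, probe));
        System.out.println(isReachableByAnyNeighbour("host-a", clusterHosts, probe));
    }
}
```

If a neighbour can still reach the suspect host, the failure is more likely on the path between the management server and the agent, which is the case where fencing should be avoided.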







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063215620


   @nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899843852


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian. SL-JID 886





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-919711286


   @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-863660571









[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849639413


   @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] rhtyd closed pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd closed pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978


   





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-867459456


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 352







[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r689693904



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ration of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        double acceptedProblematicHostsRatio = KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.valueIn(host.getClusterId());
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * acceptedProblematicHostsRatio);
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor);
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+        if (!isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))) {
+            return true;
+        }
+        return isKvmHaAgentHealthy;
+    }
+
+    /**
+     * Checks if the KVM HA webservice is enabled. One can enable or disable it via global settings 'kvm.ha.webservice.enabled'.
+     */
+    public boolean isKvmHaWebserviceEnabled(Host host) {
+        boolean isKvmHaWebserviceEnabled = KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+        if (!isKvmHaWebserviceEnabled) {
+            LOGGER.debug(String.format("Skipping KVM HA web-service verification for %s due to 'kvm.ha.webservice.enabled' not enabled.", host));
+            return false;
+        }
+        return true;

Review comment:
       We could remove both returns and just add a `return isKvmHaWebserviceEnabled;` at the end.
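
A minimal sketch of that simplification, assuming the debug message is kept. The `BooleanSupplier` is a hypothetical stand-in for `KVMHAConfig.IsKvmHaWebserviceEnabled.value()`, the host is reduced to a `String`, and `System.out` replaces the logger so that the snippet compiles on its own.

```java
import java.util.function.BooleanSupplier;

final class WebserviceFlagSketch {

    // Hypothetical stand-in for KVMHAConfig.IsKvmHaWebserviceEnabled.value().
    private final BooleanSupplier enabledSetting;

    WebserviceFlagSketch(BooleanSupplier enabledSetting) {
        this.enabledSetting = enabledSetting;
    }

    boolean isKvmHaWebserviceEnabled(String host) {
        boolean isKvmHaWebserviceEnabled = enabledSetting.getAsBoolean();
        if (!isKvmHaWebserviceEnabled) {
            // The PR logs this at debug level via log4j; printing keeps the sketch dependency-free.
            System.out.println(String.format(
                    "Skipping KVM HA web-service verification for %s due to 'kvm.ha.webservice.enabled' not enabled.", host));
        }
        // Single exit point: the flag itself is the result.
        return isKvmHaWebserviceEnabled;
    }

    public static void main(String[] args) {
        System.out.println(new WebserviceFlagSketch(() -> false).isKvmHaWebserviceEnabled("host-1"));
        System.out.println(new WebserviceFlagSketch(() -> true).isKvmHaWebserviceEnabled("host-1"));
    }
}
```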

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ration of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        double acceptedProblematicHostsRatio = KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.valueIn(host.getClusterId());
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * acceptedProblematicHostsRatio);
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {

Review comment:
       We could invert this `if` to reduce indentation.
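
A sketch of the inverted condition, written as a guard clause. Hosts are reduced to `String` names, and the two predicates are hypothetical stand-ins for `isKvmHaAgentHealthy(neighbor)` and `kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host)`; only the control flow is meant to carry over.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

final class NeighbourCheckSketch {

    static boolean isHostAgentReachableByNeighbour(String host, List<String> neighbours,
            Predicate<String> isKvmHaAgentHealthy, BiPredicate<String, String> isHostReachableByNeighbour) {
        for (String neighbour : neighbours) {
            // Guard clause: skip neighbours whose own HA agent is not healthy.
            if (!isKvmHaAgentHealthy.test(neighbour)) {
                continue;
            }
            if (isHostReachableByNeighbour.test(neighbour, host)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> neighbours = Arrays.asList("kvm-a", "kvm-b");
        boolean reachable = isHostAgentReachableByNeighbour("kvm-suspect", neighbours,
                neighbour -> neighbour.equals("kvm-b"),
                (neighbour, suspect) -> true);
        System.out.println(reachable);
    }
}
```

The loop body drops one nesting level, and the behavior stays the same: the first healthy neighbour that reaches the host ends the search.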







[GitHub] [cloudstack] nvazquez commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r831716668



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAConfig.java
##########
@@ -53,4 +53,32 @@
     public static final ConfigKey<Long> KvmHAFenceTimeout = new ConfigKey<>("Advanced", Long.class, "kvm.ha.fence.timeout", "60",
             "The maximum length of time, in seconds, expected for a fence operation to complete.", true, ConfigKey.Scope.Cluster);
 
+    public static final ConfigKey<Integer> KvmHaWebservicePort = new ConfigKey<Integer>("Advanced", Integer.class, "kvm.ha.webservice.port", "8443",
+            "It sets the port used to communicate with the KVM HA Agent Microservice that is running on KVM nodes. Default value is 8443.",
+            true, ConfigKey.Scope.Cluster);
+
+    public static final ConfigKey<Boolean> IsKvmHaWebserviceEnabled = new ConfigKey<Boolean>("Advanced", Boolean.class, "kvm.ha.webservice.enabled", "false",

Review comment:
       @GabrielBrascher thanks, maybe it can be simply renamed to `kvm.ha.webservice.check.enabled` or similar?







[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1056957114


   @nvazquez I am just running a few tests to make sure I am addressing the review from @rohityadavcloud and @PaulAngus.
   I will be adding HTTPS + Basic authentication to restrict who is able to retrieve the list of VMs from the hosts.
   
   I hope to have it back in "ready for review" status in 5-10 days.
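
For context, a sketch of what Basic authentication adds on the client side: an `Authorization` header carrying the Base64-encoded `user:password` pair, which is only safe to send over HTTPS. The class name and the credentials are placeholders; nothing here reflects how the PR stores or configures them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicAuthHeaderSketch {

    /** Builds the value of the HTTP Authorization header for Basic authentication. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials, for illustration only.
        System.out.println("Authorization: " + basicAuthHeader("user", "pass"));
    }
}
```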





[GitHub] [cloudstack] wido commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
wido commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r627371726



##########
File path: packaging/systemd/cloudstack-agent-ha-helper.service
##########
@@ -0,0 +1,36 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Do not modify this file as your changes will be lost in the next CSM update.
+# If you need to add specific dependencies to this service unit do it in the
+# /etc/systemd/system/cloudstack-management.service.d/ directory
+
+[Unit]
+Description=CloudStack Agent HA Helper
+Documentation=http://www.cloudstack.org/
+Requires=libvirtd.service
+After=libvirtd.service
+
+[Service]
+Type=simple
+EnvironmentFile=/etc/default/cloudstack-agent-ha-helper

Review comment:
       This will not work on RHEL, as RHEL uses:
   
   <pre>/etc/sysconfig</pre>
   
   for this configuration. We need something similar to the cloudstack-agent package, where we have different systemd files for CentOS and Ubuntu.
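
One possible alternative to separate unit files, offered only as a sketch and not what this PR does: systemd treats an `EnvironmentFile=` path prefixed with `-` as optional, so a single unit could list both locations. The `/etc/sysconfig/cloudstack-agent-ha-helper` file name is an assumption.

```ini
[Service]
Type=simple
# A leading "-" makes systemd skip the file silently if it does not exist.
EnvironmentFile=-/etc/default/cloudstack-agent-ha-helper
EnvironmentFile=-/etc/sysconfig/cloudstack-agent-ha-helper
```

The trade-off is that a missing configuration file no longer stops the service from starting, so the service itself would need sensible defaults.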







[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r643946838



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       Thanks for bringing it up @GutoVeronezi, these are pertinent questions.
   
   This is a perfect case of _Pokémon catch_ (catch them all). To be honest, I cannot even find how an Exception could be raised by `AgentManagementImpl.easySend` as the method itself handles `Exception`.
   
   I was inclined to simply remove it; however, I avoided changing any behavior of the HeartBeat execution flow to prevent regressions.
   
   I will double-check this `easySend` catch.
   Either way, it is worth paying attention to: if I do not change it in this PR, we can always plan another PR/issue to clean up some of these catches.







[GitHub] [cloudstack] nvazquez commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866908273


   @blueorangutan package





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-914946508


   @blueorangutan test 





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897071665


   @blueorangutan package





[GitHub] [cloudstack] sureshanaparti commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
sureshanaparti commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634369401



##########
File path: plugins/hypervisors/kvm/src/test/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClientTest.java
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.http.HttpEntity;
+import org.apache.http.HttpResponse;
+import org.apache.http.HttpStatus;
+import org.apache.http.ProtocolVersion;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.entity.InputStreamEntity;
+import org.apache.http.message.BasicStatusLine;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.mockito.Mock;
+import org.mockito.Mockito;
+import org.mockito.junit.MockitoJUnitRunner;
+
+import com.cloud.host.HostVO;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.dao.VMInstanceDaoImpl;
+import com.google.gson.JsonArray;
+import com.google.gson.JsonElement;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+
+@RunWith(MockitoJUnitRunner.class)
+public class KvmHaAgentClientTest {
+
+    private static final int ERROR_CODE = -1;
+    private HostVO agent = Mockito.mock(HostVO.class);
+    private KvmHaAgentClient kvmHaAgentClient = Mockito.spy(new KvmHaAgentClient(agent));
+    private static final int DEFAULT_PORT = 8080;
+    private static final String PRIVATE_IP_ADDRESS = "1.2.3.4";
+    private static final String JSON_STRING_EXAMPLE_3VMs = "{\"count\":3,\"virtualmachines\":[\"r-123-VM\",\"v-134-VM\",\"s-111-VM\"]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_3VMs = 3;
+    private static final String JSON_STRING_EXAMPLE_0VMs = "{\"count\":0,\"virtualmachines\":[]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_0VMs = 0;
+    private static final String EXPECTED_URL = String.format("http://%s:%d", PRIVATE_IP_ADDRESS, DEFAULT_PORT);
+    private static final HttpRequestBase HTTP_REQUEST_BASE = new HttpGet(EXPECTED_URL);
+    private static final String VMS_COUNT = "count";
+    private static final String VIRTUAL_MACHINES = "virtualmachines";
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int KVM_HA_WEBSERVICE_PORT = 8080;
+
+    @Mock
+    HttpClient client;
+
+    @Mock
+    VMInstanceDaoImpl vmInstanceDao;
+
+    @Test
+    public void isKvmHaAgentHealthyTestAllGood() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestVMsDoNotMatchButDoNotReturnFalse() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 1);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestExpectedRunningVmsButNoneListed() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 0);
+        Assert.assertFalse(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestReceivedErrorCode() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, ERROR_CODE);
+        Assert.assertFalse(result);
+    }
+
+    private boolean isKvmHaAgentHealthyTests(int expectedNumberOfVms, int vmsRunningOnAgent) {
+        List<VMInstanceVO> vmsOnHostList = new ArrayList<>();
+        for (int i = 0; i < expectedNumberOfVms; i++) {
+            VMInstanceVO vmInstance = Mockito.mock(VMInstanceVO.class);
+            vmsOnHostList.add(vmInstance);
+        }
+
+        Mockito.doReturn(vmsOnHostList).when(kvmHaAgentClient).listVmsOnHost(Mockito.any(), Mockito.any());
+        Mockito.doReturn(vmsRunningOnAgent).when(kvmHaAgentClient).countRunningVmsOnAgent();
+
+        return kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestNull() {
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(null);
+        Assert.assertNull(responseJson);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTest() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_3VMs, 3l);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestOtherJsonExample() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_0VMs, 0l);
+    }
+
+    private void prepareAndTestProcessHttpResponseIntoJson(String jsonString, long expectedVmsCount) throws IOException {
+        CloseableHttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, jsonString);
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(mockedResponse);
+
+        Assert.assertNotNull(responseJson);
+        JsonElement jsonElementVmsCount = responseJson.get(VMS_COUNT);
+        JsonElement jsonElementVmsArray = responseJson.get(VIRTUAL_MACHINES);
+        JsonArray jsonArray = jsonElementVmsArray.getAsJsonArray();
+
+        Assert.assertEquals(expectedVmsCount, jsonArray.size());
+        Assert.assertEquals(expectedVmsCount, jsonElementVmsCount.getAsLong());
+        Assert.assertEquals(jsonString, responseJson.toString());
+    }
+
+    private CloseableHttpResponse mockResponse(int httpStatusCode, String jsonString) throws IOException {
+        BasicStatusLine basicStatusLine = new BasicStatusLine(new ProtocolVersion("HTTP", 1000, 123), httpStatusCode, "Status");
+        CloseableHttpResponse response = Mockito.mock(CloseableHttpResponse.class);
+        InputStream in = IOUtils.toInputStream(jsonString, StandardCharsets.UTF_8);
+        Mockito.when(response.getStatusLine()).thenReturn(basicStatusLine);
+        HttpEntity httpEntity = new InputStreamEntity(in);
+        Mockito.when(response.getEntity()).thenReturn(httpEntity);
+        return response;
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTest() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTestBlankNoVmsListed() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_0VMs, EXPECTED_RUNNING_VMS_EXAMPLE_0VMs);
+    }
+
+    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+
+        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+
+        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+        Assert.assertEquals(expectedListedVms, result);
+    }
+
+    @Test
+    public void retryHttpRequestTest() throws IOException {
+        kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(1)).execute(Mockito.any());
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+    }
+
+    @Test
+    public void retryHttpRequestTestNullResponse() throws IOException {
+        Mockito.doReturn(null).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Assert.assertNull(response);
+    }
+
+    @Test
+    public void retryHttpRequestTestForbidden() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_FORBIDDEN, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestMultipleChoices() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_MULTIPLE_CHOICES, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestProcessing() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_PROCESSING, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestTimeout() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_GATEWAY_TIMEOUT, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestVersionNotSupported() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_HTTP_VERSION_NOT_SUPPORTED, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestOk() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_OK, false);
+    }
+
+    private void prepareAndRunRetryHttpRequestTest(int scMultipleChoices, boolean expectNull) throws IOException {
+        HttpResponse mockedResponse = mockResponse(scMultipleChoices, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        if (expectNull) {
+            Assert.assertNull(response);
+        } else {
+            Assert.assertEquals(mockedResponse, response);
+        }
+    }
+
+    @Test
+    public void retryHttpRequestTestHttpOk() throws IOException {
+        HttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse result = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        Assert.assertEquals(mockedResponse, result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestOneIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs));
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNotNull(result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestTwoIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenThrow(IOException.class);
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNull(result);
+    }
+
+    @Test
+    public void isKvmHaWebserviceEnabledTestDefault() {
+        Assert.assertFalse(kvmHaAgentClient.isKvmHaWebserviceEnabled());
+    }
+
+    @Test
+    public void getKvmHaMicroservicePortValueTestDefault() {
+        Assert.assertEquals(KVM_HA_WEBSERVICE_PORT, kvmHaAgentClient.getKvmHaMicroservicePortValue());
+    }
+
+//    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+//        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+//        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+//
+//        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+//        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+//
+//        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+//        Assert.assertEquals(expectedListedVms, result);
+//    }
+//TODO
+//    @Test
+//    public void isTargetHostReachableTest() {
+//        kvmHaAgentClient.isTargetHostReachable(PRIVATE_IP_ADDRESS);
+//    }

Review comment:
       Unused (commented-out) code ^^^; it can be removed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r627433177



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -85,12 +89,35 @@ public Status isAgentAlive(Host agent) {
                 break;
             }
         }
-        if (!hasNfs) {
-            s_logger.warn(
-                    "Agent investigation was requested on host " + agent + ", but host does not support investigation because it has no NFS storage. Skipping investigation.");
-            return Status.Disconnected;
+        Status agentStatus = Status.Disconnected;
+        if (hasNfs) {
+            agentStatus = checkAgentStatusViaNfs(agent);
+            s_logger.debug(String.format("Agent investigation was requested on host %s. Agent status via NFS heartbeat is %s.", agent, agentStatus));
+        } else {
+            s_logger.debug(String.format("Agent investigation was requested on host %s, but host has no NFS storage. Skipping investigation via NFS.", agent));
         }
 
+        agentStatus = checkAgentStatusViaKvmHaAgent(agent, agentStatus);
+
+        return agentStatus;
+    }
+
+    /**
+     * It checks the KVM node healthy via KVM HA Agent. If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    private Status checkAgentStatusViaKvmHaAgent(Host agent, Status agentStatus) {
+        KvmHaAgentClient kvmHaAgentClient = new KvmHaAgentClient(agent);
+        boolean isVmsCountOnKvmMatchingWithDatabase = kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+        if(isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            s_logger.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected."));
+        } else {
+            s_logger.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent"));

Review comment:
       The code has been updated to address these missing `String.format` parameters. Thanks @wido!
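
For context, the two log statements in the hunk above contain a `%s` placeholder but pass no argument to `String.format`. In standard Java this fails at runtime with `java.util.MissingFormatArgumentException` rather than printing a literal `%s`. The sketch below reproduces the pitfall in isolation; the class and method names are hypothetical and not part of the PR.

```java
import java.util.MissingFormatArgumentException;

// Hypothetical helper class, used only to illustrate the String.format pitfall.
final class FormatArgsSketch {

    // Mirrors the reviewed log line: a %s placeholder with no argument supplied.
    static String withoutArgument() {
        return String.format("Checking agent %s status; KVM HA Agent is Running as expected.");
    }

    // Supplies the argument that the placeholder expects.
    static String withArgument(Object agent) {
        return String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agent);
    }

    // Returns true if the call without arguments fails at runtime.
    static boolean missingArgumentThrows() {
        try {
            withoutArgument();
            return false;
        } catch (MissingFormatArgumentException e) {
            return true;
        }
    }
}
```

Passing `agent` as the argument, as in the other log statements of the same hunk, avoids the exception.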







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876010424


   Packaging result: :heavy_multiplication_x: el7 :heavy_multiplication_x: el8 :heavy_check_mark: debian. SL-JID 488





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r629568449



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -151,20 +203,33 @@ private boolean isVMActivtyOnHost(Host agent, DateTime suspectTime) throws HAChe
         if (agent.getHypervisorType() != Hypervisor.HypervisorType.KVM && agent.getHypervisorType() != Hypervisor.HypervisorType.LXC) {
             throw new IllegalStateException(String.format("Calling KVM investigator for non KVM Host of type [%s].", agent.getHypervisorType()));
         }
-        boolean activityStatus = true;
-        HashMap<StoragePool, List<Volume>> poolVolMap = getVolumeUuidOnHost(agent);
-        for (StoragePool pool : poolVolMap.keySet()) {
-            activityStatus = verifyActivityOfStorageOnHost(poolVolMap, pool, agent, suspectTime, activityStatus);
-            if (!activityStatus) {
-                LOG.warn(String.format("It seems that the storage pool [%s] does not have activity on %s.", pool.getId(), agent.toString()));
-                break;
+        boolean activityStatus = false;
+        if (isHostServedByNfsPool(agent)) {
+            HashMap<StoragePool, List<Volume>> poolVolMap = getVolumeUuidOnHost(agent);
+            for (StoragePool pool : poolVolMap.keySet()) {
+                if (Storage.StoragePoolType.NetworkFilesystem == pool.getPoolType() || Storage.StoragePoolType.ManagedNFS == pool.getPoolType()) {

Review comment:
       We could create a collection and check whether it contains the pool type, for example:
   ```java
   private final Set<Storage.StoragePoolType> <insertNameHere> = new HashSet<>(Arrays.asList(Storage.StoragePoolType.NetworkFilesystem, Storage.StoragePoolType.ManagedNFS));
   ...
   if (<insertNameHere>.contains(pool.getPoolType())) {
   ```
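
A self-contained variant of this suggestion might look like the sketch below. It is an illustration only: `PoolType` is a stand-in for `Storage.StoragePoolType` with just a few values, and `NFS_POOL_TYPES` and `isNfsPool` are hypothetical names. It also uses `EnumSet` instead of the `HashSet` proposed above, which is a common choice for sets of enum constants.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; not the code from the PR.
final class NfsPoolTypeSketch {

    // Stand-in for Storage.StoragePoolType, reduced to a few values for illustration.
    enum PoolType { NetworkFilesystem, ManagedNFS, RBD, CLVM }

    // Hypothetical constant name: the pool types that count as NFS-backed.
    static final Set<PoolType> NFS_POOL_TYPES = EnumSet.of(PoolType.NetworkFilesystem, PoolType.ManagedNFS);

    // Replaces the chained "==" comparisons with a single membership check.
    static boolean isNfsPool(PoolType type) {
        return NFS_POOL_TYPES.contains(type);
    }
}
```

Adding another NFS-like pool type then only requires extending the set, not every conditional that checks it.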







[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655677278



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GutoVeronezi I decided to remove this catch.
   Looking at `easySend`, it already has enough catch blocks. If it does not catch the exception, I don't know what would catch it:
   
   ```java
   public Answer easySend(final Long hostId, final Command cmd) {
       try {
           ...
       } catch (final AgentUnavailableException e) {
           s_logger.warn(e.getMessage());
           return null;
       } catch (final OperationTimedoutException e) {
           s_logger.warn("Operation timed out: " + e.getMessage());
           return null;
       } catch (final Exception e) {
           s_logger.warn("Exception while sending", e);
           return null;
       }
   }
   ```
   
   For reference: [AgentManagerImpl.java#L938](https://github.com/apache/cloudstack/blob/4f6851f4c057a9524231e75285ba2f5257ff640b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java#L938)
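
The reasoning can be sketched in isolation as follows. This is a simplified model, not CloudStack code: `easySend` here is a hypothetical stand-in that takes a `Supplier` instead of a host ID and a command, and `investigate` is an invented caller. It only shows that when the callee already converts every exception into a `null` result, the caller needs a null check rather than another try/catch around the call.

```java
import java.util.function.Supplier;

// Hypothetical, simplified model of the call chain discussed above.
final class EasySendSketch {

    // Stand-in for easySend: any exception is swallowed and null is returned.
    static String easySend(Supplier<String> command) {
        try {
            return command.get();
        } catch (Exception e) {
            return null;
        }
    }

    // Caller without a try/catch: a null answer is the only failure signal it has to handle.
    static String investigate(Supplier<String> command) {
        String answer = easySend(command);
        return answer == null ? "no answer" : answer;
    }
}
```

Note that this only covers exceptions raised inside the call itself; code that dereferences the answer afterwards still has to guard against `null`.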







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866301627


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634415280



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs on state 'Starting' are not common to be at the host, therefore this method does not list them.
+     * However, there is still a probability of a VM in 'Starting' state be already listed on the KVM via '$virsh list',
+     * but that's not likely and thus it is not relevant for this very context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true in case of the expected number of VMs matches with the VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  IF: <br>
+     *  (i) KVM HA agent finds 0 running but CloudStack considers that the host has 2 or more VMs running: returns false as could not find VMs running but it expected at least
+     *    2 VMs running, fencing/recovering host would avoid downtime to VMs in this case.<br>
+     *  (ii) KVM HA agent finds 0 VM running but CloudStack considers that the host has 1 VM running: return true and log WARN messages and avoids triggering HA recovery/fencing
+     *    when it could be a inconsistency when migrating a VM.<br>
+     *  (iii) amount of listed VMs is different than expected: return true and print WARN messages so Admins can monitor and react accordingly
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName(), numberOfVmsOnHostAccordingToDb));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false as could not find VMs running but it expected at least one VM running, fencing/recovering host would avoid downtime to VMs in this case.
+            // There is cautious margin added on the conditional. This avoids fencing/recovering hosts when there is one VM migrating to a host that had zero VMs.
+            // If there are more VMs than the CAUTIOUS_MARGIN_OF_VMS_ON_HOST) the Host should be treated as not healthy and fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName(), numberOfVmsOnHostAccordingToDb));
+            return false;
+        }
+        // In order to have a less "aggressive" health-check, the KvmHaAgentClient will not return false; fencing/recovering could bring downtime to existing VMs
+        // Additionally, the inconsistency can also be due to jobs in progress to migrate/stop/start VMs
+        // Either way, WARN messages should be presented to Admins so they can look closely to what is happening on the host
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, it was expected %d VMs.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);
+            }
+        }
+        return null;
+    }
+
+    /**
+     * Processes the response of request GET System ID as a JSON object.<br>
+     * Json example: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}<br><br>
+     *
+     * Note: this method can return NULL JsonObject in case HttpResponse is NULL.
+     */
+    protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
+        InputStream in;
+        String jsonString;
+        if (response == null) {
+            return null;
+        }
+        try {
+            in = response.getEntity().getContent();
+            BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
+            jsonString = streamReader.readLine();
+        } catch (UnsupportedOperationException | IOException e) {
+            throw new CloudRuntimeException("Failed to process response", e);
+        }
+
+        return new JsonParser().parse(jsonString).getAsJsonObject();

Review comment:
       Good point, I am going to update it on the next commit :+1:
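
A side note on the quoted `processHttpResponseIntoJson`: it calls `readLine()` once, so only the first line of the response body reaches the parser. That works for a single-line body like the Javadoc example, but a pretty-printed (multi-line) JSON body would be truncated. A minimal sketch of reading the whole body instead, assuming nothing about the final implementation (`ResponseBodyReader` and `readAll` are hypothetical names, not part of the PR):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class ResponseBodyReader {

    // Hypothetical helper: reads every line of the stream, not just the first one.
    static String readAll(InputStream in) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // lines() streams the remaining lines; they are re-joined with '\n'.
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A multi-line body; readLine() alone would return only "{".
        String pretty = "{\n  \"count\": 0,\n  \"virtualmachines\": []\n}";
        InputStream in = new ByteArrayInputStream(pretty.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in));
    }
}
```

The resulting string could then be handed to the JSON parser exactly as the single line is today.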




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849637111


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 102





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870560977


   @rhtyd I will check the CentOS packaging; it still fails to build.
   The Deb packages for the new ha-helper service build successfully.





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-934173009


   > You are right, this is something to be careful about.
   > We've configured the service in a way that it always starts on boot and if the process/job is killed for any reason it gets restarted as well. The only way of stopping it is via systemd (e.g. systemctl stop cloudstack-hahelper.service)
   
   Could you maybe explore systemd itself? There are ways to use dependencies and `targets` to ensure the agent is always up unless explicitly stopped by the admin. For example, there is also a restart-on-failure option (https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=).
   
   > We did not implement such a way of telling that the agent has been "intentionally stopped". This would rely on Admins disabling it on the CloudStack side.
   > I will need to add some information in the documentation about how to handle the cluster with this agent.
   
   See above: most admins may not remember this feature, and I wonder if stopping an agent to do maintenance work could cause side effects. Maybe look at my suggestion above on exploiting systemd features. Docs +1
   
   > I can look into a way of adding CA certificates and validate the communications. For now, it has no such validation; however, it binds only with the node IP in the management network (which in theory is an isolated/secure network).
   
   I think if adding a new service for this feature is unavoidable, we should absolutely (a) have the service use CA-framework-issued certificates so that it serves over secured TLS/SSL (i.e. HTTPS), (b) provide a default-off option (which you've confirmed exists via a cluster-scope global setting), and (c) either have the firewall configured when the agent (or the service/process?) starts, or document how to use this service (i.e. enable port 8080). (It is probably not a good idea to expose the whole of libvirtd over the network, but one option may involve just exposing libvirtd over TLS/SSH to other neighbour hosts, see https://libvirt.org/remote.html, which would not require any additional services.)





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005686488


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005685909


   @blueorangutan package





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005204385


   Packaging result: :heavy_multiplication_x: el7 :heavy_multiplication_x: el8 :heavy_multiplication_x: debian :heavy_multiplication_x: suse15. SL-JID 2087





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-914946691


   @rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-911728760


   Thanks for the tests @DaanHoogland. I am checking on them.
   Also, I am constantly hammering the HA implementation on a staging CloudStack DC with Redfish OOBM + KVM nodes.
   
   This is interesting: a different result, but it also impacts the HA tests.
   
   #### Prior tests had:
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_hostha_enable_ha_when_host_disabled | `Error` | 1.11 | test_hostha_kvm.py
   test_hostha_enable_ha_when_host_in_maintenance | `Error` | 303.92 | test_hostha_kvm.py
   
   - case 1, disabled: the host was in the "wrong" state (Maintenance), which does not allow it to be disabled.
   ` updatehost failed, due to: errorCode: 530, errorText:Failed to update host:2,No next resource state found for current state = Maintenance event = Disable`
   - case 2, Maintenance: the test tried to place a host in maintenance, but the host was already in maintenance.
   `Failed to prepare host for maintenance due to: Host is already in state Maintenance.`
   
   I assume that some of the tests in the CI put a host in maintenance and this was not cleaned up in time for these 2 tests, which need a host that is not in Maintenance.
   
   
   #### Recent CI tests
   For the new CI spin, the test failing is a different one (I still need to check the logs):
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_hostha_kvm_host_degraded | `Failure` | 768.71 | test_hostha_kvm.py





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899504564


   <b>Trillian test result (tid-1645)</b>
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 35273 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4978-t1645-kvm-centos7.zip
   Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
   Smoke tests completed. 88 look OK, 1 have error(s)
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_hostha_enable_ha_when_host_disabled | `Error` | 1.11 | test_hostha_kvm.py
   test_hostha_enable_ha_when_host_in_maintenance | `Error` | 303.92 | test_hostha_kvm.py
   





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r678261232



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -77,57 +77,84 @@ public Status isAgentAlive(Host agent) {
             return haManager.getHostStatus(agent);
         }
 
-        List<StoragePoolVO> clusterPools = _storagePoolDao.listPoolsByCluster(agent.getClusterId());
-        boolean hasNfs = false;
-        for (StoragePoolVO pool : clusterPools) {
-            if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
-                hasNfs = true;
-                break;
-            }
+        Status agentStatus = Status.Disconnected;
+        boolean hasNfs = isHostServedByNfsPool(agent);
+        if (hasNfs) {
+            agentStatus = checkAgentStatusViaNfs(agent);
+            s_logger.debug(String.format("Agent investigation was requested on host %s. Agent status via NFS heartbeat is %s.", agent, agentStatus));
+        } else {
+            s_logger.debug(String.format("Agent investigation was requested on host %s, but host has no NFS storage. Skipping investigation via NFS.", agent));
+        }
+
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(agent);
+        if (isKvmHaWebserviceEnabled) {
+            agentStatus = kvmHaHelper.checkAgentStatusViaKvmHaAgent(agent, agentStatus);
         }
+
+        return agentStatus;
+    }
+
+    private boolean isHostServedByNfsPool(Host agent) {
+        boolean hasNfs = hasNfsPoolClusterWideForHost(agent);
         if (!hasNfs) {
-            List<StoragePoolVO> zonePools = _storagePoolDao.findZoneWideStoragePoolsByHypervisor(agent.getDataCenterId(), agent.getHypervisorType());
-            for (StoragePoolVO pool : zonePools) {
-                if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
-                    hasNfs = true;
-                    break;
-                }
+            hasNfs = hasNfsPoolZoneWideForHost(agent);
+        }
+        return hasNfs;
+    }
+
+    private boolean hasNfsPoolZoneWideForHost(Host agent) {
+        List<StoragePoolVO> zonePools = _storagePoolDao.findZoneWideStoragePoolsByHypervisor(agent.getDataCenterId(), agent.getHypervisorType());
+        for (StoragePoolVO pool : zonePools) {
+            if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
+                return true;
             }
         }
-        if (!hasNfs) {
-            s_logger.warn(
-                    "Agent investigation was requested on host " + agent + ", but host does not support investigation because it has no NFS storage. Skipping investigation.");
-            return Status.Disconnected;
+        return false;

Review comment:
       Done!
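
For readers following the refactored `isAgentAlive`, the way the two checks combine can be sketched in isolation. This is a simplified model under stated assumptions, not the PR code: the `Status` enum is reduced to three values, the NFS and web-service results are passed in as plain parameters, and `AgentStatusSketch` and `investigate` are hypothetical names. The rule it models comes from the diff above and from `checkAgentStatusViaKvmHaAgent`: start at `Disconnected`, take the NFS heartbeat result if the host has NFS storage, and let a healthy KVM HA web service raise the result to `Up`; an unhealthy one leaves the status untouched.

```java
final class AgentStatusSketch {

    // Reduced stand-in for com.cloud.host.Status (assumption: only these values matter here).
    enum Status { Up, Down, Disconnected }

    static Status investigate(boolean hasNfs, Status nfsStatus,
                              boolean webserviceEnabled, boolean haAgentHealthy) {
        Status status = Status.Disconnected;   // default when no check can run
        if (hasNfs) {
            status = nfsStatus;                // result of the NFS heartbeat
        }
        if (webserviceEnabled && haAgentHealthy) {
            status = Status.Up;                // a healthy HA agent overrides the NFS result
        }
        return status;                         // an unhealthy HA agent keeps the previous status
    }

    public static void main(String[] args) {
        // No NFS and web service disabled: nothing to investigate with.
        System.out.println(investigate(false, null, false, false));
        // NFS heartbeat says Down, but the HA agent answers correctly.
        System.out.println(investigate(true, Status.Down, true, true));
    }
}
```

One consequence worth keeping in mind: with the web service enabled, a healthy answer from it wins even when the NFS heartbeat reports the host as down.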







[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-881619108


   @blueorangutan package





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865317286


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908435463


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 1062





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r640094589



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAConfig.java
##########
@@ -53,4 +53,17 @@
     public static final ConfigKey<Long> KvmHAFenceTimeout = new ConfigKey<>("Advanced", Long.class, "kvm.ha.fence.timeout", "60",
             "The maximum length of time, in seconds, expected for a fence operation to complete.", true, ConfigKey.Scope.Cluster);
 
+    public static final ConfigKey<Integer> KvmHaWebservicePort = new ConfigKey<Integer>("Advanced", Integer.class, "kvm.ha.webservice.port", "8080",
+            "It sets the port used to communicate with the KVM HA Agent Microservice that is running on KVM nodes. Default value is 8080.",
+            true, ConfigKey.Scope.Cluster);
+
+    public static final ConfigKey<Boolean> IsKvmHaWebserviceEnabled = new ConfigKey<Boolean>("Advanced", Boolean.class, "kvm.ha.webservice.enabled", "false",
+            "The KVM HA Webservice is executed on the KVM node and checks the amount of VMs running via libvirt. It serves as a HA health-check for KVM nodes. "
+                    + "One can enable (set to 'true') or disable it ('false'). If disabled then CloudStack ignores HA validation via this agent.",
+            true, ConfigKey.Scope.Cluster);
+
+    public static final ConfigKey<Double> KvmHaAcceptedProblematicHostsRatio = new ConfigKey<Double>("Advanced", Double.class, "kvm.ha.accepted.problematic.hosts.ratio", "0.3",
+            "The ratio of problematic Hosts accepted on a Cluster. If a cluster has more than the accepted ratio, HA will not be Fence/Recover Hosts and Admins will be notified to check the cluster healthy. "

Review comment:
       ```suggestion
               "The ratio of problematic Hosts accepted on a Cluster. If a cluster has more than the accepted ratio, HA will not Fence/Recover Hosts; instead, it will notify Admins to check the cluster healthy. "
   ```
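
To make the effect of `kvm.ha.accepted.problematic.hosts.ratio` concrete, here is a small worked sketch. It assumes the ratio is applied the way `KvmHaHelper` in this PR applies it, i.e. the accepted number of hosts is `(int) (hostsInCluster * ratio)` and the cluster counts as problematic only when the number of problematic hosts is strictly greater than that. The class and method names are hypothetical and exist only for illustration.

```java
final class ProblematicHostsThreshold {

    // Accepted number of problematic hosts; the cast truncates towards zero.
    static int acceptedProblematicHosts(int hostsInCluster, double acceptedRatio) {
        return (int) (hostsInCluster * acceptedRatio);
    }

    // True when HA should hold back from fencing/recovering and Admins should be notified.
    static boolean isClusterProblematic(int problematicHosts, int hostsInCluster, double acceptedRatio) {
        return problematicHosts > acceptedProblematicHosts(hostsInCluster, acceptedRatio);
    }

    public static void main(String[] args) {
        double ratio = 0.3; // default value of the setting
        System.out.println(acceptedProblematicHosts(10, ratio));  // 10 hosts
        System.out.println(acceptedProblematicHosts(5, ratio));   // 5 hosts
        System.out.println(acceptedProblematicHosts(3, ratio));   // 3 hosts
    }
}
```

With the default of 0.3, a 10-host cluster tolerates 3 problematic hosts and a 5-host cluster tolerates 1. Because of the truncation, a 3-host cluster tolerates 0, so under these assumptions a single problematic host already marks that cluster as problematic.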







[GitHub] [cloudstack] NuxRo commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
NuxRo commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-926631496


   @GabrielBrascher Sorry, it was just a typo on my part (going for host:port**;** for some reason instead of host:port .. what a waste of time).
   
   I'll try to get this tested without NFS and see if it works as it should.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870344180









[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r665702938



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final double PROBLEMATIC_HOSTS_RATIO_ACCEPTED = 0.3;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ratio of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.value());
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    LOGGER.debug(String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor));
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+
+        if (!isKvmHaAgentHealthy) {
+            if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
+                return true;
+            }
+        }
+
+        return isKvmHaAgentHealthy;

Review comment:
       Personally, I prefer _IFs_ to a _ternary_, at least in cases like this one where the ternary turns into a huge line. But that might be just me being used to the "old-fashioned" _IFs_ ... :thinking:.
   
   I have nothing against ternaries, though. I can definitely make the change if it results in cleaner code that is easier to read.
   
   1. From:
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
   
       if (!isKvmHaAgentHealthy) {
           if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
               return true;
           }
       }
       
       return isKvmHaAgentHealthy;
   }
   ```
   
   2. To:
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
       return !isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) ? true : isKvmHaAgentHealthy;
   }
   ```
   
   OR
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
       return !isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))
       ? true
       : isKvmHaAgentHealthy;
   }
   ```
   
   ... OR
   ```
   public boolean isKvmHealthyCheckViaLibvirt(Host host) {
       boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
       boolean isKvmHealthyCheckViaLibvirt = 
           !isKvmHaAgentHealthy && (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host))
                   ? true
                   : isKvmHaAgentHealthy;
       return isKvmHealthyCheckViaLibvirt;
   }
   ```
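
As a side check on the variants above, all of them appear to reduce to a plain logical OR of the three conditions. The sketch below models the three checks as precomputed booleans, which is an assumption made only for illustration: in the real method they are calls, and `||` short-circuits, so `isClusteProblematic` and `isHostAgentReachableByNeighbour` would still run only when the agent check fails. `HealthCheckEquivalence` and its method names are hypothetical.

```java
final class HealthCheckEquivalence {

    // Mirrors the nested-if version (option 1).
    static boolean nestedIf(boolean agentHealthy, boolean clusterProblematic, boolean reachable) {
        if (!agentHealthy) {
            if (clusterProblematic || reachable) {
                return true;
            }
        }
        return agentHealthy;
    }

    // Mirrors the ternary versions (option 2 and its reformatted variants).
    static boolean ternary(boolean agentHealthy, boolean clusterProblematic, boolean reachable) {
        return !agentHealthy && (clusterProblematic || reachable) ? true : agentHealthy;
    }

    // Candidate simplification: a single short-circuit OR.
    static boolean plainOr(boolean agentHealthy, boolean clusterProblematic, boolean reachable) {
        return agentHealthy || clusterProblematic || reachable;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        boolean allEqual = true;
        // Exhaustively compare the three forms over all 8 input combinations.
        for (boolean a : values) {
            for (boolean b : values) {
                for (boolean c : values) {
                    allEqual &= nestedIf(a, b, c) == plainOr(a, b, c);
                    allEqual &= ternary(a, b, c) == plainOr(a, b, c);
                }
            }
        }
        System.out.println("all forms agree: " + allEqual);
    }
}
```

If that equivalence holds for the real methods as well, the method body could be a single `return` with neither nested IFs nor a ternary.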







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-914909057


   @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634407719



##########
File path: plugins/hypervisors/kvm/src/test/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClientTest.java
##########
@@ -0,0 +1,278 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.http.HttpEntity;
+import org.apache.http.HttpResponse;
+import org.apache.http.HttpStatus;
+import org.apache.http.ProtocolVersion;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.entity.InputStreamEntity;
+import org.apache.http.message.BasicStatusLine;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.mockito.Mock;
+import org.mockito.Mockito;
+import org.mockito.junit.MockitoJUnitRunner;
+
+import com.cloud.host.HostVO;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.dao.VMInstanceDaoImpl;
+import com.google.gson.JsonArray;
+import com.google.gson.JsonElement;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+
+@RunWith(MockitoJUnitRunner.class)
+public class KvmHaAgentClientTest {
+
+    private static final int ERROR_CODE = -1;
+    private HostVO agent = Mockito.mock(HostVO.class);
+    private KvmHaAgentClient kvmHaAgentClient = Mockito.spy(new KvmHaAgentClient(agent));
+    private static final int DEFAULT_PORT = 8080;
+    private static final String PRIVATE_IP_ADDRESS = "1.2.3.4";
+    private static final String JSON_STRING_EXAMPLE_3VMs = "{\"count\":3,\"virtualmachines\":[\"r-123-VM\",\"v-134-VM\",\"s-111-VM\"]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_3VMs = 3;
+    private static final String JSON_STRING_EXAMPLE_0VMs = "{\"count\":0,\"virtualmachines\":[]}";
+    private static final int EXPECTED_RUNNING_VMS_EXAMPLE_0VMs = 0;
+    private static final String EXPECTED_URL = String.format("http://%s:%d", PRIVATE_IP_ADDRESS, DEFAULT_PORT);
+    private static final HttpRequestBase HTTP_REQUEST_BASE = new HttpGet(EXPECTED_URL);
+    private static final String VMS_COUNT = "count";
+    private static final String VIRTUAL_MACHINES = "virtualmachines";
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int KVM_HA_WEBSERVICE_PORT = 8080;
+
+    @Mock
+    HttpClient client;
+
+    @Mock
+    VMInstanceDaoImpl vmInstanceDao;
+
+    @Test
+    public void isKvmHaAgentHealthyTestAllGood() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestVMsDoNotMatchButDoNotReturnFalse() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 1);
+        Assert.assertTrue(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestExpectedRunningVmsButNoneListed() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, 0);
+        Assert.assertFalse(result);
+    }
+
+    @Test
+    public void isKvmHaAgentHealthyTestReceivedErrorCode() {
+        boolean result = isKvmHaAgentHealthyTests(EXPECTED_RUNNING_VMS_EXAMPLE_3VMs, ERROR_CODE);
+        Assert.assertFalse(result);
+    }
+
+    private boolean isKvmHaAgentHealthyTests(int expectedNumberOfVms, int vmsRunningOnAgent) {
+        List<VMInstanceVO> vmsOnHostList = new ArrayList<>();
+        for (int i = 0; i < expectedNumberOfVms; i++) {
+            VMInstanceVO vmInstance = Mockito.mock(VMInstanceVO.class);
+            vmsOnHostList.add(vmInstance);
+        }
+
+        Mockito.doReturn(vmsOnHostList).when(kvmHaAgentClient).listVmsOnHost(Mockito.any(), Mockito.any());
+        Mockito.doReturn(vmsRunningOnAgent).when(kvmHaAgentClient).countRunningVmsOnAgent();
+
+        return kvmHaAgentClient.isKvmHaAgentHealthy(agent, vmInstanceDao);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestNull() {
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(null);
+        Assert.assertNull(responseJson);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTest() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_3VMs, 3l);
+    }
+
+    @Test
+    public void processHttpResponseIntoJsonTestOtherJsonExample() throws IOException {
+        prepareAndTestProcessHttpResponseIntoJson(JSON_STRING_EXAMPLE_0VMs, 0l);
+    }
+
+    private void prepareAndTestProcessHttpResponseIntoJson(String jsonString, long expectedVmsCount) throws IOException {
+        CloseableHttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, jsonString);
+        JsonObject responseJson = kvmHaAgentClient.processHttpResponseIntoJson(mockedResponse);
+
+        Assert.assertNotNull(responseJson);
+        JsonElement jsonElementVmsCount = responseJson.get(VMS_COUNT);
+        JsonElement jsonElementVmsArray = responseJson.get(VIRTUAL_MACHINES);
+        JsonArray jsonArray = jsonElementVmsArray.getAsJsonArray();
+
+        Assert.assertEquals(expectedVmsCount, jsonArray.size());
+        Assert.assertEquals(expectedVmsCount, jsonElementVmsCount.getAsLong());
+        Assert.assertEquals(jsonString, responseJson.toString());
+    }
+
+    private CloseableHttpResponse mockResponse(int httpStatusCode, String jsonString) throws IOException {
+        BasicStatusLine basicStatusLine = new BasicStatusLine(new ProtocolVersion("HTTP", 1000, 123), httpStatusCode, "Status");
+        CloseableHttpResponse response = Mockito.mock(CloseableHttpResponse.class);
+        InputStream in = IOUtils.toInputStream(jsonString, StandardCharsets.UTF_8);
+        Mockito.when(response.getStatusLine()).thenReturn(basicStatusLine);
+        HttpEntity httpEntity = new InputStreamEntity(in);
+        Mockito.when(response.getEntity()).thenReturn(httpEntity);
+        return response;
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTest() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_3VMs, EXPECTED_RUNNING_VMS_EXAMPLE_3VMs);
+    }
+
+    @Test
+    public void countRunningVmsOnAgentTestBlankNoVmsListed() throws IOException {
+        prepareAndRunCountRunningVmsOnAgent(JSON_STRING_EXAMPLE_0VMs, EXPECTED_RUNNING_VMS_EXAMPLE_0VMs);
+    }
+
+    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+
+        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+
+        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+        Assert.assertEquals(expectedListedVms, result);
+    }
+
+    @Test
+    public void retryHttpRequestTest() throws IOException {
+        kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(1)).execute(Mockito.any());
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+    }
+
+    @Test
+    public void retryHttpRequestTestNullResponse() throws IOException {
+        Mockito.doReturn(null).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Assert.assertNull(response);
+    }
+
+    @Test
+    public void retryHttpRequestTestForbidden() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_FORBIDDEN, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestMultipleChoices() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_MULTIPLE_CHOICES, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestProcessing() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_PROCESSING, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestTimeout() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_GATEWAY_TIMEOUT, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestVersionNotSupported() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_HTTP_VERSION_NOT_SUPPORTED, true);
+    }
+
+    @Test
+    public void retryHttpRequestTestOk() throws IOException {
+        prepareAndRunRetryHttpRequestTest(HttpStatus.SC_OK, false);
+    }
+
+    private void prepareAndRunRetryHttpRequestTest(int scMultipleChoices, boolean expectNull) throws IOException {
+        HttpResponse mockedResponse = mockResponse(scMultipleChoices, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse response = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        if (expectNull) {
+            Assert.assertNull(response);
+        } else {
+            Assert.assertEquals(mockedResponse, response);
+        }
+    }
+
+    @Test
+    public void retryHttpRequestTestHttpOk() throws IOException {
+        HttpResponse mockedResponse = mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs);
+        Mockito.doReturn(mockedResponse).when(kvmHaAgentClient).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        HttpResponse result = kvmHaAgentClient.retryHttpRequest(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(kvmHaAgentClient, Mockito.times(1)).retryUntilGetsHttpResponse(Mockito.anyString(), Mockito.any(), Mockito.any());
+        Assert.assertEquals(mockedResponse, result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestOneIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs));
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNotNull(result);
+    }
+
+    @Test
+    public void retryUntilGetsHttpResponseTestTwoIOException() throws IOException {
+        Mockito.when(client.execute(HTTP_REQUEST_BASE)).thenThrow(IOException.class).thenThrow(IOException.class);
+        HttpResponse result = kvmHaAgentClient.retryUntilGetsHttpResponse(EXPECTED_URL, HTTP_REQUEST_BASE, client);
+        Mockito.verify(client, Mockito.times(MAX_REQUEST_RETRIES)).execute(Mockito.any());
+        Assert.assertNull(result);
+    }
+
+    @Test
+    public void isKvmHaWebserviceEnabledTestDefault() {
+        Assert.assertFalse(kvmHaAgentClient.isKvmHaWebserviceEnabled());
+    }
+
+    @Test
+    public void getKvmHaMicroservicePortValueTestDefault() {
+        Assert.assertEquals(KVM_HA_WEBSERVICE_PORT, kvmHaAgentClient.getKvmHaMicroservicePortValue());
+    }
+
+//    private void prepareAndRunCountRunningVmsOnAgent(String jsonStringExample, int expectedListedVms) throws IOException {
+//        Mockito.when(agent.getPrivateIpAddress()).thenReturn(PRIVATE_IP_ADDRESS);
+//        Mockito.doReturn(mockResponse(HttpStatus.SC_OK, JSON_STRING_EXAMPLE_3VMs)).when(kvmHaAgentClient).executeHttpRequest(EXPECTED_URL);
+//
+//        JsonObject jObject = new JsonParser().parse(jsonStringExample).getAsJsonObject();
+//        Mockito.doReturn(jObject).when(kvmHaAgentClient).processHttpResponseIntoJson(Mockito.any(HttpResponse.class));
+//
+//        int result = kvmHaAgentClient.countRunningVmsOnAgent();
+//        Assert.assertEquals(expectedListedVms, result);
+//    }
+//TODO
+//    @Test
+//    public void isTargetHostReachableTest() {
+//        kvmHaAgentClient.isTargetHostReachable(PRIVATE_IP_ADDRESS);
+//    }

Review comment:
       You are right, @sureshanaparti: this commit added some WIP code, so I have just converted the PR back to draft.







[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r643105521



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       - Should an exception be logged at `debug` level?
   - We could pass the exception as a parameter to the log call.
   - Do we need a Pokemon catch (a catch-all for `Exception`) here?
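
   A minimal sketch of the last two points, under stated assumptions. It is self-contained, so it uses `java.util.logging` rather than the log4j `s_logger` from the diff, and `AgentCommand`, `sendAndLog` and the choice of `IOException` are hypothetical stand-ins, not names from the PR. The idea shown is only this: catch the narrowest exception that is expected, and hand the exception object to the logger so the stack trace is not lost.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names come from the PR.
final class AgentCommandLoggingSketch {

    private static final Logger LOGGER = Logger.getLogger(AgentCommandLoggingSketch.class.getName());

    /** Stand-in for "send a command to the host agent". */
    interface AgentCommand {
        boolean send() throws IOException;
    }

    /** Returns the command result, or false when the command fails with an I/O error. */
    static boolean sendAndLog(String agentDescription, AgentCommand command) {
        try {
            return command.send();
        } catch (IOException e) {
            // The exception is passed as the last argument, so the logger records the stack trace.
            LOGGER.log(Level.WARNING, String.format("Failed to send command to %s", agentDescription), e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndLog("host-1", () -> true));
        System.out.println(sendAndLog("host-2", () -> {
            throw new IOException("connection refused");
        }));
    }
}
```

   With log4j 1.x, the equivalent call would pass the exception as the second argument to the logging method, for example `s_logger.warn(message, e)`.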

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -59,29 +68,62 @@
     @Inject
     private AgentManager agentMgr;
     @Inject
-    private PrimaryDataStoreDao storagePool;
-    @Inject
     private StorageManager storageManager;
     @Inject
+    private PrimaryDataStoreDao storagePool;
+    @Inject
     private ResourceManager resourceManager;
+    @Inject
+    private StoragePoolHostDao storagePoolHostDao;
+    @Inject
+    private KvmHaHelper kvmHaHelper;
+
+    private static final Set<Storage.StoragePoolType> NFS_POOL_TYPE = new HashSet<>(Arrays.asList(Storage.StoragePoolType.NetworkFilesystem, Storage.StoragePoolType.ManagedNFS));
 
     @Override
-    public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException {
+    public boolean isActive(Host host, DateTime suspectTime) throws HACheckerException {
         try {
-            return isVMActivtyOnHost(r, suspectTime);
+            return isVMActivtyOnHost(host, suspectTime);
         } catch (HACheckerException e) {
             //Re-throwing the exception to avoid poluting the 'HACheckerException' already thrown
             throw e;
-        } catch (Exception e){
-            String message = String.format("Operation timed out, probably the %s is not reachable.", r.toString());
+        } catch (Exception e) {
+            String message = String.format("Operation timed out, probably the %s is not reachable.", host.toString());
             LOG.warn(message, e);
             throw new HACheckerException(message, e);
         }
     }
 
     @Override
-    public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+    public boolean isHealthy(Host host) {
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(host);
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(host);
+
+        if(isHostServedByNfsPool) {
+            isHealthy = isHealthViaNfs(host);
+        }
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        if (kvmHaHelper.isKvmHealthyCheckViaLibvirt(host) && !isHealthy) {
+            isHealthy = true;

Review comment:
       We could just return `true` here (it is not a technical reason, just another point of view).
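
   A rough sketch of what the early return could look like. The hunk above is cut off after `isHealthy = true;`, so this assumes the method simply returns `isHealthy` afterwards; if the real method does more work after that line, the two variants are not interchangeable. The four boolean parameters are hypothetical stand-ins for the calls `isHostServedByNfsPool(host)`, `isHealthViaNfs(host)`, `kvmHaHelper.isKvmHaWebserviceEnabled(host)` and `kvmHaHelper.isKvmHealthyCheckViaLibvirt(host)`.

```java
// Hypothetical sketch; the tail of the real method is not visible in the diff.
final class HealthDecisionSketch {

    /** Mirrors the visible part of the diff, assuming it ends with "return isHealthy". */
    static boolean original(boolean servedByNfs, boolean healthyViaNfs,
                            boolean webserviceEnabled, boolean healthyViaLibvirt) {
        boolean isHealthy = true;
        if (servedByNfs) {
            isHealthy = healthyViaNfs;
        }
        if (!webserviceEnabled) {
            return isHealthy;
        }
        if (healthyViaLibvirt && !isHealthy) {
            isHealthy = true;
        }
        return isHealthy;
    }

    /** Same decision, returning as soon as the Libvirt check overrides a failed NFS check. */
    static boolean earlyReturn(boolean servedByNfs, boolean healthyViaNfs,
                               boolean webserviceEnabled, boolean healthyViaLibvirt) {
        boolean isHealthy = !servedByNfs || healthyViaNfs;
        if (!webserviceEnabled) {
            return isHealthy;
        }
        if (healthyViaLibvirt && !isHealthy) {
            return true;
        }
        return isHealthy;
    }

    public static void main(String[] args) {
        // NFS check failed, webservice enabled, Libvirt reports a healthy host.
        System.out.println(earlyReturn(true, false, true, true));
    }
}
```

   Under that assumption both variants agree for every input, so the choice is purely about readability, as the comment says.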

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -151,20 +193,34 @@ private boolean isVMActivtyOnHost(Host agent, DateTime suspectTime) throws HAChe
         if (agent.getHypervisorType() != Hypervisor.HypervisorType.KVM && agent.getHypervisorType() != Hypervisor.HypervisorType.LXC) {

Review comment:
       We could create a constant with these hypervisor types and check whether it contains the agent's hypervisor type.
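
   A small sketch of that idea, in the spirit of the `NFS_POOL_TYPE` constant added earlier in this file. The nested `HypervisorType` enum is a local stand-in so the example compiles on its own; it is not CloudStack's `Hypervisor.HypervisorType`, and `SUPPORTED_HYPERVISORS` and `isSupported` are hypothetical names.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; the enum below only imitates the real hypervisor type.
final class SupportedHypervisorSketch {

    enum HypervisorType { KVM, LXC, XenServer, VMware }

    /** The hypervisor types this checker is willing to handle. */
    static final Set<HypervisorType> SUPPORTED_HYPERVISORS =
            Collections.unmodifiableSet(EnumSet.of(HypervisorType.KVM, HypervisorType.LXC));

    /** Replaces the chained "!= KVM && != LXC" comparison with a single lookup. */
    static boolean isSupported(HypervisorType type) {
        return SUPPORTED_HYPERVISORS.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(isSupported(HypervisorType.KVM));
        System.out.println(isSupported(HypervisorType.VMware));
    }
}
```

   The guard in `isVMActivtyOnHost` would then read as a negated `contains` check, and supporting another hypervisor would only mean extending the constant.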

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,256 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    public int countRunningVmsOnAgent(Host host) {
+        String url = String.format("http://%s:%d", host.getPrivateIpAddress(), getKvmHaMicroservicePortValue(host));
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    protected int getKvmHaMicroservicePortValue(Host host) {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), host.getClusterId(), host));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs on state 'Starting' are not common to be at the host, therefore this method does not list them.
+     * However, there is still a probability of a VM in 'Starting' state be already listed on the KVM via '$virsh list',
+     * but that's not likely and thus it is not relevant for this very context.
+     */
+    public List<VMInstanceVO> listVmsOnHost(Host host) {
+        List<VMInstanceVO> listByHostAndStates = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running, VirtualMachine.State.Stopping, VirtualMachine.State.Migrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            long runningVMs = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Running).count();
+            long stoppingVms = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Stopping).count();
+            long migratingVms = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Migrating).count();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent(host);
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", host.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndStates.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndStates;
+    }
+
+    /**
+     *  Sends HTTP GET request from the host executing the KVM HA Agent webservice to a target Host (expected to also be running the KVM HA Agent).
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the request gets a HTTP_OK OR {"status": "Down"} if HTTP GET failed
+     */
+    public boolean isHostReachableByNeighbour(Host neighbour, Host target) {
+        String neighbourHostAddress = neighbour.getPrivateIpAddress();
+        String targetHostAddress = target.getPrivateIpAddress();
+        int port = getKvmHaMicroservicePortValue(neighbour);
+        String url = String.format("http://%s:%d/%s/%s:%d", neighbourHostAddress, port, CHECK_NEIGHBOUR, targetHostAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null)
+            return false;
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (isHttpStatusCodNotOk(statusCode)) {
+            LOGGER.error(
+                    String.format("Failed HTTP %s Request %s; the expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url, EXPECTED_HTTP_STATUS, statusCode));
+            return false;
+        }
+
+        String hostStatusFromJson = responseInJson.get(STATUS).getAsString();
+        return Status.Up.toString().equals(hostStatusFromJson);
+    }
+
+    protected boolean isHttpStatusCodNotOk(int statusCode) {
+        return statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    @Nullable
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}.
+     */
+    @Nullable
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (isHttpStatusCodNotOk(statusCode)) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    /**
+     * Retry HTTP Request until success or it reaches {@link #MAX_REQUEST_RETRIES} retries. It can return null.
+     */
+    @Nullable
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt <= MAX_REQUEST_RETRIES; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);

Review comment:
       We could pass the exception as a parameter to the log call.







[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r665762606



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final double PROBLEMATIC_HOSTS_RATIO_ACCEPTED = 0.3;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ration of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.value());
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    LOGGER.debug(String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor));
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+
+        if (!isKvmHaAgentHealthy) {
+            if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
+                return true;
+            }
+        }
+
+        return isKvmHaAgentHealthy;

Review comment:
       @GutoVeronezi your suggestion makes sense. I've updated the code removing the nested IF.
   Thanks for the review!
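
For readers following along, the flattened check could look roughly like the sketch below. This is not the code that was pushed to the PR; the class name and the three `BooleanSupplier` parameters are hypothetical stand-ins for `isKvmHaAgentHealthy`, `isClusteProblematic` and `isHostAgentReachableByNeighbour`, used only so the snippet compiles on its own.

```java
import java.util.function.BooleanSupplier;

final class FlattenedHealthCheck {

    /**
     * Same outcome as the nested IF in the diff: the host counts as healthy when the
     * agent check passes, or when the cluster is problematic, or when a neighbour
     * can still reach the host. Short-circuit evaluation keeps the original order,
     * so the cluster and neighbour checks only run when the agent check fails.
     */
    static boolean isHealthy(BooleanSupplier agentHealthy,
                             BooleanSupplier clusterProblematic,
                             BooleanSupplier reachableByNeighbour) {
        return agentHealthy.getAsBoolean()
                || clusterProblematic.getAsBoolean()
                || reachableByNeighbour.getAsBoolean();
    }

    public static void main(String[] args) {
        // Agent check fails, cluster is fine, but a neighbour reaches the host
        System.out.println(isHealthy(() -> false, () -> false, () -> true));
    }
}
```

Passing suppliers rather than plain booleans keeps the potentially slow neighbour check lazy, which is what the nested version achieved as well.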




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] sureshanaparti commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
sureshanaparti commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1005151325


   @blueorangutan package





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063018876


   @nvazquez rebased to the current main branch.
   PR is back to "ready to review"





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063126934


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-863660100


   @blueorangutan package 





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876137237


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] wido commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
wido commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-836766195


   > Hi, this PR looks very good.
   >
   > I have a question: what happens if the management network link is interrupted (for example by a switch failure) while the KVM storage network keeps running normally?
   >
   > Will this cause HA to be triggered for the VMs?
   >
   > I am worried that the image gets "double written", resulting in a corrupted image.
   
   You should always have Out-of-Band (OOB) management of the hypervisors configured:
   
   - IPMI
   - Redfish
   
   When the mgmt server wants to perform HA, it will fence off the host by power cycling it via OOB.
   
   This makes sure no VMs are still running on it, so they can safely be started on a different host.
   
   In addition we also have locking mechanisms:
   
   - file locks on QCOW2 when using NFS (modern Qemu required)
   - exclusive locking with RBD/Ceph on images





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-875997033


   @blueorangutan package





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-919743838


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 1256





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-927682543


   @wido @GabrielBrascher I agree we can't assume the mgmt server has SSH access; however:
   - we should explore options to implement this without introducing a new service (my main concern is from a security and upgrade point of view; a lot of people don't like non-essential services running on the hypervisor)
   - for example, (1) what if the admin wants to do some maintenance that requires stopping the agent; could your changes cause any side effects in that case? (2) systemd can be configured (probably already is?) to always start this service on boot and on crash/error
   - the agent has a stop command answer with which it can tell the mgmt server why it is stopping; that can be used intelligently to avoid HA-led migrations (I haven't checked, probably already is?)
   - if this new service is essential, can it be secured using CA-framework generated certificates so that at least the communication is validated (the simplest being that the server certificate was signed/created against the root CA cert)?
   - and a global setting/kill-switch for users who don't want/need this additional feature/service (e.g. NFS users?), disabled by default





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634421214



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,295 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK = "check";
+    private static final String UP = "Up";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running according to the Libvirt
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    /**
+     *  Executes ping command from the host executing the KVM HA Agent webservice to a target IP Address.
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the IP address is reachable OR {"status": "Down"} if could not ping the IP
+     */
+    protected boolean isTargetHostReachable(String ipAddress) {
+        int port = getKvmHaMicroservicePortValue();
+        String url = String.format("http://%s:%d/%s/%s:%d", agent.getPrivateIpAddress(), port, CHECK, ipAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return false;
+        }
+
+        return UP.equals(responseInJson.get(STATUS).getAsString());
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet present on the host, therefore this method does not list them.
+     * There is still a chance that a VM in the 'Starting' state is already listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  IF: <br>
+     *  (i) KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as it expected at least
+     *    2 VMs running but found none; fencing/recovering the host would avoid downtime to VMs in this case.<br>
+     *  (ii) KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true, logs WARN messages and avoids triggering HA recovery/fencing,
+     *    as this could be an inconsistency caused by a VM migration.<br>
+     *  (iii) the number of listed VMs differs from the expected one: returns true and logs WARN messages so Admins can monitor and react accordingly
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
+        int numberOfVmsOnAgent = countRunningVmsOnAgent();
+        if (numberOfVmsOnAgent < 0) {
+            LOGGER.error(String.format("KVM HA Agent health check failed, either the KVM Agent %s is unreachable or Libvirt validation failed.", agent));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        if (numberOfVmsOnHostAccordingToDb == numberOfVmsOnAgent) {
+            return true;
+        }
+        if (numberOfVmsOnAgent == 0 && numberOfVmsOnHostAccordingToDb > CAUTIOUS_MARGIN_OF_VMS_ON_HOST) {
+            // Return false as no running VMs were found although more than CAUTIOUS_MARGIN_OF_VMS_ON_HOST were expected; fencing/recovering the host would avoid downtime to VMs in this case.
+            // The cautious margin in the conditional avoids fencing/recovering hosts when there is one VM migrating to a host that had zero VMs.
+            // If there are more VMs than CAUTIOUS_MARGIN_OF_VMS_ON_HOST, the host should be treated as not healthy and the fencing/recovering process might be triggered.
+            LOGGER.warn(String.format("KVM HA Agent %s could not find VMs; it was expected to list %d VMs.", agent, numberOfVmsOnHostAccordingToDb));
+            LOGGER.warn(String.format("Host %s is not considered healthy and HA fencing/recovering process might be triggered.", agent.getName()));
+            return false;
+        }
+        // In order to have a less "aggressive" health-check, the KvmHaAgentClient will not return false; fencing/recovering could bring downtime to existing VMs
+        // Additionally, the inconsistency can also be due to jobs in progress to migrate/stop/start VMs
+        // Either way, WARN messages should be presented to Admins so they can look closely to what is happening on the host
+        LOGGER.warn(String.format("KVM HA Agent %s listed %d VMs; however, it was expected %d VMs.", agent, numberOfVmsOnAgent, numberOfVmsOnHostAccordingToDb));
+        return true;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            // Keep the retry result; otherwise a successful retry would still return null
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;
+    }
+
+    /**
+     * Re-executes the HTTP GET request until it gets a response or it reaches the maximum request retries {@link #MAX_REQUEST_RETRIES}
+     */
+    protected HttpResponse retryHttpRequest(String url, HttpRequestBase httpReq, HttpClient client) {
+        LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s]. Executing the request again.", httpReq.getMethod(), url));
+        HttpResponse response = retryUntilGetsHttpResponse(url, httpReq, client);
+
+        if (response == null) {
+            LOGGER.error(String.format("Failed to execute HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+            return response;
+        }
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES) {
+            LOGGER.error(
+                    String.format("Failed to get VMs information with a %s request to URL '%s'. The expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url,
+                            EXPECTED_HTTP_STATUS, statusCode));
+            return null;
+        }
+
+        LOGGER.debug(String.format("Successfully executed HTTP %s request [URL: %s].", httpReq.getMethod(), url));
+        return response;
+    }
+
+    protected HttpResponse retryUntilGetsHttpResponse(String url, HttpRequestBase httpReq, HttpClient client) {
+        for (int attempt = 1; attempt < MAX_REQUEST_RETRIES + 1; attempt++) {
+            try {
+                TimeUnit.SECONDS.sleep(WAIT_FOR_REQUEST_RETRY);
+                LOGGER.debug(String.format("Retry HTTP %s request [URL: %s], attempt %d/%d.", httpReq.getMethod(), url, attempt, MAX_REQUEST_RETRIES));
+                return client.execute(httpReq);
+            } catch (IOException | InterruptedException e) {
+                String errorMessage = String.format("Failed to execute HTTP %s request retry attempt %d/%d [URL: %s] due to exception %s",
+                        httpReq.getMethod(), attempt, MAX_REQUEST_RETRIES, url, e);
+                LOGGER.error(errorMessage);
+            }
+        }
+        return null;
+    }
+
+    /**
+     * Processes the response of the GET request as a JSON object.<br>
+     * Json example: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}<br><br>
+     *
+     * Note: this method can return NULL JsonObject in case HttpResponse is NULL.
+     */
+    protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
+        InputStream in;
+        String jsonString;
+        if (response == null) {
+            return null;
+        }
+        try {
+            in = response.getEntity().getContent();
+            BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
+            jsonString = streamReader.readLine();
+        } catch (UnsupportedOperationException | IOException e) {
+            throw new CloudRuntimeException("Failed to process response", e);
+        }
+
+        return new JsonParser().parse(jsonString).getAsJsonObject();

Review comment:
       @GutoVeronezi you mean something like this, right?
   ```
   public class KvmHaAgentClient {
       private static final JsonParser JSON_PARSER = new JsonParser();
       ...
       ...
       ...
       @Nullable
       protected JsonObject processHttpResponseIntoJson(HttpResponse response) {
           if (response == null) {
               return null;
           }
           try {
               InputStream in = response.getEntity().getContent();
               BufferedReader streamReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
               return JSON_PARSER.parse(streamReader.readLine()).getAsJsonObject();
           } catch (UnsupportedOperationException | IOException e) {
               throw new CloudRuntimeException("Failed to process response", e);
           }
       }
       ...
       ...
   }
   ```
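
One detail the snippet above leaves open: `readLine()` returns only the first line of the body, and the reader is never closed. Whether the webservice ever returns multi-line (for example pretty-printed) JSON is not stated in this thread, so the following is only a sketch under that assumption. It uses the standard library alone, and `ResponseBodyReader` and `readBody` are hypothetical names; the returned string would then be handed to the JSON parser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class ResponseBodyReader {

    /** Reads the whole stream as UTF-8 text and closes it, instead of reading only the first line. */
    static String readBody(InputStream in) {
        // try-with-resources closes the reader (and the wrapped stream) on every path
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read response body", e);
        }
    }

    public static void main(String[] args) {
        // A pretty-printed body that spans several lines
        String prettyPrinted = "{\n  \"count\": 3\n}";
        InputStream in = new ByteArrayInputStream(prettyPrinted.getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(in));
    }
}
```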







[GitHub] [cloudstack] blueorangutan removed a comment on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan removed a comment on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-876137237


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865316576


   @blueorangutan package





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r658238673



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -59,29 +68,63 @@
     @Inject
     private AgentManager agentMgr;
     @Inject
-    private PrimaryDataStoreDao storagePool;
-    @Inject
     private StorageManager storageManager;
     @Inject
+    private PrimaryDataStoreDao storagePool;
+    @Inject
     private ResourceManager resourceManager;
+    @Inject
+    private StoragePoolHostDao storagePoolHostDao;
+    @Inject
+    private KvmHaHelper kvmHaHelper;
+
+    private static final Set<Storage.StoragePoolType> NFS_POOL_TYPE = new HashSet<>(Arrays.asList(Storage.StoragePoolType.NetworkFilesystem, Storage.StoragePoolType.ManagedNFS));
+    private static final Set<Hypervisor.HypervisorType> KVM_OR_LXC = new HashSet<>(Arrays.asList(Hypervisor.HypervisorType.KVM, Hypervisor.HypervisorType.LXC));
 
     @Override
-    public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException {
+    public boolean isActive(Host host, DateTime suspectTime) throws HACheckerException {
         try {
-            return isVMActivtyOnHost(r, suspectTime);
+            return isVMActivtyOnHost(host, suspectTime);
         } catch (HACheckerException e) {
             //Re-throwing the exception to avoid poluting the 'HACheckerException' already thrown
             throw e;
-        } catch (Exception e){
-            String message = String.format("Operation timed out, probably the %s is not reachable.", r.toString());
+        } catch (Exception e) {
+            String message = String.format("Operation timed out, probably the %s is not reachable.", host.toString());
             LOG.warn(message, e);
             throw new HACheckerException(message, e);
         }
     }
 
     @Override
-    public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+    public boolean isHealthy(Host host) {
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(host);
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(host);
+
+        if (isHostServedByNfsPool) {
+            isHealthy = isHealthViaNfs(host);
+        }
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        if (kvmHaHelper.isKvmHealthyCheckViaLibvirt(host) && !isHealthy) {
+            return true;
+        }
+
+        return isHealthy;
+    }
+
+    private boolean isHealthViaNfs(Host r) {
+        boolean isHealthy = true;
+        if (isHostServedByNfsPool(r)) {
+            isHealthy = isAgentActive(r);
+            if (!isHealthy) {
+                LOG.warn(String.format("NFS storage health check failed for %s. It seems that a storage does not have activity.", r.toString()));
+            }
+        }

Review comment:
       Nice! I like the idea @GutoVeronezi. I will enhance this part and make sure the flows are covered with relevant logs.
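
The suggestion being answered is not quoted in this message, so the following is only a guess at what logging each flow could look like. It mirrors the decision logic of `isHealthy` in the diff above as plain booleans; the class name, the parameter names and the use of `java.util.logging` are assumptions made so the sketch runs on its own.

```java
import java.util.logging.Logger;

final class HostHealthFlows {
    private static final Logger LOG = Logger.getLogger(HostHealthFlows.class.getName());

    /**
     * servedByNfs       : host has an NFS primary storage pool
     * nfsHealthy        : result of the NFS heartbeat check (only meaningful when servedByNfs)
     * webserviceEnabled : the KVM HA webservice check is switched on
     * libvirtHealthy    : result of the Libvirt-based check via the webservice
     */
    static boolean isHealthy(boolean servedByNfs, boolean nfsHealthy, boolean webserviceEnabled, boolean libvirtHealthy) {
        // Without NFS there is no heartbeat to fail, so the starting point is "healthy"
        boolean healthy = !servedByNfs || nfsHealthy;

        if (!webserviceEnabled) {
            LOG.fine("KVM HA webservice disabled; relying on the NFS heartbeat result: " + healthy);
            return healthy;
        }
        if (!healthy && libvirtHealthy) {
            LOG.warning("NFS heartbeat failed but the Libvirt check passed; host is treated as healthy.");
            return true;
        }
        LOG.fine("NFS heartbeat result kept as the final health status: " + healthy);
        return healthy;
    }

    public static void main(String[] args) {
        // NFS heartbeat failed, webservice enabled, Libvirt check passed
        System.out.println(isHealthy(true, false, true, true));
    }
}
```

Writing the flows out this way also shows that, in this revision of the diff, a host without NFS storage is reported healthy even when the Libvirt check fails, because `isHealthy` starts as `true`. The thread does not say whether that is intended.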







[GitHub] [cloudstack] nvazquez commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r830307629



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,346 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
+import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.http.impl.client.HttpClients;
+import org.apache.http.ssl.SSLContexts;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import javax.net.ssl.SSLContext;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.security.KeyManagementException;
+import java.security.KeyStoreException;
+import java.security.NoSuchAlgorithmException;
+import java.util.Base64;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+    static final String HTTP_PROTOCOL = "http";
+    static final String HTTPS_PROTOCOL = "https";
+    private final static String APPLICATION_JSON = "application/json";
+    private final static String ACCEPT = "accept";
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;

Review comment:
       I think some of the logic in this class could be moved to another class (maybe KvmHaHelper), along with the DB access, keeping the client class solely for interacting with the HA agent.







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865673337


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_multiplication_x: debian. SL-JID 320





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870560977


   @rhtyd I will check the CentOS packaging; it still fails to build.
   The Deb packages for the new ha-helper service build successfully.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899177887


   @nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests





[GitHub] [cloudstack] rhtyd commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r708903698



##########
File path: debian/control
##########
@@ -56,3 +56,10 @@ Package: cloudstack-integration-tests
 Architecture: all
 Depends: ${misc:Depends}, cloudstack-marvin (= ${source:Version})
 Description: The CloudStack Marvin integration tests
+
+Package: cloudstack-agent-ha-helper

Review comment:
       @GabrielBrascher we'll need something similar for the CentOS 7, CentOS 8 and SUSE packaging.
   Can you explain why we need a separate process? Couldn't the agent do this, or a thread in LibvirtComputingResource? It also looks like a potential security issue: is the service port (8080) secured and authenticated?







[GitHub] [cloudstack] rhtyd commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r708906048



##########
File path: scripts/vm/hypervisor/kvm/agent-ha-helper.py
##########
@@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import logging
+import libvirt
+import socket
+import json
+import requests
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+log_folder = "/var/log/cloudstack/agent/"
+log_path = "/var/log/cloudstack/agent/agent-ha-helper.log"
+root_path = "/"
+check_path = "/check-neighbour/"
+http_ok = 200
+http_multiple_choices = 300
+http_not_found = 404
+
+class Libvirt():
+    def __init__(self):
+        self.conn = libvirt.openReadOnly("qemu:///system")
+        if not self.conn:
+            raise Exception('Failed to open connection to libvirt')
+
+    def running_vms(self):
+        alldomains = [domain for domain in map(self.conn.lookupByID, self.conn.listDomainsID())]
+
+        domains = []
+        for domain in alldomains:
+            if domain.info()[0] == libvirt.VIR_DOMAIN_RUNNING:
+                domains.append(domain.name())
+            elif domain.info()[0] == libvirt.VIR_DOMAIN_PAUSED:
+                domains.append(domain.name())
+
+        self.conn.close()
+
+        return domains
+
+class HTTPServerV6(HTTPServer):
+    address_family = socket.AF_INET6
+
+class CloudStackAgentHAHelper(BaseHTTPRequestHandler):
+    def do_GET(self):
+        if self.path == root_path:
+            libvirt = Libvirt()
+
+            running_vms = libvirt.running_vms()

Review comment:
       Can't agents get this information via the management server, or by connecting to the neighbour's libvirtd process directly (over ssh/tcp/ssl)?







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-866317460


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 325





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-926607182


   @NuxRo `http://10.0.33.2:8080/check-neighbour/10.0.34.165:8080` should return `Up` if the helper is running.
   
   ```
   ~$ curl http://kvm1:8080/check-neighbour/kvm2:8080
   {"status": "Up"}
   ```
   Note that this reflects the KVM HA helper that we've added, not the agent. It can therefore return `Down` while the `cloudstack-agent` is `Up` if the KVM HA helper is down; conversely, it can return `Up` while the `cloudstack-agent` is `Down` if the KVM HA helper is up.
   
   Can you please share a few more details about your test so I can reproduce it?
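
The URL shape and the `Up`/`Down` semantics described above can be sketched as follows. This is a minimal illustration, not the helper's or the management server's actual code: both function names are hypothetical, and the default port 8080 is simply the value used in the example above.

```python
import json


def check_neighbour_url(neighbour: str, target: str, port: int = 8080) -> str:
    """Build the URL shape shown in the comment above (hypothetical helper)."""
    return f"http://{neighbour}:{port}/check-neighbour/{target}:{port}"


def helper_reports_up(body: str) -> bool:
    """Return True only when the JSON body carries status "Up"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # An unparsable body is treated as "not Up" in this sketch.
        return False
    return isinstance(payload, dict) and payload.get("status") == "Up"
```

A `True` result here only says that the target's KVM HA helper answered its neighbour; it says nothing about the state of the `cloudstack-agent` process on that host.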





[GitHub] [cloudstack] sureshanaparti commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
sureshanaparti commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r634363635



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -81,7 +98,63 @@ public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException
 
     @Override
     public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(r);
+        boolean isKvmHaWebserviceEnabled = isKvmHaWebserviceEnabled(r);
+
+        isHealthy = isHealthViaNfs(r);
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        //TODO

Review comment:
       There is an empty TODO here; it may not be required.







[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r657921262



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,256 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.host.Status;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import javax.inject.Inject;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ * <br>
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to Libvirt.
+ * <br>
+ * This way, KVM HA can verify, via Libvirt, VMs status with an HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final String STATUS = "status";
+    private static final String CHECK_NEIGHBOUR = "check-neighbour";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final JsonParser JSON_PARSER = new JsonParser();
+
+    @Inject
+    private VMInstanceDao vmInstanceDao;
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to Libvirt.
+     */
+    public int countRunningVmsOnAgent(Host host) {
+        String url = String.format("http://%s:%d", host.getPrivateIpAddress(), getKvmHaMicroservicePortValue(host));
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    protected int getKvmHaMicroservicePortValue(Host host) {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.",
+                    KVMHAConfig.KvmHaWebservicePort.defaultValue(), host.getClusterId(), host));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs on state 'Starting' are not common to be at the host, therefore this method does not list them.
+     * However, there is still a probability of a VM in 'Starting' state be already listed on the KVM via '$virsh list',
+     * but that's not likely and thus it is not relevant for this very context.
+     */
+    public List<VMInstanceVO> listVmsOnHost(Host host) {
+        List<VMInstanceVO> listByHostAndStates = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running, VirtualMachine.State.Stopping, VirtualMachine.State.Migrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            long runningVMs = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Running).count();
+            long stoppingVms = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Stopping).count();
+            long migratingVms = listByHostAndStates.stream().filter(vm -> vm.getState() == VirtualMachine.State.Migrating).count();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent(host);
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", host.getName(), startingVMs, runningVMs,
+                            stoppingVms, migratingVms, listByHostAndStates.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndStates;
+    }
+
+    /**
+     *  Sends HTTP GET request from the host executing the KVM HA Agent webservice to a target Host (expected to also be running the KVM HA Agent).
+     *  The webserver serves a JSON Object such as {"status": "Up"} if the request gets a HTTP_OK OR {"status": "Down"} if HTTP GET failed
+     */
+    public boolean isHostReachableByNeighbour(Host neighbour, Host target) {
+        String neighbourHostAddress = neighbour.getPrivateIpAddress();
+        String targetHostAddress = target.getPrivateIpAddress();
+        int port = getKvmHaMicroservicePortValue(neighbour);
+        String url = String.format("http://%s:%d/%s/%s:%d", neighbourHostAddress, port, CHECK_NEIGHBOUR, targetHostAddress, port);
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return false;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null)
+            return false;
+
+        int statusCode = response.getStatusLine().getStatusCode();
+        if (isHttpStatusCodNotOk(statusCode)) {
+            LOGGER.error(
+                    String.format("Failed HTTP %s Request %s; the expected HTTP status code is '%s' but it got '%s'.", HttpGet.METHOD_NAME, url, EXPECTED_HTTP_STATUS, statusCode));
+            return false;
+        }
+
+        String hostStatusFromJson = responseInJson.get(STATUS).getAsString();
+        return Status.Up.toString().equals(hostStatusFromJson);
+    }
+
+    protected boolean isHttpStatusCodNotOk(int statusCode) {
+        return statusCode < HttpStatus.SC_OK || statusCode >= HttpStatus.SC_MULTIPLE_CHOICES;
+    }
+
+    /**
+     * Executes a GET request for the given URL address.
+     */
+    @Nullable
+    protected HttpResponse executeHttpRequest(String url) {
+        HttpGet httpReq = prepareHttpRequestForUrl(url);
+        if (httpReq == null) {
+            return null;
+        }
+
+        HttpClient client = HttpClientBuilder.create().build();
+        HttpResponse response = null;
+        try {
+            response = client.execute(httpReq);
+        } catch (IOException e) {
+            if (MAX_REQUEST_RETRIES == 0) {
+                LOGGER.warn(String.format("Failed to execute HTTP %s request [URL: %s] due to exception %s.", httpReq.getMethod(), url, e), e);
+                return null;
+            }
+            response = retryHttpRequest(url, httpReq, client);
+        }
+        return response;
+    }
+
+    @Nullable
+    private HttpGet prepareHttpRequestForUrl(String url) {
+        HttpGet httpReq = null;
+        try {
+            URIBuilder builder = new URIBuilder(url);
+            httpReq = new HttpGet(builder.build());
+        } catch (URISyntaxException e) {
+            LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
+            return null;
+        }
+        return httpReq;

Review comment:
       ```suggestion
           try {
               URIBuilder builder = new URIBuilder(url);
               return new HttpGet(builder.build());
           } catch (URISyntaxException e) {
               LOGGER.error(String.format("Failed to create URI for GET request [URL: %s] due to exception.", url), e);
               return null;
           }
   ```

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -59,29 +68,63 @@
     @Inject
     private AgentManager agentMgr;
     @Inject
-    private PrimaryDataStoreDao storagePool;
-    @Inject
     private StorageManager storageManager;
     @Inject
+    private PrimaryDataStoreDao storagePool;
+    @Inject
     private ResourceManager resourceManager;
+    @Inject
+    private StoragePoolHostDao storagePoolHostDao;
+    @Inject
+    private KvmHaHelper kvmHaHelper;
+
+    private static final Set<Storage.StoragePoolType> NFS_POOL_TYPE = new HashSet<>(Arrays.asList(Storage.StoragePoolType.NetworkFilesystem, Storage.StoragePoolType.ManagedNFS));
+    private static final Set<Hypervisor.HypervisorType> KVM_OR_LXC = new HashSet<>(Arrays.asList(Hypervisor.HypervisorType.KVM, Hypervisor.HypervisorType.LXC));
 
     @Override
-    public boolean isActive(Host r, DateTime suspectTime) throws HACheckerException {
+    public boolean isActive(Host host, DateTime suspectTime) throws HACheckerException {
         try {
-            return isVMActivtyOnHost(r, suspectTime);
+            return isVMActivtyOnHost(host, suspectTime);
         } catch (HACheckerException e) {
             //Re-throwing the exception to avoid poluting the 'HACheckerException' already thrown
             throw e;
-        } catch (Exception e){
-            String message = String.format("Operation timed out, probably the %s is not reachable.", r.toString());
+        } catch (Exception e) {
+            String message = String.format("Operation timed out, probably the %s is not reachable.", host.toString());
             LOG.warn(message, e);
             throw new HACheckerException(message, e);
         }
     }
 
     @Override
-    public boolean isHealthy(Host r) {
-        return isAgentActive(r);
+    public boolean isHealthy(Host host) {
+        boolean isHealthy = true;
+        boolean isHostServedByNfsPool = isHostServedByNfsPool(host);
+        boolean isKvmHaWebserviceEnabled = kvmHaHelper.isKvmHaWebserviceEnabled(host);
+
+        if (isHostServedByNfsPool) {
+            isHealthy = isHealthViaNfs(host);
+        }
+
+        if (!isKvmHaWebserviceEnabled) {
+            return isHealthy;
+        }
+
+        if (kvmHaHelper.isKvmHealthyCheckViaLibvirt(host) && !isHealthy) {
+            return true;
+        }
+
+        return isHealthy;
+    }
+
+    private boolean isHealthViaNfs(Host r) {
+        boolean isHealthy = true;
+        if (isHostServedByNfsPool(r)) {
+            isHealthy = isAgentActive(r);
+            if (!isHealthy) {
+                LOG.warn(String.format("NFS storage health check failed for %s. It seems that a storage does not have activity.", r.toString()));
+            }
+        }

Review comment:
       ```suggestion
       private boolean isHealthViaNfs(Host r) {
           if (!isHostServedByNfsPool(r)) {
               return true;
           }
   
           boolean isHealthy = isAgentActive(r);
           if (!isHealthy) {
               LOG.warn(String.format("NFS storage health check failed for %s. It seems that a storage does not have activity.", r.toString()));
           }
   ```

Review comment:
       This was suggested to avoid the nested `if`. Maybe we could also add a log to the first return, like `...host is not served by an NFS pool...`.
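
As a rough illustration of the guard-clause shape with a log on each return path, here is a sketch in Python rather than the PR's Java. The function name, its parameters, the logger name and the log wording are all hypothetical; the booleans stand in for `isHostServedByNfsPool` and `isAgentActive`.

```python
import logging

LOG = logging.getLogger("kvm-ha-sketch")  # hypothetical logger name


def is_healthy_via_nfs(host_name: str, served_by_nfs: bool, agent_active: bool) -> bool:
    """Guard-clause form of the suggestion, logging on both return paths."""
    if not served_by_nfs:
        # Early return: there is nothing to check, so say so explicitly.
        LOG.debug("Skipping NFS health check: %s is not served by an NFS pool.", host_name)
        return True

    if not agent_active:
        LOG.warning("NFS storage health check failed for %s.", host_name)
    return agent_active
```

With the early return logged, a reader of the logs can tell "NFS check skipped" apart from "NFS check passed", which the nested version does not make visible.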

##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaHelper.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.dc.ClusterVO;
+import com.cloud.dc.dao.ClusterDao;
+import com.cloud.host.Host;
+import com.cloud.host.HostVO;
+import com.cloud.host.Status;
+import com.cloud.resource.ResourceManager;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.NotNull;
+
+import javax.inject.Inject;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * This class provides methods that help the KVM HA process on checking hosts status as well as deciding if a host should be fenced/recovered or not.
+ */
+public class KvmHaHelper {
+
+    @Inject
+    protected ResourceManager resourceManager;
+    @Inject
+    protected KvmHaAgentClient kvmHaAgentClient;
+    @Inject
+    protected ClusterDao clusterDao;
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaHelper.class);
+    private static final double PROBLEMATIC_HOSTS_RATIO_ACCEPTED = 0.3;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+
+    private static final Set<Status> PROBLEMATIC_HOST_STATUS = new HashSet<>(Arrays.asList(Status.Alert, Status.Disconnected, Status.Down, Status.Error));
+
+    /**
+     * It checks the KVM node status via KVM HA Agent.
+     * If the agent is healthy it returns Status.Up, otherwise it keeps the provided Status as it is.
+     */
+    public Status checkAgentStatusViaKvmHaAgent(Host host, Status agentStatus) {
+        boolean isVmsCountOnKvmMatchingWithDatabase = isKvmHaAgentHealthy(host);
+        if (isVmsCountOnKvmMatchingWithDatabase) {
+            agentStatus = Status.Up;
+            LOGGER.debug(String.format("Checking agent %s status; KVM HA Agent is Running as expected.", agentStatus));
+        } else {
+            LOGGER.warn(String.format("Checking agent %s status. Failed to check host status via KVM HA Agent", agentStatus));
+        }
+        return agentStatus;
+    }
+
+    /**
+     * Given a List of Hosts, it lists Hosts that are in the following states:
+     * <ul>
+     *  <li> Status.Alert;
+     *  <li> Status.Disconnected;
+     *  <li> Status.Down;
+     *  <li> Status.Error.
+     * </ul>
+     */
+    @NotNull
+    protected List<HostVO> listProblematicHosts(List<HostVO> hostsInCluster) {
+        return hostsInCluster.stream().filter(neighbour -> PROBLEMATIC_HOST_STATUS.contains(neighbour.getStatus())).collect(Collectors.toList());
+    }
+
+    /**
+     * Returns false if the cluster has no problematic hosts or a small fraction of it.<br><br>
+     * Returns true if the cluster is problematic. A cluster is problematic if many hosts are in Down or Disconnected states, in such case it should not recover/fence.<br>
+     * Instead, Admins should be warned and check as it could be networking problems and also might not even have resources capacity on the few Healthy hosts at the cluster.
+     * <br><br>
+     * Admins can change the accepted ratio of problematic hosts via global settings by updating configuration: "kvm.ha.accepted.problematic.hosts.ratio".
+     */
+    protected boolean isClusteProblematic(Host host) {
+        List<HostVO> hostsInCluster = resourceManager.listAllHostsInCluster(host.getClusterId());
+        List<HostVO> problematicNeighbors = listProblematicHosts(hostsInCluster);
+        int problematicHosts = problematicNeighbors.size();
+        int problematicHostsRatioAccepted = (int) (hostsInCluster.size() * KVMHAConfig.KvmHaAcceptedProblematicHostsRatio.value());
+
+        if (problematicHosts > problematicHostsRatioAccepted) {
+            ClusterVO cluster = clusterDao.findById(host.getClusterId());
+            LOGGER.warn(String.format("%s is problematic but HA will not fence/recover due to its cluster [id: %d, name: %s] containing %d problematic hosts (Down, Disconnected, "
+                            + "Alert or Error states). Maximum problematic hosts accepted for this cluster is %d.",
+                    host, cluster.getId(), cluster.getName(), problematicHosts, problematicHostsRatioAccepted));
+            return true;
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the given Host KVM-HA-Helper is reachable by another host in the same cluster.
+     */
+    protected boolean isHostAgentReachableByNeighbour(Host host) {
+        List<HostVO> neighbors = resourceManager.listHostsInClusterByStatus(host.getClusterId(), Status.Up);
+        for (HostVO neighbor : neighbors) {
+            boolean isVmActivtyOnNeighborHost = isKvmHaAgentHealthy(neighbor);
+            if (isVmActivtyOnNeighborHost) {
+                boolean isReachable = kvmHaAgentClient.isHostReachableByNeighbour(neighbor, host);
+                if (isReachable) {
+                    LOGGER.debug(String.format("%s is reachable by neighbour %s. If CloudStack is failing to reach the respective host then it is probably a network issue between the host "
+                            + "and CloudStack management server.", host, neighbor));
+                    return true;
+                }
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns true if the host is healthy. The health-check is performed via HTTP GET request to a service that retrieves Running KVM instances via Libvirt. <br>
+     * The health-check is executed on the KVM node and verifies the amount of VMs running and if the Libvirt service is running.
+     */
+    public boolean isKvmHealthyCheckViaLibvirt(Host host) {
+        boolean isKvmHaAgentHealthy = isKvmHaAgentHealthy(host);
+
+        if (!isKvmHaAgentHealthy) {
+            if (isClusteProblematic(host) || isHostAgentReachableByNeighbour(host)) {
+                return true;
+            }
+        }
+
+        return isKvmHaAgentHealthy;

Review comment:
       We could use a ternary on return here.
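
One way the suggestion could look, as a sketch only. The class name `HealthCheckSketch` and the `BooleanSupplier` parameters are hypothetical stand-ins for `isClusteProblematic(host)` and `isHostAgentReachableByNeighbour(host)`; none of this is taken from the PR.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the PR's code: the suppliers stand in for
// isClusteProblematic(host) and isHostAgentReachableByNeighbour(host).
class HealthCheckSketch {

    // Ternary form of the nested if: a healthy agent returns true right away;
    // otherwise the host is still reported as healthy when the cluster is
    // problematic or a neighbour can reach it.
    static boolean isHealthy(boolean agentHealthy, BooleanSupplier clusterProblematic, BooleanSupplier reachableByNeighbour) {
        return agentHealthy ? true : (clusterProblematic.getAsBoolean() || reachableByNeighbour.getAsBoolean());
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(false, () -> false, () -> true));
    }
}
```

The same result could also be written without the ternary as `agentHealthy || clusterProblematic.getAsBoolean() || reachableByNeighbour.getAsBoolean()`, which likewise runs the fallback checks only when the agent check fails.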




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865316576


   @blueorangutan package





[GitHub] [cloudstack] nvazquez commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899177843


   @blueorangutan test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r647002508



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -151,20 +193,34 @@ private boolean isVMActivtyOnHost(Host agent, DateTime suspectTime) throws HAChe
         if (agent.getHypervisorType() != Hypervisor.HypervisorType.KVM && agent.getHypervisorType() != Hypervisor.HypervisorType.LXC) {

Review comment:
       Thanks for the suggestion, I like this idea. Done!







[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655677278



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GutoVeronezi I decided to remove this catch.
   When checking the easySend there are already enough catches. If it does not catch the exception ... I don't know what would catch it:
   
   ```
   public Answer easySend(final Long hostId, final Command cmd) {
           try {
                   ...
                   ...
                   ...
           } catch (final AgentUnavailableException e) {
               s_logger.warn(e.getMessage());
               return null;
           } catch (final OperationTimedoutException e) {
               s_logger.warn("Operation timed out: " + e.getMessage());
               return null;
           } catch (final Exception e) {
               s_logger.warn("Exception while sending", e);
               return null;
           }
   ```
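
To illustrate the reasoning above, here is a minimal sketch of a caller that relies on a null check instead of its own catch. `EasySendSketch`, `Sender`, the simplified `Answer`, and the `Status.Unknown` fallback are all hypothetical stand-ins; they are not the CloudStack classes and the real `isAgentAlive` may treat a null answer differently.

```java
// Hypothetical sketch: Answer, Status and easySend are simplified stand-ins,
// not the CloudStack classes.
class EasySendSketch {
    enum Status { Up, Down, Unknown }

    static final class Answer {
        private final boolean result;
        Answer(boolean result) { this.result = result; }
        boolean getResult() { return result; }
    }

    interface Sender {
        Answer send() throws Exception;
    }

    // Mirrors the quoted easySend: every exception is swallowed and reported as null.
    static Answer easySend(Sender sender) {
        try {
            return sender.send();
        } catch (Exception e) {
            return null;
        }
    }

    // The caller only needs a null check, not another try/catch.
    // Status.Unknown is an assumed fallback for this sketch.
    static Status statusOf(Sender sender) {
        Answer answer = easySend(sender);
        if (answer == null) {
            return Status.Unknown;
        }
        return answer.getResult() ? Status.Down : Status.Up;
    }
}
```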







[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899835219


   For reference, the tests that failed are:
   
   1. test_hostha_enable_ha_when_host_disabled
   2. test_hostha_enable_ha_when_host_in_maintenance
   
   I am still checking why they are failing and if it is related to this specific PR.
   
   Logs:
   ```
   testcase classname="tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_configure_default_driver" time="0.719"
   "tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_disabled" time="1.115" "marvin.cloudstackException.CloudstackAPIException" message="Execute cmd: updatehost failed, due to: errorCode: 530, errorText:Failed to update host:2,No next resource state found for current state = Maintenance event = Disable":
     File "/usr/lib64/python3.6/unittest/case.py", line 60, in testPartExecutor
       yield
     File "/usr/lib64/python3.6/unittest/case.py", line 622, in run
       testMethod()
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 259, in test_hostha_enable_ha_when_host_disabled
       self.disableHost(self.host.id)
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 609, in disableHost
       response = self.apiclient.updateHost(cmd)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackAPI/cloudstackAPIClient.py", line 915, in updateHost
       response = self.connection.marvinRequest(command, response_type=response, method=method)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 381, in marvinRequest
       raise e
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 376, in marvinRequest
       raise self.__lastError
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 310, in __parseAndGetResponse
       response_cls)
     File "/usr/local/lib/python3.6/site-packages/marvin/jsonHelper.py", line 155, in getResultObj
       raise cloudstackException.CloudstackAPIException(respname, errMsg)
   marvin.cloudstackException.CloudstackAPIException: Execute cmd: updatehost failed, due to: errorCode: 530, errorText:Failed to update host:2,No next resource state found for current state = Maintenance event = Disable
   === TestName: test_hostha_enable_ha_when_host_disabled | Status : EXCEPTION ===
   
   ------------------------
   
   "tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_disconected" time="14.522"
   "checkForState:: expected=Ineligible, actual={haenable : True, hastate : 'Ineligible', haprovider : 'kvmhaprovider'}
   ]]></system-out></testcase><testcase classname="tests.smoke.test_hostha_kvm.TestHAKVM" name="test_hostha_enable_ha_when_host_in_maintenance" time="303.923" message="Job failed: {accountid : '5e5de944-fe40-11eb-9d50-1e003b000428', userid : '5e5edcd4-fe40-11eb-9d50-1e003b000428', cmd : 'org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd', jobstatus : 2, jobprocstatus : 0, jobresultcode : 530, jobresulttype : 'object', jobresult : {errorcode : 530, errortext : 'Failed to prepare host for maintenance due to: Host is already in state Maintenance. Cannot recall for maintenance until resolved.'}, jobinstancetype : 'Host', jobinstanceid : '23cb0154-392a-4af3-80e2-3bf0d24bc341', created : '2021-08-16T13:19:01+0000', completed : '2021-08-16T13:19:01+0000', jobid : '946cb1c8-37cb-4110-b5ea-529da895d05e':
     File "/usr/lib64/python3.6/unittest/case.py", line 60, in testPartExecutor
       yield
     File "/usr/lib64/python3.6/unittest/case.py", line 622, in run
       testMethod()
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 285, in test_hostha_enable_ha_when_host_in_maintenance
       self.setHostToMaintanance(self.host.id)
     File "/marvin/tests/smoke/test_hostha_kvm.py", line 623, in setHostToMaintanance
       response = self.apiclient.prepareHostForMaintenance(cmd)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackAPI/cloudstackAPIClient.py", line 2435, in prepareHostForMaintenance
       response = self.connection.marvinRequest(command, response_type=response, method=method)
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 381, in marvinRequest
       raise e
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 376, in marvinRequest
       raise self.__lastError
     File "/usr/local/lib/python3.6/site-packages/marvin/cloudstackConnection.py", line 105, in __poll
       % async_response)
   Exception: Job failed: {accountid : '5e5de944-fe40-11eb-9d50-1e003b000428', userid : '5e5edcd4-fe40-11eb-9d50-1e003b000428', cmd : 'org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd', jobstatus : 2, jobprocstatus : 0, jobresultcode : 530, jobresulttype : 'object', jobresult : {errorcode : 530, errortext : 'Failed to prepare host for maintenance due to: Host is already in state Maintenance. Cannot recall for maintenance until resolved.'}, jobinstancetype : 'Host', jobinstanceid : '23cb0154-392a-4af3-80e2-3bf0d24bc341', created : '2021-08-16T13:19:01+0000', completed : '2021-08-16T13:19:01+0000', jobid : '946cb1c8-37cb-4110-b5ea-529da895d05e'}
   === TestName: test_hostha_enable_ha_when_host_in_maintenance | Status : EXCEPTION ===
   ```





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897071944


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908413093


   @DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897043975


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865323546


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_multiplication_x: debian. SL-JID 315





[GitHub] [cloudstack] nvazquez commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1056853871


   Hi @GabrielBrascher is this PR ready or still in progress? 





[GitHub] [cloudstack] nvazquez commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
nvazquez commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063214604


   @blueorangutan test





[GitHub] [cloudstack] PaulAngus commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
PaulAngus commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-933496393


   Correct me if I'm wrong, but this looks like it's relying on HTTP over 8080?
   
   This will absolutely have to be HTTPS with strong auth at both ends.
   





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-1063126480


   @blueorangutan package





[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-914908991


   @blueorangutan package 





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-914931587


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_multiplication_x: suse15. SL-JID 1157





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897068424


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian. SL-JID 847





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-915522164


   <b>Trillian test result (tid-1989)</b>
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 47348 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4978-t1989-kvm-centos7.zip
   Smoke tests completed. 87 look OK, 2 have errors
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_07_deploy_kubernetes_ha_cluster | `Failure` | 3608.97 | test_kubernetes_clusters.py
   test_08_deploy_and_upgrade_kubernetes_ha_cluster | `Failure` | 0.07 | test_kubernetes_clusters.py
   test_09_delete_kubernetes_ha_cluster | `Failure` | 0.05 | test_kubernetes_clusters.py
   ContextSuite context=TestKubernetesCluster>:teardown | `Error` | 73.57 | test_kubernetes_clusters.py
   test_hostha_kvm_host_degraded | `Failure` | 769.91 | test_hostha_kvm.py
   





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-875991759


   Fixed conflict.
   As the conflict impacted some of the initial commits, rebasing the branch `kvm-ha-microservice-client` onto the updated `main` was tricky, which led to this list of updated commits/hashes.





[GitHub] [cloudstack] GabrielBrascher commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r678264556



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHostActivityChecker.java
##########
@@ -213,6 +270,17 @@ protected boolean verifyActivityOfStorageOnHost(HashMap<StoragePool, List<Volume
         return poolVolMap;
     }
 
+    private boolean isHostServedByNfsPool(Host agent) {
+        List<StoragePoolHostVO> storagesOnHost = storagePoolHostDao.listByHostId(agent.getId());
+        for (StoragePoolHostVO storagePoolHostRef : storagesOnHost) {
+            StoragePoolVO storagePool = this.storagePool.findById(storagePoolHostRef.getPoolId());
+            if (NFS_POOL_TYPE.contains(storagePool.getPoolType())) {
+                return true;
+            }
+        }
+        return false;

Review comment:
       @GutoVeronezi here I think that it is a bit trickier than a normal loop iteration with a simple if,
   such as the other one:
   ```
   for (StoragePoolVO pool : zonePools) {
           if (pool.getPoolType() == StoragePoolType.NetworkFilesystem) {
                   return true;
           }
   }
   ```
   
   Here the list does not store the object that is being used in the conditional, but instead just an object (`StoragePoolHostVO`) that will serve to link to the object in question (`StoragePoolVO`). Thus the stream `.anyMatch` would not serve here.
   
   Please correct me if I am wrong in this one.
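
For reference, a `map` step placed before `anyMatch` may be able to bridge the link object and the pool. The following is a sketch under assumptions, not a definitive rewrite: `PoolType`, `PoolHostRef`, and the `Map`-based lookup are hypothetical stand-ins for `StoragePoolType`, `StoragePoolHostVO`, and the DAO's `findById`.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch: PoolType, PoolHostRef and the Map lookup are simplified
// stand-ins for StoragePoolType, StoragePoolHostVO and the pool DAO's findById.
class NfsPoolSketch {
    enum PoolType { NetworkFilesystem, RBD, SharedMountPoint }

    static final Set<PoolType> NFS_POOL_TYPE = EnumSet.of(PoolType.NetworkFilesystem);

    static final class PoolHostRef {
        final long poolId;
        PoolHostRef(long poolId) { this.poolId = poolId; }
    }

    static boolean isHostServedByNfsPool(List<PoolHostRef> refs, Map<Long, PoolType> poolTypeById) {
        return refs.stream()
                .map(ref -> poolTypeById.get(ref.poolId)) // resolve the link object to its pool type
                .filter(Objects::nonNull)                 // skip refs whose pool was not found
                .anyMatch(NFS_POOL_TYPE::contains);       // stops at the first NFS pool
    }
}
```

Because streams are lazy, the lookup in `map` runs only until the first match, which is the same early exit the explicit loop provides.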







[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-881619447


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r629572678



##########
File path: plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KvmHaAgentClient.java
##########
@@ -0,0 +1,271 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cloudstack.kvm.ha;
+
+import com.cloud.host.Host;
+import com.cloud.utils.exception.CloudRuntimeException;
+import com.cloud.vm.VMInstanceVO;
+import com.cloud.vm.VirtualMachine;
+import com.cloud.vm.dao.VMInstanceDao;
+import com.google.gson.JsonObject;
+import com.google.gson.JsonParser;
+import org.apache.commons.httpclient.HttpStatus;
+import org.apache.http.HttpResponse;
+import org.apache.http.client.HttpClient;
+import org.apache.http.client.methods.HttpGet;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.client.utils.URIBuilder;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.log4j.Logger;
+import org.jetbrains.annotations.Nullable;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class provides a client that checks Agent status via a webserver.
+ *
+ * The additional webserver exposes a simple JSON API which returns a list
+ * of Virtual Machines that are running on that host according to libvirt.
+ *
+ * This way, KVM HA can verify, via libvirt, VMs status with a HTTP-call
+ * to this simple webserver and determine if the host is actually down
+ * or if it is just the Java Agent which has crashed.
+ */
+public class KvmHaAgentClient {
+
+    private static final Logger LOGGER = Logger.getLogger(KvmHaAgentClient.class);
+    private static final int ERROR_CODE = -1;
+    private static final String EXPECTED_HTTP_STATUS = "2XX";
+    private static final String VM_COUNT = "count";
+    private static final int WAIT_FOR_REQUEST_RETRY = 2;
+    private static final int MAX_REQUEST_RETRIES = 2;
+    private static final int CAUTIOUS_MARGIN_OF_VMS_ON_HOST = 1;
+    private Host agent;
+
+    /**
+     * Instantiates a webclient that checks, via a webserver running on the KVM host, the VMs running
+     */
+    public KvmHaAgentClient(Host agent) {
+        this.agent = agent;
+    }
+
+    /**
+     *  Returns the number of VMs running on the KVM host according to libvirt.
+     */
+    protected int countRunningVmsOnAgent() {
+        String url = String.format("http://%s:%d", agent.getPrivateIpAddress(), getKvmHaMicroservicePortValue());
+        HttpResponse response = executeHttpRequest(url);
+
+        if (response == null)
+            return ERROR_CODE;
+
+        JsonObject responseInJson = processHttpResponseIntoJson(response);
+        if (responseInJson == null) {
+            return ERROR_CODE;
+        }
+
+        return responseInJson.get(VM_COUNT).getAsInt();
+    }
+
+    protected int getKvmHaMicroservicePortValue() {
+        Integer haAgentPort = KVMHAConfig.KvmHaWebservicePort.value();
+        if (haAgentPort == null) {
+            LOGGER.warn(String.format("Using default kvm.ha.webservice.port: %s as it was set to NULL for the cluster [id: %d] from %s.", KVMHAConfig.KvmHaWebservicePort.defaultValue(), agent.getClusterId(), agent));
+            haAgentPort = Integer.parseInt(KVMHAConfig.KvmHaWebservicePort.defaultValue());
+        }
+        return haAgentPort;
+    }
+
+    /**
+     * Checks if the KVM HA Webservice is enabled or not; if disabled then CloudStack ignores HA validation via the webservice.
+     */
+    public boolean isKvmHaWebserviceEnabled() {
+        return KVMHAConfig.IsKvmHaWebserviceEnabled.value();
+    }
+
+    /**
+     * Lists VMs on host according to vm_instance DB table. The states considered for such listing are: 'Running', 'Stopping', 'Migrating'.
+     * <br>
+     * <br>
+     * Note that VMs in the 'Starting' state are usually not yet on the host, therefore this method does not list them.
+     * A VM in the 'Starting' state may already be listed on the KVM host via '$virsh list',
+     * but that is unlikely and thus not relevant in this context.
+     */
+    protected List<VMInstanceVO> listVmsOnHost(Host host, VMInstanceDao vmInstanceDao) {
+        List<VMInstanceVO> listByHostAndStateRunning = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Running);
+        List<VMInstanceVO> listByHostAndStateStopping = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Stopping);
+        List<VMInstanceVO> listByHostAndStateMigrating = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Migrating);
+
+        List<VMInstanceVO> listByHostAndState = new ArrayList<>();
+        listByHostAndState.addAll(listByHostAndStateRunning);
+        listByHostAndState.addAll(listByHostAndStateStopping);
+        listByHostAndState.addAll(listByHostAndStateMigrating);
+
+        if (LOGGER.isTraceEnabled()) {
+            List<VMInstanceVO> listByHostAndStateStarting = vmInstanceDao.listByHostAndState(host.getId(), VirtualMachine.State.Starting);
+            int startingVMs = listByHostAndStateStarting.size();
+            int runningVMs = listByHostAndStateRunning.size();
+            int stoppingVms = listByHostAndStateStopping.size();
+            int migratingVms = listByHostAndStateMigrating.size();
+            int countRunningVmsOnAgent = countRunningVmsOnAgent();
+            LOGGER.trace(
+                    String.format("%s has (%d Starting) %d Running, %d Stopping, %d Migrating. Total listed via DB %d / %d (via libvirt)", agent.getName(), startingVMs, runningVMs, stoppingVms,
+                            migratingVms, listByHostAndState.size(), countRunningVmsOnAgent));
+        }
+
+        return listByHostAndState;
+    }
+
+    /**
+     *  Returns true if the expected number of VMs matches the number of VMs running on the KVM host according to Libvirt. <br><br>
+     *
+     *  IF: <br>
+     *  (i) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 2 or more VMs running: returns false, as it found no running VMs while expecting at least
+     *    2; fencing/recovering the host would avoid downtime for the VMs in this case.<br>
+     *  (ii) the KVM HA agent finds 0 VMs running but CloudStack considers that the host has 1 VM running: returns true, logs WARN messages, and avoids triggering HA recovery/fencing,
+     *    as this could be an inconsistency while migrating a VM.<br>
+     *  (iii) the number of listed VMs differs from the expected number: returns true and logs WARN messages so admins can monitor and react accordingly.
+     */
+    public boolean isKvmHaAgentHealthy(Host host, VMInstanceDao vmInstanceDao) {
+        int numberOfVmsOnHostAccordingToDB = listVmsOnHost(host, vmInstanceDao).size();

Review comment:
       A simple change, just to keep the camel case consistent:
   ```java
   int numberOfVmsOnHostAccordingToDb = listVmsOnHost(host, vmInstanceDao).size();
   ```
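
The Javadoc of `isKvmHaAgentHealthy` in the hunk above describes three cases. As a reading aid, here is a minimal sketch of that decision, assuming only what the Javadoc states. The class `KvmHaHealthSketch`, the method `isHealthy`, and its parameter names are hypothetical and are not the PR's actual method body; how a failed webservice call (`ERROR_CODE`) is handled is not visible in this hunk and is left out.

```java
// Hypothetical sketch; names are illustrative and not taken from the PR.
final class KvmHaHealthSketch {

    /**
     * Mirrors cases (i) to (iii) from the quoted Javadoc.
     * vmsAccordingToDb: VMs CloudStack expects on the host (vm_instance table).
     * vmsAccordingToLibvirt: VMs reported by the HA helper webservice.
     */
    static boolean isHealthy(int vmsAccordingToDb, int vmsAccordingToLibvirt) {
        if (vmsAccordingToLibvirt == 0 && vmsAccordingToDb >= 2) {
            // Case (i): no VM found although at least 2 were expected.
            return false;
        }
        if (vmsAccordingToLibvirt == 0 && vmsAccordingToDb == 1) {
            // Case (ii): possibly an inconsistency during a migration; warn only.
            System.err.println("WARN: expected 1 VM but libvirt reports 0");
            return true;
        }
        if (vmsAccordingToLibvirt != vmsAccordingToDb) {
            // Case (iii): counts differ; warn so admins can react.
            System.err.println("WARN: expected " + vmsAccordingToDb
                    + " VMs but libvirt reports " + vmsAccordingToLibvirt);
            return true;
        }
        // Counts match.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(3, 0)); // false, case (i)
        System.out.println(isHealthy(1, 0)); // true, case (ii)
        System.out.println(isHealthy(5, 4)); // true, case (iii)
        System.out.println(isHealthy(2, 2)); // true, counts match
    }
}
```

Only case (i) reports the host as unhealthy; cases (ii) and (iii) only log warnings, which matches the Javadoc's stated intent of not triggering recovery or fencing on a small mismatch.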




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-899833092


   @GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897088365


   Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian. SL-JID 848





[GitHub] [cloudstack] DaanHoogland commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
DaanHoogland commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-908412813


   @blueorangutan package





[GitHub] [cloudstack] GutoVeronezi commented on a change in pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GutoVeronezi commented on a change in pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#discussion_r655691011



##########
File path: plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java
##########
@@ -101,24 +115,29 @@ public Status isAgentAlive(Host agent) {
                 hostStatus = answer.getResult() ? Status.Down : Status.Up;
             }
         } catch (Exception e) {
-            s_logger.debug("Failed to send command to host: " + agent.getId());
+            s_logger.debug(String.format("Failed to send command to %s", agent));

Review comment:
       @GabrielBrascher indeed, I see no way for it to throw an exception.







[GitHub] [cloudstack] rhtyd commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
rhtyd commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870343799


   @blueorangutan package





[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-870390962


   Packaging result: :heavy_multiplication_x: centos7 :heavy_multiplication_x: centos8 :heavy_check_mark: debian. SL-JID 407





[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-897043667









[GitHub] [cloudstack] blueorangutan commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
blueorangutan commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-865317286









[GitHub] [cloudstack] GabrielBrascher commented on pull request #4978: KVM High Availability regardless of storage

Posted by GitBox <gi...@apache.org>.
GabrielBrascher commented on pull request #4978:
URL: https://github.com/apache/cloudstack/pull/4978#issuecomment-849703052


   @rhtyd thanks for running the packaging.
   
   I think I know why it is failing for CentOS 7 and 8: I still need to cover CentOS packaging for the new `agent-ha-helper`. For now it is packaged only for Ubuntu. I will update this PR to add it.
   
   This also explains why all Jenkins/Travis checks are green while the packaging results for CentOS are failing.




