Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/03/30 23:19:24 UTC

[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #1387: SOLR-14210: Include replica health in healtcheck handler

HoustonPutman commented on a change in pull request #1387: SOLR-14210: Include replica health in healtcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r400326019
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
 ##########
 @@ -88,15 +95,42 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
       return;
     }
 
-    // Set status to true if this node is in live_nodes
-    if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
-      rsp.add(STATUS, OK);
-    } else {
+    // Fail if not in live_nodes
+    if (!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
       rsp.add(STATUS, FAILURE);
       rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+      return;
     }
 
-    rsp.setHttpCaching(false);
+    // Optionally require that all cores on this node are active if param 'failWhenRecovering=true'
+    if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+      List<String> unhealthyCores = findUnhealthyCores(clusterState, cores.getNodeConfig().getNodeName());
+      if (unhealthyCores.size() > 0) {
+          rsp.add(STATUS, FAILURE);
+          rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+                  "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+          return;
+      }
+      rsp.add("MESSAGE", "All cores are healthy");
+    }
+
+    // All lights green, report healthy
+    rsp.add(STATUS, OK);
+  }
+
+  /**
+   * Find replicas DOWN or RECOVERING
+   * @param clusterState clusterstate from ZK
+   * @param nodeName this node name
+   * @return list of core names that are either DOWN ore RECOVERING on 'nodeName'
+   */
+  static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName) {
+    return clusterState.getCollectionsMap().values().stream()
 
 Review comment:
   This should still be relatively fast with hundreds of collections and thousands of replicas.
   But it would be nice to get some performance tests before this gets merged in.
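
As a rough starting point for such a test, here is a minimal sketch that times one scan over a synthetic cluster of 10,000 replicas. Everything in it is an assumption made for illustration: `ReplicaInfo`, `syntheticCluster` and `unhealthyOn` are hypothetical stand-ins, not Solr classes, and the real `ClusterState` traversal in the PR may cost more or less than a flat list. A single un-warmed timing like this is only a sanity check; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class HealthCheckScanTiming {
  enum State { ACTIVE, RECOVERING, DOWN }

  // Hypothetical stand-in for one replica entry in cluster state (not Solr's Replica class).
  record ReplicaInfo(String coreName, String nodeName, State state) {}

  // Builds collections * replicasPerCollection replicas spread round-robin over the nodes.
  // Every 100th replica is marked RECOVERING so the scan has something to find.
  static List<ReplicaInfo> syntheticCluster(int collections, int replicasPerCollection, int nodes) {
    List<ReplicaInfo> all = new ArrayList<>();
    for (int c = 0; c < collections; c++) {
      for (int r = 0; r < replicasPerCollection; r++) {
        int id = c * replicasPerCollection + r;
        State state = (id % 100 == 0) ? State.RECOVERING : State.ACTIVE;
        all.add(new ReplicaInfo("coll" + c + "_replica" + r, "node" + (id % nodes), state));
      }
    }
    return all;
  }

  // Same shape as the check in the PR: keep replicas on this node that are not ACTIVE.
  static List<String> unhealthyOn(List<ReplicaInfo> replicas, String nodeName) {
    return replicas.stream()
        .filter(r -> r.nodeName().equals(nodeName))
        .filter(r -> r.state() != State.ACTIVE)
        .map(ReplicaInfo::coreName)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<ReplicaInfo> cluster = syntheticCluster(500, 20, 10); // 10,000 replicas in total
    long start = System.nanoTime();
    List<String> unhealthy = unhealthyOn(cluster, "node0");
    long micros = (System.nanoTime() - start) / 1_000;
    System.out.println(unhealthy.size() + " unhealthy cores found in " + micros + " us");
  }
}
```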
   
   One question, since I'm not too familiar with "active" slices: "inactive" slices are the new shards from a shard split that has not completed yet, right? If so, maybe we want to report the node as unhealthy if it hosts any replicas from inactive slices. Otherwise, taking the node down could hamper a shard split.
   
   Please correct me if I'm wrong on any of those statements.
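
To make the suggestion concrete, here is a minimal sketch of the rule, assuming the reading of inactive slices in the question above is right. The types are hypothetical simplifications for illustration only: `ReplicaInfo`, its `sliceActive` flag, and the `failOnInactiveSlices` switch do not exist in the PR or in Solr, and the real method works on `ClusterState` rather than a flat list.

```java
import java.util.List;
import java.util.stream.Collectors;

public class InactiveSliceCheckSketch {
  enum ReplicaState { ACTIVE, RECOVERING, DOWN }

  // Hypothetical flattened view of a replica: its core, its node, whether its slice
  // is active, and the replica's own state.
  record ReplicaInfo(String coreName, String nodeName, boolean sliceActive, ReplicaState state) {}

  /**
   * Returns the cores on nodeName that should make the health check fail.
   * Replicas of active slices are unhealthy when DOWN or RECOVERING.
   * Replicas of inactive slices are reported only when failOnInactiveSlices is true.
   */
  static List<String> findUnhealthyCores(List<ReplicaInfo> replicas, String nodeName,
                                         boolean failOnInactiveSlices) {
    return replicas.stream()
        .filter(r -> r.nodeName().equals(nodeName))
        .filter(r -> r.sliceActive()
            ? (r.state() == ReplicaState.DOWN || r.state() == ReplicaState.RECOVERING)
            : failOnInactiveSlices)
        .map(ReplicaInfo::coreName)
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<ReplicaInfo> replicas = List.of(
        new ReplicaInfo("c1_s1_r1", "node1", true, ReplicaState.RECOVERING),
        new ReplicaInfo("c1_s2_r1", "node1", true, ReplicaState.ACTIVE),
        new ReplicaInfo("c1_s1_0_r1", "node1", false, ReplicaState.ACTIVE),
        new ReplicaInfo("c2_s1_r1", "node2", true, ReplicaState.DOWN));

    // Behaviour without the extra rule, then with it.
    System.out.println(findUnhealthyCores(replicas, "node1", false));
    System.out.println(findUnhealthyCores(replicas, "node1", true));
  }
}
```

Keeping the inactive-slice rule behind a flag in this sketch is a deliberate choice: it lets the stricter behaviour be compared against the current one without committing to either.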
