Posted to issues@lucene.apache.org by "Jan Høydahl (Jira)" <ji...@apache.org> on 2020/03/29 20:32:00 UTC
[jira] [Comment Edited] (SOLR-14210) Introduce Node-level status handler for replicas

    [ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070166#comment-17070166 ] 

Jan Høydahl edited comment on SOLR-14210 at 3/29/20, 8:31 PM:
--------------------------------------------------------------

See https://github.com/apache/lucene-solr/pull/1387 for a first attempt at this. If the param {{&failWhenRecovering=true}} is passed to {{/api/node/health}}, it will return 503 if one or more replicas on the node are in state {{DOWN}} or {{RECOVERING}}.
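
To make the intended behaviour concrete, here is a rough sketch of the decision rule described above. It is not the code from the PR: the function name {{health_status}} and the constant {{UNHEALTHY_STATES}} are made up for illustration, and the sketch ignores any other checks the real handler performs before answering.

```python
from typing import Iterable

# Replica states that should make the node report "not ready",
# as described in the comment above.
UNHEALTHY_STATES = {"DOWN", "RECOVERING"}


def health_status(replica_states: Iterable[str],
                  fail_when_recovering: bool = False) -> int:
    """Return the HTTP status a node-level health check might send.

    Hypothetical helper: only models the failWhenRecovering rule.
    """
    if fail_when_recovering and any(
        state.upper() in UNHEALTHY_STATES for state in replica_states
    ):
        return 503  # at least one replica is DOWN or RECOVERING
    return 200      # flag not set, or no replica in an unhealthy state
```

With the flag set, a single recovering replica is enough to turn the answer into 503; without the flag, the replica states do not affect the result.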


was (Author: janhoy):
See https://github.com/apache/lucene-solr/pull/1387 for a first attempt of this. If param {{&failWhenRecovering=true}} is passed to {{/api/node/health}} then it will return 503 if one or more cores on the node are in states {{RECOVERY}} or {{CONSTRUCTION}}.

> Introduce Node-level status handler for replicas
> ------------------------------------------------
>
>                 Key: SOLR-14210
>                 URL: https://issues.apache.org/jira/browse/SOLR-14210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: master (9.0), 8.5
>            Reporter: Houston Putman
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes, we need 'liveness' and 'readiness' probes, explained in [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/], to determine if a node is live and ready to serve live traffic.
> {quote}
>  
> However, there are issues around Kubernetes managing its own rolling restarts. With the current healthcheck setup, it's easy to envision a scenario in which Solr reports itself as "healthy" when all of its replicas are actually recovering. Kubernetes, seeing a healthy pod, would then go and restart the next Solr node. This can continue until all replicas are "recovering" and none are healthy (the last one restarted may be "down", but there are still no "active" replicas).
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all replicas hosted by that Solr node are healthy and "active". That way we will be able to use the [default Kubernetes rolling restart logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] with Solr.
> To add on to [Jan's point here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], this handler should be more friendly for other Content-Types and should use better HTTP response statuses.
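
For illustration only, an orchestrator or a script could gate a rolling restart on such a handler roughly as sketched below. The endpoint path and the {{failWhenRecovering}} parameter are taken from the comment at the top of this message; the helper names {{fetch_status}} and {{node_is_ready}} are invented for this sketch, and treating anything other than 200 as "not ready" is an assumption rather than documented behaviour.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET on url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what matters here.
        return err.code


def node_is_ready(base_url: str,
                  fetch: Callable[[str], int] = fetch_status) -> bool:
    """True only if the node answers 200 to the health check (assumed rule)."""
    url = base_url.rstrip("/") + "/api/node/health?failWhenRecovering=true"
    try:
        return fetch(url) == 200
    except OSError:
        # Connection refused, timeout, DNS failure: treat as not ready.
        return False
```

A restart script might call {{node_is_ready("http://localhost:8983")}} in a loop (host and port are assumptions here) and only move on to the next node once it returns true. The {{fetch}} argument exists so the logic can be exercised without a running Solr node.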


