Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2023/01/11 19:20:38 UTC
[GitHub] [solr-operator] joshsouza commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

joshsouza commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1379371469

   At the moment we're not tuning the liveness/readiness/startup probes from the defaults the operator provides.
   That said, here are my thoughts:
   1. Liveness and Readiness are two different concepts, and the general advice I've seen settled on is that they should rarely be the same check.
       - Liveness should ensure that the application is running (i.e. it hasn't failed in a way that leaves pid 1 in the container up while the app itself is not running). When this check fails, the container should be killed and restarted.
       - Startup should check the same thing liveness does, but account for the worst-case startup delay (the app takes 5 minutes to boot, etc.). A startup probe should be preferred over an initial delay on the liveness check (the liveness check won't begin until the startup probe succeeds), and generally isn't needed unless the app takes time to boot up.
       - Readiness should ensure that the application is prepared to handle incoming requests, so that the pod is a good target for the load balancer/service to send traffic to. (Usually this is a live endpoint; in this case the metrics one makes a lot of sense to me.)
   2. I'm not sure if using the same endpoint for both liveness and readiness is appropriate for this particular application, but the questions that come to mind are: Can this be overloaded to the point where it's still valid/running, but needs to process requests in flight before it can handle more? If so, we need to make sure the liveness probe _won't_ fail in that scenario, but the readiness check _would_, so that it stops receiving new traffic, but is permitted to finish the in-flight requests.
   3. Given what I can infer here (that there's only the one endpoint, and that it could be overloaded, but we wouldn't want it to remain overloaded for an extended period of time), I would suggest (opinions, definitely up for debate):
       - Increase the failure threshold for the startup probe from 5 to 15 (right now the pod has 10s to start, or it is considered a failure. I would bump that to 30s, but that's just me)
       - Increase the failure threshold on the liveness probe from 3 to 6 (with a period of 10s, this means that if the pod can't handle any requests for a minute, it's probably best to kill it off)
       - Reduce the period on the readiness probe from 10s to 5s (so if it can't respond for 15s, drop it from the service. That gives it 45s to sort out requests already in flight before the liveness probe considers it dead at the 60s mark)
       - Or, ideally, separate the readiness and liveness checks to match their purpose more closely (checking that the pid is running for liveness, etc.)
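
To make the arithmetic behind those numbers easier to check, here is a rough sketch in Python. It treats the time a probe tolerates failure as simply period x failure threshold, ignoring timeouts and initial delays. The `Probe` class and `drain_time` helper are hypothetical names made up for illustration. The 2s startup period and the readiness failure threshold of 3 are assumptions inferred from the figures above (5 failures = 10s, and 15s at a 5s period), not values read from the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical stand-in for the two probe fields discussed above."""
    period_seconds: int
    failure_threshold: int

    def failure_window(self) -> int:
        # Approximate seconds of consecutive failures tolerated before
        # the probe is considered failed (timeouts/initial delay ignored).
        return self.period_seconds * self.failure_threshold


# Assumed current values: startup period (2s) and readiness threshold (3)
# are inferred from the comment's numbers, not confirmed defaults.
current = {
    "startup": Probe(period_seconds=2, failure_threshold=5),
    "liveness": Probe(period_seconds=10, failure_threshold=3),
    "readiness": Probe(period_seconds=10, failure_threshold=3),
}

# Values suggested in the list above.
proposed = {
    "startup": Probe(period_seconds=2, failure_threshold=15),
    "liveness": Probe(period_seconds=10, failure_threshold=6),
    "readiness": Probe(period_seconds=5, failure_threshold=3),
}


def drain_time(probes: dict) -> int:
    # Seconds between being dropped from the service (readiness fails)
    # and being killed (liveness fails): time to finish in-flight work.
    return (probes["liveness"].failure_window()
            - probes["readiness"].failure_window())


if __name__ == "__main__":
    for label, probes in (("current", current), ("proposed", proposed)):
        windows = {name: p.failure_window() for name, p in probes.items()}
        print(label, windows, "drain:", drain_time(probes))
```

Under these assumptions the current settings leave no gap between "stop sending traffic" and "kill", while the proposed ones leave the 45s mentioned above.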
   
   Just some brain-dump on my gut reaction, very much open to discussion and/or correction. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

