Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/12/22 21:16:56 UTC

[GitHub] [solr-operator] HoustonPutman opened a new pull request, #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

HoustonPutman opened a new pull request, #511:
URL: https://github.com/apache/solr-operator/pull/511

   Resolves #510 
   
   This is a WIP but the goal is to have smarter probes that work better for most installations of Solr and the Prometheus Exporter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr-operator] HoustonPutman commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by "HoustonPutman (via GitHub)" <gi...@apache.org>.
HoustonPutman commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1402784282

   Thanks for both of your thoughts!
   
   As for the question of whether the liveness and readiness endpoints should be the same: eventually, I don't think they should be. I like `admin/info/system` as the handler for liveness (the same as it is now), since it basically just responds if Solr is running. Once 8.0 is our minimum version (soon), using `admin/info/health` would be great for the readiness probe, since we want to make sure that Solr can connect to ZooKeeper. Eventually, adding a parameter that checks that most replicas on the host are healthy could be useful, but I think the ZK connection is a good place to start.
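   
   As a sketch, that split might look like the following in the pod spec. The paths are the handlers named above; the port and path prefix are assumptions based on Solr defaults, not taken from the operator's actual templates:
   
   ```yaml
   # Hypothetical probe split: liveness stays on admin/info/system,
   # readiness moves to admin/info/health once Solr 8.0+ is the minimum.
   livenessProbe:
     httpGet:
       path: /solr/admin/info/system   # responds as long as Solr is running
       port: 8983                      # assumed default Solr port
   readinessProbe:
     httpGet:
       path: /solr/admin/info/health   # also checks the ZooKeeper connection
       port: 8983
   ```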
   
   > Can this be overloaded to the point where it's still valid/running, but needs to process requests in flight before it can handle more? If so, we need to make sure the liveness probe won't fail in that scenario, but the readiness check would, so that it stops receiving new traffic, but is permitted to finish the in-flight requests.
   
   I think this is hard to do, since a lot of the request handling could be updates and queries for specific collections, which we can't know about... But I definitely agree it would be great to get there in the end.
   
   >  I would suggest (opinions, definitely up for debate)
   
   (The actual bulk of the changes in this PR.)
   
   **Yeah, I've upped the number of checks in the startup probe to 10, giving the pod about a minute to become healthy. I think that should be enough for the Solr server to start.**
   
   **For the others I agree. I have the liveness probe set as 3 checks at a 20s period, giving us 40s-1m of "downtime" before taking down the pod. Readiness is set as two 10-second checks, so if ZK isn't available for 10-20 seconds, requests won't be routed to that node. But if it's a blip, one good readiness check and it's back in the list.**
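   
   Translated into probe settings, those numbers would look roughly like this. These values are illustrative; the startup period in particular is an assumption chosen to make 10 checks span about a minute, not the operator's actual default:
   
   ```yaml
   startupProbe:
     periodSeconds: 5        # assumed; 10 failed checks ~= 1 minute to come up
     failureThreshold: 10
   livenessProbe:
     periodSeconds: 20
     failureThreshold: 3     # 40s-1m of failures before the pod is restarted
   readinessProbe:
     periodSeconds: 10
     failureThreshold: 2     # 10-20s of ZK unavailability drops the node
     successThreshold: 1     # one good check puts it back in rotation
   ```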
   
   > To be candid, I'm not sure if the operator configures communication between nodes via services or directly with pod names. If configured with services then readiness probes could impact communication between nodes in the SolrCloud.
   
   So for node-level endpoints (the headless service and the individual pod services for ingresses), the readiness check is not used for routing, since we use the `publishNotReadyAddresses: true` option for these services. The only service that doesn't use this option is the solrcloud-common service, which is what the readiness probe would be impacting. Also, the Solr operator's rolling restart logic only uses the readiness probe when calculating the maxPodsDown option, so a readiness probe that is more likely to return errors will slow down rolling restarts, but probably not to a large degree.
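   
   For reference, `publishNotReadyAddresses` is a standard field on the Kubernetes Service spec. A minimal sketch of the headless-service behavior described above (the name and selector label are placeholders, not the operator's actual values):
   
   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: example-solrcloud-headless   # hypothetical name
   spec:
     clusterIP: None                    # headless: DNS resolves to pod IPs
     publishNotReadyAddresses: true     # pods stay addressable even when not ready
     selector:
       solr-cloud: example              # illustrative selector label
     ports:
       - port: 8983                     # assumed default Solr port
   ```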
   
   > One more consideration we commonly see in the wild is negative feedback loops for liveness probes.
   
   Yeah, this is definitely not something we want to take lightly. We only want to restart Solr nodes when absolutely necessary.



[GitHub] [solr-operator] HoustonPutman commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1379278452

   This ticket is important since the e2e tests will time out with the unoptimized probes currently in use. We need to merge this before we can merge the e2e tests.



[GitHub] [solr-operator] HoustonPutman merged pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by "HoustonPutman (via GitHub)" <gi...@apache.org>.
HoustonPutman merged PR #511:
URL: https://github.com/apache/solr-operator/pull/511



[GitHub] [solr-operator] joshsouza commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by GitBox <gi...@apache.org>.
joshsouza commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1379371469

   At the moment we're not tuning the liveness/readiness/startup probes from the defaults the operator provides.
   That said, here are my thoughts:
   1. Liveness and Readiness are two different concepts, and the general advice I've seen settled on is that they should rarely be the same check.
       - Liveness should ensure that the application is running (i.e., that it hasn't errored in such a way that pid 1 in the container is up but the app is no longer actually running). When this check fails, the pod should be terminated entirely.
       - Startup should check the same thing liveness does, but account for the worst-case startup delay (the app takes 5 minutes to boot etc...). The Startup probe should take precedence over an initial delay for the liveness check (since the liveness check won't begin until after the startup probe finishes), and generally isn't needed unless the app takes time to boot up.
       - Readiness should ensure that the application is prepared to handle incoming requests, and that it is a good target for the load balancer/service to send traffic to (usually a live endpoint; in this case the metrics one makes a lot of sense to me).
   2. I'm not sure if using the same endpoint for both liveness and readiness is appropriate for this particular application, but the questions that come to mind are: Can this be overloaded to the point where it's still valid/running, but needs to process requests in flight before it can handle more? If so, we need to make sure the liveness probe _won't_ fail in that scenario, but the readiness check _would_, so that it stops receiving new traffic, but is permitted to finish the in-flight requests.
   3. Given what I can infer here, (that there's only the one endpoint, and it could be overloaded, but we wouldn't want it to remain overloaded for an extended period of time) I would suggest (opinions, definitely up for debate): 
       - Increase the failure threshold for the startup probe from 5 to 15 (right now the pod has 10s to start, or it is considered a failure. I would bump that to 30s, but that's just me)
       - Increase the failure threshold on the liveness probe from 3 to 6 (with a period of 10, this means that if the pod can't handle any requests for a minute, it's probably best to kill it off)
       - Reduce the period on the readiness probe from 10s to 5s (so if it can't respond for 15s, drop it from the service. That gives it 45s to sort out requests already in flight before being considered dead)
       - Or, ideally, separate the readiness and liveness checks to match their purpose more closely (checking the pid is running for liveness etc...)
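   
   Put together, those suggested adjustments would amount to something like the following sketch. The periods are back-calculated from the timings quoted above (5 checks in 10s implies a 2s startup period), so treat every value as illustrative rather than the operator's actual defaults:
   
   ```yaml
   startupProbe:
     periodSeconds: 2        # implied by "5 checks = 10s to start"
     failureThreshold: 15    # was 5: allow ~30s instead of ~10s to boot
   livenessProbe:
     periodSeconds: 10
     failureThreshold: 6     # was 3: ~1 minute of failures before the pod is killed
   readinessProbe:
     periodSeconds: 5        # was 10s
     failureThreshold: 3     # ~15s of failures drops it from the service
   ```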
   
   Just some brain-dump on my gut reaction, very much open to discussion and/or correction. :)



[GitHub] [solr-operator] endzyme commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by GitBox <gi...@apache.org>.
endzyme commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1380505771

   Thanks for reaching out about considerations. As Josh mentioned, we're not deviating from the current defaults in 0.6.0. 
   
   I wanted to add on to what Josh mentioned above.
   
   This application is a little different from other apps because all the nodes are clustered and route traffic among themselves based on what data they hold for which collections. This is important because it changes the contextual reason behind readiness probes. Readiness probes mostly affect the "status" of the endpoint associating the pod with any services it's a part of. Since traffic can still hit nodes even when they're pulled from the service, the readiness probe doesn't really have much effect on incoming requests. I'd have to think a little more on the intended value of readiness with how Solr works. To be candid, I'm not sure if the operator configures communication between nodes via services or directly with pod names. If configured with services, then readiness probes could impact communication between nodes in the SolrCloud.
   
   As for liveness probes, the main value I see is in cases where restarting the Java process would actually resolve a problem. Liveness should really only trigger when the node cannot perform the most basic tasks but the process still appears to be running. Things that come to mind are an inability to read from disk, causing many 500s while still technically "running". Another could be "runaway threads" that bog down the process, where it should be terminated to recover service availability. That said, it should be a blend of a service-critical KPI with how long you'd be comfortable having the pod unavailable.
   
   All of the above are just considerations for the original intent of these tools, and they should definitely be weighed against how SolrCloud is intended to operate.
   
   One more consideration we commonly see in the wild is negative feedback loops for liveness probes. This one is pretty tricky, but the simplest example is an application experiencing too much load triggering its liveness probe, which then restarts the app. While the app is restarting, it puts more load on the remaining pods, causing them to also trigger their liveness probes and cascade. These are usually the result of aggressive settings on liveness probes. Generally, in Java, the largest contributing factor would be resource starvation, like under-allotting CPU or memory, which could lead to GC issues.
   
   Anyway, not sure if the last paragraph is very actionable, but it's something to keep in mind when initially tuning for aggressive restarts versus a more acceptable wait before restarting a service.



[GitHub] [solr-operator] HoustonPutman commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1379244686

   Would love some feedback on what users have been setting for their liveness/readiness/startup probes!
   
   @janhoy @endzyme @joshsouza @nosvalds 

