Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2023/01/12 14:59:35 UTC

[GitHub] [solr-operator] endzyme commented on pull request #511: Use smarter probes for SolrCloud and SolrPrometheusExporter

endzyme commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1380505771

   Thanks for reaching out about considerations. As Josh mentioned, we're not deviating from the current defaults in 0.6.0. 
   
   I wanted to add on to what Josh mentioned above.
   
   This application is a little different from other apps because all the nodes are clustered and route traffic among themselves based on which data they hold for which collections. That matters because it changes the contextual purpose of readiness probes. Readiness probes mostly affect the "ready" status of the endpoint that associates the pod with any Services it's part of. Since traffic can still reach a node even after it's been pulled from the Service, the readiness probe doesn't really have much effect on incoming requests. I'd have to think a little more about the intended value of readiness given how Solr works. To be candid, I'm not sure whether the operator configures communication between nodes via Services or directly with pod names; if it's via Services, then readiness probes could impact communication between nodes in the SolrCloud.
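
   To make that concrete, here's a rough sketch in Go, using the upstream corev1 types the operator builds on (field names follow recent versions of k8s.io/api). The endpoint, port, and thresholds are illustrative assumptions on my part, not the operator's actual defaults:

       // Illustrative only: a readiness probe built from k8s.io/api types.
       // Path, port, and thresholds are assumptions, not operator defaults.
       package probes

       import (
           corev1 "k8s.io/api/core/v1"
           "k8s.io/apimachinery/pkg/util/intstr"
       )

       func exampleReadinessProbe() *corev1.Probe {
           return &corev1.Probe{
               ProbeHandler: corev1.ProbeHandler{
                   HTTPGet: &corev1.HTTPGetAction{
                       Path: "/solr/admin/info/health", // assumed health endpoint
                       Port: intstr.FromInt(8983),
                   },
               },
               InitialDelaySeconds: 20, // let the node come up before checking
               PeriodSeconds:       10,
               FailureThreshold:    3, // pulled from Service endpoints after 3 misses
           }
       }

   Keep the caveat above in mind: being pulled from the Service endpoints only matters if inter-node traffic actually goes through the Service.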
   
   As for liveness probes, the main value I see is when restarting the Java process would actually resolve a problem. Liveness should really only trigger when the node cannot perform the most basic tasks but the process still appears to be running. Things that come to mind are an inability to read from disk, causing many 500s while still technically "running", or "runaway threads" that overwhelm the process and mean it should be terminated to recover service availability. That said, the settings should be a blend of a service-critical KPI and how long you'd be comfortable having the pod unavailable.
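
   If it helps to picture it, here's the same kind of sketch for a deliberately lenient liveness probe, one meant to fire only when the JVM is truly wedged. Again, the endpoint and numbers are my assumptions, not values taken from the operator, and it reuses the package and imports from the readiness sketch above:

       // Illustrative only: a lenient liveness probe, meant to fire only when
       // the JVM is truly wedged. Same package/imports as the readiness sketch.
       func exampleLivenessProbe() *corev1.Probe {
           return &corev1.Probe{
               ProbeHandler: corev1.ProbeHandler{
                   HTTPGet: &corev1.HTTPGetAction{
                       Path: "/solr/admin/info/system", // assumed "is the process responsive at all" check
                       Port: intstr.FromInt(8983),
                   },
               },
               InitialDelaySeconds: 60, // give the JVM time to start
               PeriodSeconds:       15,
               TimeoutSeconds:      5,
               FailureThreshold:    6, // ride out transient slowness; restart only a truly stuck node
           }
       }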
   
   All of the above are just considerations based on the original intent of these probes, and they should definitely be weighed against how SolrCloud is intended to operate.
   
   One more consideration we commonly see in the wild is a negative feedback loop on liveness probes. This one is pretty tricky, but the simplest example is an application under too much load failing its liveness probe, which restarts it; while that pod is restarting, the extra load lands on the remaining pods, which then fail their own liveness probes, and the restarts cascade. These loops are usually the result of overly aggressive liveness probe settings. In Java, the largest contributing factor is generally resource starvation, like under-allotting CPU or memory, which can lead to GC issues.
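
   As a back-of-the-envelope way to reason about that trade-off: a wedged (or merely overloaded) pod survives roughly failureThreshold x periodSeconds before the kubelet restarts it. A tiny sketch with made-up numbers, just to show how far apart two settings can land:

       // Illustrative arithmetic only: roughly how long a failing pod lives
       // before the kubelet restarts it. Numbers are made up, not recommendations.
       func secondsUntilRestart(periodSeconds, failureThreshold, timeoutSeconds int) int {
           // failureThreshold consecutive failures, spaced periodSeconds apart,
           // with the last probe allowed up to timeoutSeconds before it counts as failed.
           return failureThreshold*periodSeconds + timeoutSeconds
       }

       // secondsUntilRestart(5, 2, 1)  ~= 11s -> restarts fast, feeds the cascade under load
       // secondsUntilRestart(15, 6, 5) ~= 95s -> rides out a transient load spike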
   
   Anyway, not sure if the last paragraph is very actionable, but it's something to keep in mind when initially tuning for aggressive restarts versus a more generous window before a service is restarted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

