Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2022/09/09 22:26:54 UTC

[GitHub] [solr-operator] joshsouza opened a new issue, #471: How to prevent node rotation behavior from causing cluster instability

joshsouza opened a new issue, #471:
URL: https://github.com/apache/solr-operator/issues/471

   We are just starting out with the Solr Operator, and intend to move several large Solr clusters over to the operator for their management. In our initial tests, we've encountered a situation that seems incredibly risky, and we would like to understand whether a reasonable solution is already in place, or whether there are good suggestions for improving reliability around it.
   
   The logic around `SolrCloud.Spec.updateStrategy` being `Managed` (https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy) means that the operator will never take an action that risks cluster stability (shutting down a pod that would leave a shard with no live replicas, etc.). This is fantastic, but it only covers actions that the operator itself takes (StatefulSet updates, etc.), and doesn't appear to come into play during normal _Kubernetes_ operations, such as node rotations.
   
   On an EKS cluster, when a node group is refreshed, the nodes are marked for termination within their autoscaling groups; their pods are then drained from the nodes being shut down and re-scheduled onto valid nodes. The normal Kubernetes mechanism for preventing service disruptions during this type of event is a Pod Disruption Budget (PDB), which prevents a draining node from evicting its pods if doing so would cause a disruption. PDBs use the pods' Readiness status to determine when a disruption would occur, and are generally a reliable way of preventing applications from becoming unavailable.
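
   For reference, a minimal PDB might look like the following sketch (the name, namespace, and `app` label are hypothetical placeholders):

   ```yaml
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: example-pdb        # hypothetical name
     namespace: default
   spec:
     maxUnavailable: 1        # allow at most one selected pod to be voluntarily disrupted
     selector:
       matchLabels:
         app: example-app     # hypothetical label selecting the pods to protect
   ```

   With this in place, `kubectl drain` (and anything else going through the Eviction API) will refuse to evict a selected pod whenever doing so would exceed the budget.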
   
   With Solr, there is another level of abstraction, as a Solr pod being "ready" doesn't mean that all of the cores on that node are available/replicated. Thus a pod disruption budget, which only monitors that readiness state, may perceive it as safe to delete an arbitrary pod in the cluster, lacking the logic (which the Operator has) to check whether shutting that pod down would cause a disruption.
   
   In a large cluster, nodes/pods that go down may take time to recover, and without a PDB you risk multiple pods going down simultaneously. We therefore perceive a risk that Solr's availability could suffer should a node rotation, or any other form of pod deletion, occur outside the Operator's purview.
   
   So, my question is:
   What methodology is recommended for eliminating this risk? Are there configurations we've overlooked that will reduce this risk? Has the community simply accepted this limitation and found ways to reduce the odds of being impacted? (are we maybe overreacting, and this isn't actually a risk?)
   




[GitHub] [solr-operator] HoustonPutman commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1249533894

   > Just had a thought on this after perusing the docs further to see if there's anything I could find to support our end goals within current constraints: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
   > I can specify a disruption `maxUnavailable` of `0`. This will _prevent any voluntary disruptions_ entirely.
   
   That is very interesting, and could certainly be something for us to look into.
   
   If we go further down that idea, we _could_ have a PDB for each pod individually, and basically set the `minAvailable` to either 0 or 1 depending on whether it's ok to take down that pod at any given time (given the same logic we use for restarts). That gives us a much more fine-tuned ability to control this.
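
   To sketch the idea (the PDB/pod names are hypothetical; the `statefulset.kubernetes.io/pod-name` label is one Kubernetes already puts on StatefulSet pods):

   ```yaml
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: example-solrcloud-solr-0   # hypothetical: one PDB per Solr pod
   spec:
     minAvailable: 1                  # 1 = pod must stay up; flipped to 0 when safe to disrupt
     selector:
       matchLabels:
         statefulset.kubernetes.io/pod-name: example-solrcloud-solr-0
   ```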
   
   > It also occurred to me that if each SolrCloud had a PDB with a maxUnavailable of 0 at all times, the Solr Operator could monitor the cluster for node rotation behavior
   
   This is probably the best solution, if we can get it right. There are things the Solr Operator generally wants to control before letting a pod get deleted, such as moving replicas off of Solr nodes with ephemeral data. So if we are able to do that, then I think we should go for it.
   
   The new [pod disruption conditions](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions) might give us that info, but they're alpha in `v1.25`, so they probably won't be available by default for at least a few more versions. I'm also not sure whether the condition will be put on the pod if the PDB says not to delete it... But it would certainly be the easiest way forward if we wanted to do this.
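
   For reference, if that pans out, the info would surface as a condition on the pod's status, something like this sketch (requires the alpha `PodDisruptionConditions` feature gate in `v1.25`):

   ```yaml
   status:
     conditions:
       - type: DisruptionTarget
         status: "True"
         reason: EvictionByEvictionAPI   # one of the documented disruption reasons
         message: Eviction API was used to evict the pod
   ```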
   
   Either way, we don't need to be perfect from the beginning. I say that for now, we either go cluster-wide PDB or do per-pod PDBs. But I absolutely love this discussion, and with a few new versions of Kubernetes, we can probably get this to an amazing place.




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1248260669

   It also occurred to me that if each SolrCloud had a PDB with a `maxUnavailable` of `0` _at all times_, the Solr Operator could monitor the cluster for node rotation behavior (node drain events, etc.) and take the appropriate action itself (I believe pods can still be deleted, or otherwise shut down). I neither know what would need to be monitored, nor how to shut down Solr pods while a PDB would normally block such actions, but that may be a thought process worth pursuing at some point, since the Operator already has the logic baked in to know when a pod is safe to delete/disrupt.
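
   For what it's worth, my understanding is that a PDB only gates _voluntary_ disruptions issued through the Eviction API, i.e. requests shaped like the sketch below; a direct pod deletion is not blocked by the PDB, which is what would make this approach workable (the pod name/namespace here are hypothetical):

   ```yaml
   # An Eviction request: what `kubectl drain` POSTs to a pod's eviction subresource.
   # A PDB with maxUnavailable: 0 causes this request to be refused, while a plain
   # pod deletion by the operator remains possible.
   apiVersion: policy/v1
   kind: Eviction
   metadata:
     name: example-solrcloud-solr-0   # hypothetical pod name
     namespace: default
   ```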




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1249574050

   Thanks for all the thoughtful discussion. It hadn't even occurred to me to do a per-pod PDB, but that makes a ton of sense given the context, and I would say that's probably the most viable near-term solution (since there's so much in the air for future k8s revisions, and we wouldn't want to require bleeding-edge k8s to run Solr safely).
   
   That said, I think it's worth taking the time to do this right, get other voices, and test things out. In the interim, my team is proceeding with a cluster-wide PDB, plus a pod that flips its availability value between 0 and 1, in order to be overly cautious.
   
   I think that's a reasonable stop-gap for us, but I'd love to help where I can in making this a first-party solution.
   
   How can I best help out?
   
   




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1247002626

   Ideas our team has been tossing around in discussions:
   
   `startupProbe` may also reduce risk (though it still allows for some edge cases). If a newly starting pod had a startup probe that didn't go green until all of the shards assigned to that pod were active/recovered, that could prevent what I described above. However, there's a secondary risk of things getting stuck (i.e. what if there's an inactive shard on that pod?).
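
   As a sketch, that might look like the probe below. I'm assuming Solr's health handler and its `requireHealthyCores` flag are available in the Solr version in use; verify before relying on this:

   ```yaml
   # Hold the pod in "starting" until its local cores report healthy.
   startupProbe:
     httpGet:
       path: /solr/admin/info/health?requireHealthyCores=true
       port: 8983
     periodSeconds: 10
     failureThreshold: 180   # allow up to 30 minutes for large recoveries
   ```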
   
   Sidecar readiness - I need to check the docs and test myself, but I'm curious whether the _entire_ pod needs to pass all of its readiness checks in order to be active in the service, or whether we can leverage a sidecar whose readiness check just verifies that all shards on that pod are in an active/ok state. If that works (i.e. the sidecar flapping doesn't affect the pod's availability to the real Solr service), then a PDB would enforce the desired behavior: it would never allow a pod to be taken out of commission while another pod in the cluster is in a not-ready state. The side effect is that this could be detrimental in situations where it's ok to take down some pods while others are recovering, slowing rotations, etc. So it's still just a half-measure compared to your suggestion of a _real_ check that uses Solr logic to indicate which pods are acceptable to disrupt, via a PDB that ties together the pods that own a shard. I'm just not sure that's on a realistic horizon from the k8s timeline perspective.
   
   Sorry to brain dump, just thought I'd add what's on my mind to the conversation in case it's helpful.




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1247027292

   (Having read the docs more carefully: we can't use the sidecar idea, because a failing sidecar readiness check would mark the whole pod not-ready and drop it from the service.)




[GitHub] [solr-operator] HoustonPutman closed issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
HoustonPutman closed issue #471: How to prevent node rotation behavior from causing cluster instability
URL: https://github.com/apache/solr-operator/issues/471




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1247042942

   Just had a thought on this after perusing the docs further to see if there's anything I could find to support our end goals within current constraints: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
   I can specify a disruption `maxUnavailable` of `0`. This will _prevent any voluntary disruptions_ entirely.
   So the operator could manage the PDB: set `maxUnavailable` to, for example, `1` as long as every shard is happy, but when it detects shards in a recovering state, where an additional pod going down risks reliability, adjust the PDB to `maxUnavailable: 0` until that condition passes. That would prevent additional eviction behavior until it's safe.
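
   Concretely, the toggle would be a single field on the operator-managed PDB; this sketch uses hypothetical names/labels:

   ```yaml
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: example-solrcloud-pdb   # hypothetical operator-managed PDB
   spec:
     maxUnavailable: 1             # flipped to 0 while any shard is recovering
     selector:
       matchLabels:
         solr-cloud: example       # hypothetical label selecting every pod in the cloud
   ```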
   
   I think this is a potentially viable solution until the platform supports multiple PDBs per pod.
   
   What do you think?




[GitHub] [solr-operator] mcarroll1 commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by "mcarroll1 (via GitHub)" <gi...@apache.org>.
mcarroll1 commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1693639328

   Also looking forward to some of the features suggested above... 
   
   This probably won't be the route for the operator, but I'm posting an alternative idea here for others. Our centralized K8s management currently requires that we have a PDB of at least 1, so we can't necessarily go with flipping a cluster-level PDB between 0 and 1.
   
   We were thinking of making a modified version of `/admin/collections?action=CLUSTERSTATUS` that essentially returns a non-200 when the status is non-GREEN. This would involve writing Solr cluster-level plugins, which might not be ideal for those otherwise using vanilla Solr.
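
   The consuming side would then just be an HTTP probe against that handler, e.g. (the path belongs to our hypothetical plugin, not stock Solr):

   ```yaml
   readinessProbe:
     httpGet:
       path: /solr/admin/plugin/clusterhealth   # hypothetical custom handler; non-200 when non-GREEN
       port: 8983
     periodSeconds: 15
   ```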




[GitHub] [solr-operator] HoustonPutman commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1245949988

   This is a very good callout, so thank you for bringing it up.
   
   We can easily add a PodDisruptionBudget for the entire SolrCloud cluster, and its `maxUnavailable` can be populated from the `SolrCloud.spec.updateStrategy.managed.maxPodsUnavailable` value. This is a pretty good first step and gets us halfway there.
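
   Roughly this shape, as a sketch (the generated PDB name and selector label are hypothetical implementation details):

   ```yaml
   # Existing SolrCloud managed-update setting...
   spec:
     updateStrategy:
       method: Managed
       managed:
         maxPodsUnavailable: 1
   ---
   # ...mirrored into a generated cluster-wide PDB:
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: example-solrcloud-pdb    # hypothetical generated name
   spec:
     maxUnavailable: 1              # copied from maxPodsUnavailable
     selector:
       matchLabels:
         solr-cloud: example        # hypothetical label selecting all pods in the cloud
   ```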
   
   The next half would be replicating the `SolrCloud.spec.updateStrategy.managed.maxShardReplicasUnavailable` functionality through PDBs. Through the managed-update code, we already know which nodes each shard resides on, so it wouldn't be far-fetched to create a PDB for every shard, using a custom labelSelector to pick out the nodes that we already know host that shard. We could even just routinely check (every minute or so) to update/create/delete the PDBs, since we aren't listening to the cluster state in the cloud. The [PodDisruptionBudget documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors) tells us that we can't use `maxUnavailable`, as PDBs with custom labelSelectors can only use an integer-valued `minAvailable`. That's fine, because we can always convert between the two, since we know the number of nodes that host the shard.
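
   Such a per-shard PDB might take the following shape (names are hypothetical; the `statefulset.kubernetes.io/pod-name` label is one Kubernetes already applies to StatefulSet pods):

   ```yaml
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: example-collection-shard1   # hypothetical: one PDB per shard
   spec:
     minAvailable: 2                   # integer-valued, converted from maxShardReplicasUnavailable
     selector:
       matchExpressions:
         - key: statefulset.kubernetes.io/pod-name
           operator: In
           values:                     # the pods we know host replicas of shard1
             - example-solrcloud-solr-0
             - example-solrcloud-solr-2
   ```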
   
   However, there's [another rule](https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors) for PDBs that makes this part of the solution untenable: you can only have one `PodDisruptionBudget` per pod, and for this solution we would need a PDB for every shard that lives on a pod, which will almost certainly be more than one. (Otherwise, the general cluster PDB should be fine to use.)
   
   Hopefully Kubernetes will eventually remove the one-PDB-per-pod limit; then we can fully (and without too much difficulty) implement shard-level PDBs managed by the Solr Operator. In the meantime, we should go ahead and implement the per-cluster `PodDisruptionBudget`, filled with the value used in the managed update settings.
   
   Given the limitations, what are your thoughts on moving forward with the cluster-level PDB @joshsouza ?




[GitHub] [solr-operator] joshsouza commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by GitBox <gi...@apache.org>.
joshsouza commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1246994429

   I 100% support adding a cluster-level PDB here, as that's definitely a first step towards success.
   My concern is that the PDB will only ensure we don't take a pod down if one is already down. In the scenario where a pod has _just_ started and is coming online, recovering some large dataset for longer than the readiness probe takes to pass, it looks safe from k8s' perspective to take down another pod, but from Solr's perspective that may be a risky operation.
   
   A cluster-level PDB will at least _reduce_ the level of risk here, but to the point of your (very thorough, thank you) note above, it's a step on the path to a final solution.




[GitHub] [solr-operator] iampranabroy commented on issue #471: How to prevent node rotation behavior from causing cluster instability

Posted by "iampranabroy (via GitHub)" <gi...@apache.org>.
iampranabroy commented on issue #471:
URL: https://github.com/apache/solr-operator/issues/471#issuecomment-1523499874

   @joshsouza Please let me know if you try the new version and if it helps resolve the problem. We kind of have a similar scenario. Will give it a try ourselves soon and will share our observations as well.

