Posted to users@solr.apache.org by Vincenzo D'Amore <v....@gmail.com> on 2021/10/27 15:27:46 UTC

Solr & Kubernetes - how to configure the liveness

Hi all,

when a Solr instance is started I would like to be sure that all the indexes
present are up and running, in other words that the instance is healthy.
The healthy status (aka liveness/readiness) is especially useful when a
Kubernetes SolrCloud cluster has to be restarted for configuration
management reasons and you want to apply your change one node at a time.
AFAIK I can ping only one index at a time, but there is no out-of-the-box
way to test that a group of indexes is active (green status).
Have you ever faced the same problem? What do you think?

Best regards,
Vincenzo

-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Hi Mathieu, Timothy,

Thanks, your contributions made me realize that in my previous email the
meaning I gave to the word "liveness" (container probes) was too generic.
For the sake of clarity, there are 3 types of container probes:
livenessProbe, readinessProbe and startupProbe.
- livenessProbe: indicates whether the container is running.
- readinessProbe: indicates whether the container is ready to respond to
  requests.
- startupProbe: indicates whether the application within the container has
  started.

Given that, in my previous email I was referring to a scenario where you
are applying a configuration change to your SolrCloud cluster (i.e. all the
Solr pods have to be restarted).
There are many situations where you may need to apply a change to your
cluster that leads to a full restart: a JVM config change (memory, garbage
collectors, system properties), a Kubernetes config change (vertical scale,
env variables, logging, etc.) or a Solr config change.
In all these cases you cannot rely on /solr/admin/info/system alone as the
startup check while your instances are serving queries in production.
This can have catastrophic effects, because Kubernetes will restart all the
Solr instances within a short time.
Restarting the instances in a short time means that one, several or all of
the cores on a Solr node don't have the time to become "Active".
This happens mostly because, as Timothy said, a busy Solr pod with large
collections under active update and query traffic can take a "long" time to
come back online after a restart.

On the other hand, /api/node/health?requireHealthyCores=true fits very well
with the startupProbe in this scenario:

        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /api/node/health?requireHealthyCores=true
            port: 8983
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
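
For completeness, the companion liveness probe can stay on an endpoint that
does not depend on ZK, as Mathieu suggested. A sketch of that probe (the
threshold values here are illustrative assumptions, not solr-operator
defaults):

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /solr/admin/info/system
            port: 8983
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1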



On Fri, Nov 12, 2021 at 8:26 PM Timothy Potter <th...@gmail.com> wrote:

> oops! sent a little too soon ... also wanted to mention that if you're
> running Solr 8+, you can use /admin/info/health instead of
> /admin/info/system for the probe path (see:
> https://issues.apache.org/jira/browse/SOLR-11126), like this:
>
> livenessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
> readinessProbe:
>   httpGet:
>     path: /admin/info/health
>     port: 8983
>
>
> On Fri, Nov 12, 2021 at 11:11 AM Timothy Potter <th...@apache.org>
> wrote:
> >
> > Some things to consider ...
> >
> > If one out of many Solr cores is down on a pod, I would not want
> > Kubelet to restart my Solr pod (if liveness probe fails) or even
> > remove it from the load-balancer service (if readiness probe fails)
> > because the pod can still serve traffic for the healthy cores.
> > Requiring all cores on a pod to be healthy seems like too high of a
> > bar for K8s probes.
> >
> > Killing a busy Solr pod with large collections with active update and
> > query traffic can take a "long" time to come back online (long being
> > relative to your typical Go based microservice that can restart in
> > milliseconds, which is what these probes were designed for)
> >
> > SolrCloud has its own request routing logic based on a very up-to-date
> > cluster state that's wired into ZK watches, so Solr can be resilient
> > to downed replicas provided there is at least one per shard that is
> > healthy.
> >
> > Moreover, replicas may take time to recover and the last thing you'd
> > want is for K8s to restart a pod while a replica is close to
> > recovering and re-entering the mix as a healthy replica.
> >
> > You could maybe use the request to requireHealthyCores=true for a
> startup probe.
> >
> > For me, the liveness / readiness probes are more applicable for
> > microservices that are fast to fail and restart and you can have many
> > of them so pulling one out of the load-balancer due to a readiness
> > probe failure is usually the right answer. Moreover, with
> > microservices, you typically have a service that does one thing, but
> > Solr pods typically host multiple cores.
> >
> > Lastly, the Solr operator allows you to customize the probe endpoints,
> > see:
> spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
> > We default it to /admin/info/system for the reasons I raised above.
> >
> > Tim
> >
> > On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
> > <mm...@salesforce.com.invalid> wrote:
> > >
> > > Beware that using
> > > http://node:8983/api/node/health?requireHealthyCores=true for
> > > your liveness assumes that ZK is up and running.
> > > We are all hoping that ZK is never down, but if it happens, your Solr
> > > liveness probe will start to fail too, and K8S will restart all our
> Solr,
> > > adding instability to a cluster that is already in a bad shape.
> > >
> > > We've configured our liveness to /solr/admin/info/system too, and we
> rely
> > > on ZK liveness probe to restart ZK quickly if there is an issue.
> > > Liveness probes should never rely on a subsystem being up, else all
> your
> > > services will go down one after the other.
> > >
> > > Regards,
> > > Mathieu
>


-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Timothy Potter <th...@gmail.com>.
oops! sent a little too soon ... also wanted to mention that if you're
running Solr 8+, you can use /admin/info/health instead of
/admin/info/system for the probe path (see:
https://issues.apache.org/jira/browse/SOLR-11126), like this:

livenessProbe:
  httpGet:
    path: /admin/info/health
    port: 8983
readinessProbe:
  httpGet:
    path: /admin/info/health
    port: 8983


On Fri, Nov 12, 2021 at 11:11 AM Timothy Potter <th...@apache.org> wrote:
>
> Some things to consider ...
>
> If one out of many Solr cores is down on a pod, I would not want
> Kubelet to restart my Solr pod (if liveness probe fails) or even
> remove it from the load-balancer service (if readiness probe fails)
> because the pod can still serve traffic for the healthy cores.
> Requiring all cores on a pod to be healthy seems like too high of a
> bar for K8s probes.
>
> Killing a busy Solr pod with large collections with active update and
> query traffic can take a "long" time to come back online (long being
> relative to your typical Go based microservice that can restart in
> milliseconds, which is what these probes were designed for)
>
> SolrCloud has its own request routing logic based on a very up-to-date
> cluster state that's wired into ZK watches, so Solr can be resilient
> to downed replicas provided there is at least one per shard that is
> healthy.
>
> Moreover, replicas may take time to recover and the last thing you'd
> want is for K8s to restart a pod while a replica is close to
> recovering and re-entering the mix as a healthy replica.
>
> You could maybe use the request to requireHealthyCores=true for a startup probe.
>
> For me, the liveness / readiness probes are more applicable for
> microservices that are fast to fail and restart and you can have many
> of them so pulling one out of the load-balancer due to a readiness
> probe failure is usually the right answer. Moreover, with
> microservices, you typically have a service that does one thing, but
> Solr pods typically host multiple cores.
>
> Lastly, the Solr operator allows you to customize the probe endpoints,
> see: spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
> We default it to /admin/info/system for the reasons I raised above.
>
> Tim
>
> On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
> <mm...@salesforce.com.invalid> wrote:
> >
> > Beware that using
> > http://node:8983/api/node/health?requireHealthyCores=true for
> > your liveness assumes that ZK is up and running.
> > We are all hoping that ZK is never down, but if it happens, your Solr
> > liveness probe will start to fail too, and K8S will restart all our Solr,
> > adding instability to a cluster that is already in a bad shape.
> >
> > We've configured our liveness to /solr/admin/info/system too, and we rely
> > on ZK liveness probe to restart ZK quickly if there is an issue.
> > Liveness probes should never rely on a subsystem being up, else all your
> > services will go down one after the other.
> >
> > Regards,
> > Mathieu

Re: Solr & Kubernetes - how to configure the liveness

Posted by Timothy Potter <th...@apache.org>.
Some things to consider ...

If one out of many Solr cores is down on a pod, I would not want
Kubelet to restart my Solr pod (if liveness probe fails) or even
remove it from the load-balancer service (if readiness probe fails)
because the pod can still serve traffic for the healthy cores.
Requiring all cores on a pod to be healthy seems like too high of a
bar for K8s probes.

Killing a busy Solr pod with large collections under active update and
query traffic means it can take a "long" time to come back online ("long"
being relative to your typical Go-based microservice that can restart in
milliseconds, which is what these probes were designed for).

SolrCloud has its own request routing logic based on a very up-to-date
cluster state that's wired into ZK watches, so Solr can be resilient
to downed replicas provided there is at least one per shard that is
healthy.

Moreover, replicas may take time to recover and the last thing you'd
want is for K8s to restart a pod while a replica is close to
recovering and re-entering the mix as a healthy replica.

You could maybe use the requireHealthyCores=true request for a startup probe.

For me, the liveness / readiness probes are more applicable to
microservices that are fast to fail and restart, and of which you can have
many, so pulling one out of the load balancer due to a readiness probe
failure is usually the right answer. Moreover, with microservices you
typically have a service that does one thing, but Solr pods typically host
multiple cores.

Lastly, the Solr operator allows you to customize the probe endpoints,
see: spec.customSolrKubeOptions.podOptions.(liveness|readiness|startup)Probe.
We default it to /admin/info/system for the reasons I raised above.
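
For example, a SolrCloud manifest overriding those probes might look roughly
like the sketch below; the probe paths come from this thread, while the
apiVersion, resource name and threshold values are placeholders to verify
against the operator docs:

# Hypothetical SolrCloud fragment; adjust names and values to your deployment.
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  customSolrKubeOptions:
    podOptions:
      livenessProbe:
        httpGet:
          path: /solr/admin/info/system
          port: 8983
      startupProbe:
        httpGet:
          path: /api/node/health?requireHealthyCores=true
          port: 8983
        periodSeconds: 10
        failureThreshold: 30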

Tim

On Fri, Nov 12, 2021 at 9:15 AM Mathieu Marie
<mm...@salesforce.com.invalid> wrote:
>
> Beware that using
> http://node:8983/api/node/health?requireHealthyCores=true for
> your liveness assumes that ZK is up and running.
> We are all hoping that ZK is never down, but if it happens, your Solr
> liveness probe will start to fail too, and K8S will restart all our Solr,
> adding instability to a cluster that is already in a bad shape.
>
> We've configured our liveness to /solr/admin/info/system too, and we rely
> on ZK liveness probe to restart ZK quickly if there is an issue.
> Liveness probes should never rely on a subsystem being up, else all your
> services will go down one after the other.
>
> Regards,
> Mathieu

Re: Solr & Kubernetes - how to configure the liveness

Posted by Mathieu Marie <mm...@salesforce.com.INVALID>.
Beware that using
http://node:8983/api/node/health?requireHealthyCores=true for
your liveness probe assumes that ZK is up and running.
We all hope that ZK is never down, but if it happens, your Solr
liveness probes will start to fail too, and K8s will restart all your Solr
pods, adding instability to a cluster that is already in bad shape.

We've configured our liveness probe to /solr/admin/info/system too, and we
rely on the ZK liveness probe to restart ZK quickly if there is an issue.
Liveness probes should never rely on a subsystem being up, else all your
services will go down one after the other.
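
To illustrate (this is only a sketch; it assumes the ZK four-letter-word
commands are enabled and that a shell with nc is available in the ZK image),
the ZooKeeper pods carry their own liveness check while Solr's stays
ZK-independent:

# Sketch of a ZooKeeper-side liveness probe.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - 'echo ruok | nc -w 2 localhost 2181 | grep -q imok'
  periodSeconds: 10
  failureThreshold: 3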

Regards,
Mathieu

Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
On Fri, Nov 12, 2021 at 10:54 AM Jan Høydahl <ja...@cominvent.com> wrote:

> I agree that this is a risk. It all comes back to your initial sizing of
> the cluster.
> If you have decided for three nodes, and have HA policy of tolerating loss
> of
> any one server at a time, then you have to fully stress test your system
> with
> only two of those three nodes. If the two nodes cannot handle peak
> traffic, then
> you are fooling yourself to believe that you have fulfilled your HA policy.
> Some more crucial systems even have an N+2 HA policy, i.e. you should
> tolerate
> loss/crash of two random servers at the same time. Even more important to
> test
> the system in the failing condition! Time is also a factor here. The
> longer time it
> takes for a single node to reboot, the more likely that another node will
> crash during
> that window. So keeping the restart time low is always a bonus.
>
> It could be that if your nodes are few and large, with lots of replicas
> and lots of data,
> that it would be better to switch to a strategy with more smaller/cheaper
> nodes with
> fewer replicas each. Then the consequence of a node loss is smaller, and
> it is quicker
> to recover.
>

Your reasoning is correct but, IMHO, it is a little bit theoretical.
If we are talking about Kubernetes and how a SolrCloud cluster is deployed,
the problem is still about the liveness configuration.
May I add a few things:
- assuming that we have N Solr instances with replication factor N for each
core, if the liveness probe is not configured strictly enough, Kubernetes
can restart all N instances in short order.
In other words, with a lax liveness configuration, if N is not big enough,
you can end up with every instance having one or more cores that are not
ready.
- I would also add that many customers don't have the money, the time or
the resources to run a bigger cluster or to implement such a thorough HA
policy, so we need some give-and-take arrangements.
On the other hand, being strict on the health check and having good
monitoring can do the trick.


>
> I think and hope that the current liveliness logic in solr-operator is
> robust.
>
>
These days I have been digging into how the Solr instance liveness is
configured by the solr-operator.
After installing the example with 3 nodes, I see the liveness probe is based
on /solr/admin/info/system,
which is rather unhelpful, if what I have said above is right.

This is the example I have used:
https://apache.github.io/solr-operator/docs/running-the-operator



> Jan
>
> > On 12 Nov 2021, at 10:43, Vincenzo D'Amore <v....@gmail.com> wrote:
> >
> > Hi Jan,
> >
> > I agree, if liveness is not configured correctly we could end up in an
> > endless loop and the node never be healthy.
> > Please consider another scenario, a common case where there are at least
> 3
> > solr instances in production 24/7 high availability with a situation of
> > index light/heavy and query-heavy.
> > When we have to restart a solr instance, for whatever reason, the number
> of
> > seconds or minutes that we have to wait until all the cores come up could
> > be pretty high.
> > If we don't configure the liveness right kubernetes can restart the next
> > instance but the former is still recovering, coming up or whatever but it
> > is not ready.
> > So, for example, when we have to apply a change to solr config on all the
> > solr instances we really can't shutdown more than one of them.
> > When restarted we must wait for the full availability of the instance and
> > in the meanwhile the two remaining instances must have all the cores up
> and
> > running.
> > In other words, when you restart a solr instance, an increase of load on
> > the remaining instances usually slows the overall performance but, if
> > done badly it can bring the cluster down.
> >
> >
> > On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <ja...@cominvent.com>
> wrote:
> >
> >> If recovery failed, then that core is dead, it has given up.
> >> So if an agent has just restarted or started a node, then it will wait
> >> until all cores have a "stable" or "final" state, before it declares the
> >> NODE as healthy, and consider restarting other nodes.
> >> If a core (replica of a shard in a collection) is in DOWN state, it has
> >> just booted and will soon go into RECOVERING. It will stay in RECOVERING
> >> until it either is OK or RECOVERY_FAILED.
> >> There is no point in waiting in an endless loop for every single core
> on a
> >> node to come up, we just want them to finish initializing and enter a
> >> stable state.
> >> I guess other logic in solr-operator will take care of deciding how many
> >> replicas for a shard are live, as to whether it is safe to take down the
> >> next pod/node.
> >>
> >> Jan
> >>
> >>> On 31 Oct 2021, at 16:14, 戴晓彬 <xi...@foxmail.com> wrote:
> >>>
> >>> I'm a little puzzled, why UNHEALTHY_STATES does not contain
> >> State.RECOVERY_FAILED
> >>>
> >>>> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
> >>>>
> >>>> See
> >>
> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers
> ,
> >> you can query each node with
> >>>>
> >>>> http://node:8983/api/node/health?requireHealthyCores=true
> >>>>
> >>>> It will only return HTTP 200 if all active cores on the node are
> >> healthy (none starting or recovering).
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> when a Solr instance is started I would be sure all the indexes
> >> present are
> >>>>> up and running, in other words that the instance is healthy.
> >>>>> The healthy status (aka liveness/readiness) is especially useful
> when a
> >>>>> Kubernetes SolrCloud cluster has to be restarted for any
> configuration
> >>>>> management needs and you want to apply your change one node at a
> time.
> >>>>> AFAIK I can ping only one index at a time, but there is no way out of
> >> the
> >>>>> box to test that a bunch of indexes are active (green status).
> >>>>> Have you ever faced the same problem? What do you think?
> >>>>>
> >>>>> Best regards,
> >>>>> Vincenzo
> >>>>>
> >>>>> --
> >>>>> Vincenzo D'Amore
> >>>>
> >>>
> >>
> >>
> >
> > --
> > Vincenzo D'Amore
>
>

-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Jan Høydahl <ja...@cominvent.com>.
I agree that this is a risk. It all comes back to your initial sizing of the cluster.
If you have decided on three nodes, and have an HA policy of tolerating the
loss of any one server at a time, then you have to fully stress test your
system with
only two of those three nodes. If the two nodes cannot handle peak traffic, then
you are fooling yourself to believe that you have fulfilled your HA policy.
Some more crucial systems even have an N+2 HA policy, i.e. you should tolerate
loss/crash of two random servers at the same time. Even more important to test
the system in the failing condition! Time is also a factor here. The longer time it
takes for a single node to reboot, the more likely that another node will crash during
that window. So keeping the restart time low is always a bonus.

It could be that if your nodes are few and large, with lots of replicas and lots of data,
that it would be better to switch to a strategy with more smaller/cheaper nodes with 
fewer replicas each. Then the consequence of a node loss is smaller, and it is quicker
to recover.

I think and hope that the current liveness logic in solr-operator is robust.

Jan

> On 12 Nov 2021, at 10:43, Vincenzo D'Amore <v....@gmail.com> wrote:
> 
> Hi Jan,
> 
> I agree, if liveness is not configured correctly we could end up in an
> endless loop and the node never be healthy.
> Please consider another scenario, a common case where there are at least 3
> solr instances in production 24/7 high availability with a situation of
> index light/heavy and query-heavy.
> When we have to restart a solr instance, for whatever reason, the number of
> seconds or minutes that we have to wait until all the cores come up could
> be pretty high.
> If we don't configure the liveness right kubernetes can restart the next
> instance but the former is still recovering, coming up or whatever but it
> is not ready.
> So, for example, when we have to apply a change to solr config on all the
> solr instances we really can't shutdown more than one of them.
> When restarted we must wait for the full availability of the instance and
> in the meanwhile the two remaining instances must have all the cores up and
> running.
> In other words, when you restart a solr instance, an increase of load on
> the remaining instances usually slows the overall performance but, if
> done badly it can bring the cluster down.
> 
> 
> On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <ja...@cominvent.com> wrote:
> 
>> If recovery failed, then that core is dead, it has given up.
>> So if an agent has just restarted or started a node, then it will wait
>> until all cores have a "stable" or "final" state, before it declares the
>> NODE as healthy, and consider restarting other nodes.
>> If a core (replica of a shard in a collection) is in DOWN state, it has
>> just booted and will soon go into RECOVERING. It will stay in RECOVERING
>> until it either is OK or RECOVERY_FAILED.
>> There is no point in waiting in an endless loop for every single core on a
>> node to come up, we just want them to finish initializing and enter a
>> stable state.
>> I guess other logic in solr-operator will take care of deciding how many
>> replicas for a shard are live, as to whether it is safe to take down the
>> next pod/node.
>> 
>> Jan
>> 
>>> On 31 Oct 2021, at 16:14, 戴晓彬 <xi...@foxmail.com> wrote:
>>> 
>>> I'm a little puzzled, why UNHEALTHY_STATES does not contain
>> State.RECOVERY_FAILED
>>> 
>>>> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
>>>> 
>>>> See
>> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers,
>> you can query each node with
>>>> 
>>>> http://node:8983/api/node/health?requireHealthyCores=true
>>>> 
>>>> It will only return HTTP 200 if all active cores on the node are
>> healthy (none starting or recovering).
>>>> 
>>>> Jan
>>>> 
>>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> when a Solr instance is started I would be sure all the indexes
>> present are
>>>>> up and running, in other words that the instance is healthy.
>>>>> The healthy status (aka liveness/readiness) is especially useful when a
>>>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>>>> management needs and you want to apply your change one node at a time.
>>>>> AFAIK I can ping only one index at a time, but there is no way out of
>> the
>>>>> box to test that a bunch of indexes are active (green status).
>>>>> Have you ever faced the same problem? What do you think?
>>>>> 
>>>>> Best regards,
>>>>> Vincenzo
>>>>> 
>>>>> --
>>>>> Vincenzo D'Amore
>>>> 
>>> 
>> 
>> 
> 
> -- 
> Vincenzo D'Amore


Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Hi Jan,

I agree, if the liveness probe is not configured correctly we could end up
in an endless loop and the node might never become healthy.
Please consider another scenario, a common case where there are at least 3
Solr instances in production with 24/7 high availability, light-to-heavy
indexing and heavy query traffic.
When we have to restart a Solr instance, for whatever reason, the number of
seconds or minutes that we have to wait until all the cores come up can
be pretty high.
If we don't configure the liveness probe right, Kubernetes can restart the
next instance while the former is still recovering or coming up, i.e. not
yet ready.
So, for example, when we have to apply a change to the Solr config on all
the Solr instances, we really can't shut down more than one of them.
When one is restarted we must wait for the full availability of that
instance, and in the meanwhile the two remaining instances must have all
their cores up and running.
In other words, when you restart a Solr instance, the increased load on
the remaining instances usually slows the overall performance but, if
done badly, it can bring the cluster down.


On Mon, Nov 1, 2021 at 5:10 PM Jan Høydahl <ja...@cominvent.com> wrote:

> If recovery failed, then that core is dead, it has given up.
> So if an agent has just restarted or started a node, then it will wait
> until all cores have a "stable" or "final" state, before it declares the
> NODE as healthy, and consider restarting other nodes.
> If a core (replica of a shard in a collection) is in DOWN state, it has
> just booted and will soon go into RECOVERING. It will stay in RECOVERING
> until it either is OK or RECOVERY_FAILED.
> There is no point in waiting in an endless loop for every single core on a
> node to come up, we just want them to finish initializing and enter a
> stable state.
> I guess other logic in solr-operator will take care of deciding how many
> replicas for a shard are live, as to whether it is safe to take down the
> next pod/node.
>
> Jan
>
> > On 31 Oct 2021, at 16:14, 戴晓彬 <xi...@foxmail.com> wrote:
> >
> > I'm a little puzzled, why UNHEALTHY_STATES does not contain
> State.RECOVERY_FAILED
> >
> >> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
> >>
> >> See
> https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers,
> you can query each node with
> >>
> >> http://node:8983/api/node/health?requireHealthyCores=true
> >>
> >> It will only return HTTP 200 if all active cores on the node are
> healthy (none starting or recovering).
> >>
> >> Jan
> >>
> >>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> when a Solr instance is started I would be sure all the indexes
> present are
> >>> up and running, in other words that the instance is healthy.
> >>> The healthy status (aka liveness/readiness) is especially useful when a
> >>> Kubernetes SolrCloud cluster has to be restarted for any configuration
> >>> management needs and you want to apply your change one node at a time.
> >>> AFAIK I can ping only one index at a time, but there is no way out of
> the
> >>> box to test that a bunch of indexes are active (green status).
> >>> Have you ever faced the same problem? What do you think?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>
> >
>
>

-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by 戴晓彬 <xi...@foxmail.com>.
Thanks, Jan, this is helpful to me.
I thought about it for a long time, but I finally figured it out.

> On 2 Nov 2021, at 00:03, Jan Høydahl <ja...@cominvent.com> wrote:
> 
> If recovery failed, then that core is dead, it has given up.
> So if an agent has just restarted or started a node, then it will wait until all cores have a "stable" or "final" state, before it declares the NODE as healthy, and consider restarting other nodes.
> If a core (replica of a shard in a collection) is in DOWN state, it has just booted and will soon go into RECOVERING. It will stay in RECOVERING until it either is OK or RECOVERY_FAILED.
> There is no point in waiting in an endless loop for every single core on a node to come up, we just want them to finish initializing and enter a stable state.
> I guess other logic in solr-operator will take care of deciding how many replicas for a shard are live, as to whether it is safe to take down the next pod/node.
> 
> Jan
> 
>> On 31 Oct 2021, at 16:14, 戴晓彬 <xi...@foxmail.com> wrote:
>> 
>> I'm a little puzzled, why UNHEALTHY_STATES does not contain State.RECOVERY_FAILED
>> 
>>> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
>>> 
>>> See https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers, you can query each node with 
>>> 
>>> http://node:8983/api/node/health?requireHealthyCores=true
>>> 
>>> It will only return HTTP 200 if all active cores on the node are healthy (none starting or recovering).
>>> 
>>> Jan
>>> 
>>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> when a Solr instance is started I would be sure all the indexes present are
>>>> up and running, in other words that the instance is healthy.
>>>> The healthy status (aka liveness/readiness) is especially useful when a
>>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>>> management needs and you want to apply your change one node at a time.
>>>> AFAIK I can ping only one index at a time, but there is no way out of the
>>>> box to test that a bunch of indexes are active (green status).
>>>> Have you ever faced the same problem? What do you think?
>>>> 
>>>> Best regards,
>>>> Vincenzo
>>>> 
>>>> -- 
>>>> Vincenzo D'Amore
>>> 
>> 
> 


Re: Solr & Kubernetes - how to configure the liveness

Posted by Jan Høydahl <ja...@cominvent.com>.
If recovery failed, then that core is dead; it has given up.
So if an agent has just restarted or started a node, it will wait until all cores have reached a "stable" or "final" state before it declares the NODE healthy and considers restarting other nodes.
If a core (a replica of a shard in a collection) is in the DOWN state, it has just booted and will soon go into RECOVERING. It will stay in RECOVERING until it is either OK or RECOVERY_FAILED.
There is no point in waiting in an endless loop for every single core on a node to come up; we just want them to finish initializing and enter a stable state.
I guess other logic in the solr-operator will take care of deciding how many replicas of a shard are live, and whether it is safe to take down the next pod/node.

Jan

> On 31 Oct 2021, at 16:14, 戴晓彬 <xi...@foxmail.com> wrote:
> 
> I'm a little puzzled, why UNHEALTHY_STATES does not contain State.RECOVERY_FAILED
> 
>> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
>> 
>> See https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers, you can query each node with 
>> 
>> http://node:8983/api/node/health?requireHealthyCores=true
>> 
>> It will only return HTTP 200 if all active cores on the node are healthy (none starting or recovering).
>> 
>> Jan
>> 
>>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> when a Solr instance is started I would be sure all the indexes present are
>>> up and running, in other words that the instance is healthy.
>>> The healthy status (aka liveness/readiness) is especially useful when a
>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>> management needs and you want to apply your change one node at a time.
>>> AFAIK I can ping only one index at a time, but there is no way out of the
>>> box to test that a bunch of indexes are active (green status).
>>> Have you ever faced the same problem? What do you think?
>>> 
>>> Best regards,
>>> Vincenzo
>>> 
>>> -- 
>>> Vincenzo D'Amore
>> 
> 


Re: Solr & Kubernetes - how to configure the liveness

Posted by 戴晓彬 <xi...@foxmail.com>.
I'm a little puzzled: why doesn't UNHEALTHY_STATES contain State.RECOVERY_FAILED?

> On 31 Oct 2021, at 22:45, Jan Høydahl <ja...@cominvent.com> wrote:
> 
> See https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers, you can query each node with 
> 
> http://node:8983/api/node/health?requireHealthyCores=true
> 
> It will only return HTTP 200 if all active cores on the node are healthy (none starting or recovering).
> 
> Jan
> 
>> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> when a Solr instance is started I would be sure all the indexes present are
>> up and running, in other words that the instance is healthy.
>> The healthy status (aka liveness/readiness) is especially useful when a
>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>> management needs and you want to apply your change one node at a time.
>> AFAIK I can ping only one index at a time, but there is no way out of the
>> box to test that a bunch of indexes are active (green status).
>> Have you ever faced the same problem? What do you think?
>> 
>> Best regards,
>> Vincenzo
>> 
>> -- 
>> Vincenzo D'Amore
> 


Re: Solr & Kubernetes - how to configure the liveness

Posted by Jan Høydahl <ja...@cominvent.com>.
See https://solr.apache.org/guide/8_10/implicit-requesthandlers.html#admin-handlers; you can query each node with

http://node:8983/api/node/health?requireHealthyCores=true

It will only return HTTP 200 if all active cores on the node are healthy (none starting or recovering).

Jan

> On 27 Oct 2021, at 17:27, Vincenzo D'Amore <v....@gmail.com> wrote:
> 
> Hi all,
> 
> when a Solr instance is started I would be sure all the indexes present are
> up and running, in other words that the instance is healthy.
> The healthy status (aka liveness/readiness) is especially useful when a
> Kubernetes SolrCloud cluster has to be restarted for any configuration
> management needs and you want to apply your change one node at a time.
> AFAIK I can ping only one index at a time, but there is no way out of the
> box to test that a bunch of indexes are active (green status).
> Have you ever faced the same problem? What do you think?
> 
> Best regards,
> Vincenzo
> 
> -- 
> Vincenzo D'Amore


Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Right.

curl -s "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" \
  | grep -q '"initFailures":{}'
retVal=$?

retVal will be 0 if everything is ok. For the moment this should do the
trick.
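
If needed, the same check could be wired into the pod spec as an exec probe.
A rough sketch (it assumes curl and grep are available in the Solr container
image, and the thresholds are only illustrative):

# Sketch of a startupProbe built on the cores STATUS check above.
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >-
        curl -s "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
        | grep -q '"initFailures":{}'
  periodSeconds: 10
  failureThreshold: 30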



On Wed, Oct 27, 2021 at 7:06 PM Robert Pearce <rp...@gmail.com> wrote:

> I think it will remain as 200 - it is returning the status of the cores.
> If the call itself fails then of course the HTTP status would reflect that.
>
> I think the Solr Admin UI uses this call on one of the cloud pages.
>
> Rob
>
> > On 27 Oct 2021, at 17:29, Vincenzo D'Amore <v....@gmail.com> wrote:
> >
> > HI Rob, thanks for your help.
> > Do you know if in case of failure (initFailures not empty)
> > /solr/admin/cores changes the http status code of the response in 500 (or
> > everything that is not 200) ?
> >
> >> On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:
> >>
> >> Take a look at the cores REST API, something like
> >>
> >> http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
> >>
> >> Any failed cores will be in ‘initFailures’; cores which started will be
> >> under “status”
> >>
> >> Rob
> >>
> >>>> On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com>
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> when a Solr instance is started I would be sure all the indexes present
> >> are
> >>> up and running, in other words that the instance is healthy.
> >>> The healthy status (aka liveness/readiness) is especially useful when a
> >>> Kubernetes SolrCloud cluster has to be restarted for any configuration
> >>> management needs and you want to apply your change one node at a time.
> >>> AFAIK I can ping only one index at a time, but there is no way out of
> the
> >>> box to test that a bunch of indexes are active (green status).
> >>> Have you ever faced the same problem? What do you think?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>
> >
> >
> > --
> > Vincenzo D'Amore
>


-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Robert Pearce <rp...@gmail.com>.
I think it will remain as 200 - it is returning the status of the cores. If the call itself fails then of course the HTTP status would reflect that.

I think the Solr Admin UI uses this call on one of the cloud pages.

Rob

> On 27 Oct 2021, at 17:29, Vincenzo D'Amore <v....@gmail.com> wrote:
> 
> HI Rob, thanks for your help.
> Do you know if in case of failure (initFailures not empty)
> /solr/admin/cores changes the http status code of the response in 500 (or
> everything that is not 200) ?
> 
>> On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:
>> 
>> Take a look at the cores REST API, something like
>> 
>> http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
>> 
>> Any failed cores will be in ‘initFailures’; cores which started will be
>> under “status”
>> 
>> Rob
>> 
>>>> On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> when a Solr instance is started I would be sure all the indexes present
>> are
>>> up and running, in other words that the instance is healthy.
>>> The healthy status (aka liveness/readiness) is especially useful when a
>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>> management needs and you want to apply your change one node at a time.
>>> AFAIK I can ping only one index at a time, but there is no way out of the
>>> box to test that a bunch of indexes are active (green status).
>>> Have you ever faced the same problem? What do you think?
>>> 
>>> Best regards,
>>> Vincenzo
>>> 
>>> --
>>> Vincenzo D'Amore
>> 
> 
> 
> -- 
> Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Thanks, this is really interesting.

> 
> On 27 Oct 2021, at 18:36, Houston Putman <ho...@apache.org> wrote:
> 
> Vincenzo,
> 
> If you use the Solr Operator <https://solr.apache.org/operator/>, it will
> manage the upgrades for you in a safe manner (waiting for x number of
> replicas to be healthy before moving onto the next node).
> 
> Hopefully the following documentation pages will help:
> 
>   - CRD Options for Update Strategy
>   <https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy>
>   - Managed Update Logic
>   <https://apache.github.io/solr-operator/docs/solr-cloud/managed-updates.html>
> 
> You can configure it so that it will upgrade at most 1 Solr Node at a time,
> and only have 1 replica of each shard unhealthy at any given time.
> 
> - Houston
> 
>> On Wed, Oct 27, 2021 at 12:29 PM Vincenzo D'Amore <v....@gmail.com>
>> wrote:
>> 
>> HI Rob, thanks for your help.
>> Do you know if in case of failure (initFailures not empty)
>> /solr/admin/cores changes the http status code of the response in 500 (or
>> everything that is not 200) ?
>> 
>>> On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:
>>> 
>>> Take a look at the cores REST API, something like
>>> 
>>> http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
>>> 
>>> Any failed cores will be in ‘initFailures’; cores which started will be
>>> under “status”
>>> 
>>> Rob
>>> 
>>>> On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> when a Solr instance is started I would be sure all the indexes present
>>> are
>>>> up and running, in other words that the instance is healthy.
>>>> The healthy status (aka liveness/readiness) is especially useful when a
>>>> Kubernetes SolrCloud cluster has to be restarted for any configuration
>>>> management needs and you want to apply your change one node at a time.
>>>> AFAIK I can ping only one index at a time, but there is no way out of
>> the
>>>> box to test that a bunch of indexes are active (green status).
>>>> Have you ever faced the same problem? What do you think?
>>>> 
>>>> Best regards,
>>>> Vincenzo
>>>> 
>>>> --
>>>> Vincenzo D'Amore
>>> 
>> 
>> 
>> --
>> Vincenzo D'Amore
>> 

Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Hi, thanks, this is really helpful. I'll have a look.

On Wed, Oct 27, 2021 at 6:36 PM Houston Putman <ho...@apache.org> wrote:

> Vincenzo,
>
> If you use the Solr Operator <https://solr.apache.org/operator/>, it will
> manage the upgrades for you in a safe manner (waiting for x number of
> replicas to be healthy before moving onto the next node).
>
> Hopefully the following documentation pages will help:
>
>    - CRD Options for Update Strategy
>    <
> https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy
> >
>    - Managed Update Logic
>    <
> https://apache.github.io/solr-operator/docs/solr-cloud/managed-updates.html
> >
>
> You can configure it so that it will upgrade at most 1 Solr Node at a time,
> and only have 1 replica of each shard unhealthy at any given time.
>
> - Houston
>
> On Wed, Oct 27, 2021 at 12:29 PM Vincenzo D'Amore <v....@gmail.com>
> wrote:
>
> > HI Rob, thanks for your help.
> > Do you know if in case of failure (initFailures not empty)
> > /solr/admin/cores changes the http status code of the response in 500 (or
> > everything that is not 200) ?
> >
> > On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:
> >
> > > Take a look at the cores REST API, something like
> > >
> > > http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
> > >
> > > Any failed cores will be in ‘initFailures’; cores which started will be
> > > under “status”
> > >
> > > Rob
> > >
> > > > On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com>
> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > when a Solr instance is started I would be sure all the indexes
> present
> > > are
> > > > up and running, in other words that the instance is healthy.
> > > > The healthy status (aka liveness/readiness) is especially useful
> when a
> > > > Kubernetes SolrCloud cluster has to be restarted for any
> configuration
> > > > management needs and you want to apply your change one node at a
> time.
> > > > AFAIK I can ping only one index at a time, but there is no way out of
> > the
> > > > box to test that a bunch of indexes are active (green status).
> > > > Have you ever faced the same problem? What do you think?
> > > >
> > > > Best regards,
> > > > Vincenzo
> > > >
> > > > --
> > > > Vincenzo D'Amore
> > >
> >
> >
> > --
> > Vincenzo D'Amore
> >
>


-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Houston Putman <ho...@apache.org>.
Vincenzo,

If you use the Solr Operator <https://solr.apache.org/operator/>, it will
manage the upgrades for you in a safe manner (waiting for x number of
replicas to be healthy before moving onto the next node).

Hopefully the following documentation pages will help:

   - CRD Options for Update Strategy
   <https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy>
   - Managed Update Logic
   <https://apache.github.io/solr-operator/docs/solr-cloud/managed-updates.html>

You can configure it so that it will upgrade at most 1 Solr Node at a time,
and only have 1 replica of each shard unhealthy at any given time.
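
For example (a sketch only; double-check the exact field names against the
CRD docs above), the managed update strategy can be constrained like this:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  updateStrategy:
    method: Managed
    managed:
      # At most one pod restarting, and at most one replica of any shard
      # unavailable, at any given time.
      maxPodsUnavailable: 1
      maxShardReplicasUnavailable: 1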

- Houston

On Wed, Oct 27, 2021 at 12:29 PM Vincenzo D'Amore <v....@gmail.com>
wrote:

> HI Rob, thanks for your help.
> Do you know if in case of failure (initFailures not empty)
> /solr/admin/cores changes the http status code of the response in 500 (or
> everything that is not 200) ?
>
> On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:
>
> > Take a look at the cores REST API, something like
> >
> > http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
> >
> > Any failed cores will be in ‘initFailures’; cores which started will be
> > under “status”
> >
> > Rob
> >
> > > On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > when a Solr instance is started I would be sure all the indexes present
> > are
> > > up and running, in other words that the instance is healthy.
> > > The healthy status (aka liveness/readiness) is especially useful when a
> > > Kubernetes SolrCloud cluster has to be restarted for any configuration
> > > management needs and you want to apply your change one node at a time.
> > > AFAIK I can ping only one index at a time, but there is no way out of
> the
> > > box to test that a bunch of indexes are active (green status).
> > > Have you ever faced the same problem? What do you think?
> > >
> > > Best regards,
> > > Vincenzo
> > >
> > > --
> > > Vincenzo D'Amore
> >
>
>
> --
> Vincenzo D'Amore
>

Re: Solr & Kubernetes - how to configure the liveness

Posted by Vincenzo D'Amore <v....@gmail.com>.
Hi Rob, thanks for your help.
Do you know whether, in case of failure (initFailures not empty),
/solr/admin/cores changes the HTTP status code of the response to 500 (or
anything that is not 200)?

On Wed, Oct 27, 2021 at 6:13 PM Robert Pearce <rp...@gmail.com> wrote:

> Take a look at the cores REST API, something like
>
> http://localhost:8983/solr/admin/cores?action=STATUS&wt=json
>
> Any failed cores will be in ‘initFailures’; cores which started will be
> under “status”
>
> Rob
>
> > On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com> wrote:
> >
> > Hi all,
> >
> > when a Solr instance is started I would be sure all the indexes present
> are
> > up and running, in other words that the instance is healthy.
> > The healthy status (aka liveness/readiness) is especially useful when a
> > Kubernetes SolrCloud cluster has to be restarted for any configuration
> > management needs and you want to apply your change one node at a time.
> > AFAIK I can ping only one index at a time, but there is no way out of the
> > box to test that a bunch of indexes are active (green status).
> > Have you ever faced the same problem? What do you think?
> >
> > Best regards,
> > Vincenzo
> >
> > --
> > Vincenzo D'Amore
>


-- 
Vincenzo D'Amore

Re: Solr & Kubernetes - how to configure the liveness

Posted by Robert Pearce <rp...@gmail.com>.
Take a look at the cores REST API, something like

http://localhost:8983/solr/admin/cores?action=STATUS&wt=json

Any failed cores will be in ‘initFailures’; cores which started will be under “status”

Rob

> On 27 Oct 2021, at 16:28, Vincenzo D'Amore <v....@gmail.com> wrote:
> 
> Hi all,
> 
> when a Solr instance is started I would be sure all the indexes present are
> up and running, in other words that the instance is healthy.
> The healthy status (aka liveness/readiness) is especially useful when a
> Kubernetes SolrCloud cluster has to be restarted for any configuration
> management needs and you want to apply your change one node at a time.
> AFAIK I can ping only one index at a time, but there is no way out of the
> box to test that a bunch of indexes are active (green status).
> Have you ever faced the same problem? What do you think?
> 
> Best regards,
> Vincenzo
> 
> -- 
> Vincenzo D'Amore