Posted to users@solr.apache.org by Joel Bernstein <jo...@gmail.com> on 2021/10/12 22:44:59 UTC

Rolling restarts and the Solr Operator

Hi,

I saw that the Solr Operator takes collection topology into account when
performing rolling restarts. I'm wondering how this behaves when there is
one SolrCloud object per shard. In that case the Solr Operator would receive
a different custom resource (CR) for each shard, which would kick off the
rolling restarts in parallel. Would the operator understand that it is
operating on a single shard in each CR, and not get tangled up in the larger
cluster state?

Thanks,
Joel

Re: Rolling restarts and the Solr Operator

Posted by Joel Bernstein <jo...@gmail.com>.

Thanks Houston,

You are right, the main motivation for running one SolrCloud per shard is
autoscaling.

Here is the issue I created:
https://github.com/apache/solr-operator/issues/348



Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Oct 14, 2021 at 2:09 PM Houston Putman <ho...@gmail.com>
wrote:

> Ok, I found why this is happening:
>
>
> https://github.com/apache/solr-operator/blob/v0.4.0/controllers/util/solr_update_util.go#L185
>
> Basically, we assume that the number of nodes in the StatefulSet is the
> same as the number of nodes in the cluster state.
> We should remove this check and just make sure that all of the nodes we
> care about are in the cluster state live nodes.
> That would solve this.
>
> Do you mind creating a GitHub issue? This should be an easy fix to make
> this paradigm "supported" in v0.5.0.
>
> Also it would be great to allow the SolrCloud to be split into multiple
> StatefulSets in v0.6.0 (or sometime in the future), so that you don't have
> to manage multiple SolrCloud resources independently.
>
> - Houston

Re: Rolling restarts and the Solr Operator

Posted by Houston Putman <ho...@gmail.com>.

Ok, I found why this is happening:

https://github.com/apache/solr-operator/blob/v0.4.0/controllers/util/solr_update_util.go#L185

Basically, we assume that the number of nodes in the StatefulSet is the
same as the number of nodes in the cluster state.
We should remove this check and just make sure that all of the nodes we
care about are in the cluster state live nodes.
That would solve this.
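
To illustrate the difference between the two checks, here is a minimal Go sketch. It is not the operator's actual code: the function names, the node-name strings, and the slice-based inputs are assumptions made up for this example. With one SolrCloud resource per shard, the cluster state lists the nodes of every shard, so a count comparison fails even though all of the nodes this resource manages are live.

```go
package main

import "fmt"

// countsMatch mirrors the assumption described above: the cloud is treated as
// ready only when the StatefulSet has exactly as many nodes as the cluster
// state reports live. Hypothetical helper, not the operator's real function.
func countsMatch(managedNodes, liveNodes []string) bool {
	return len(managedNodes) == len(liveNodes)
}

// allManagedNodesLive is the proposed alternative: every node managed by this
// SolrCloud resource must be live; nodes owned by other resources are ignored.
func allManagedNodesLive(managedNodes, liveNodes []string) bool {
	live := make(map[string]struct{}, len(liveNodes))
	for _, n := range liveNodes {
		live[n] = struct{}{}
	}
	for _, n := range managedNodes {
		if _, ok := live[n]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// One SolrCloud resource manages shard1; the cluster also contains shard2.
	managed := []string{"shard1-0:8983_solr", "shard1-1:8983_solr"}
	live := []string{
		"shard1-0:8983_solr", "shard1-1:8983_solr",
		"shard2-0:8983_solr", "shard2-1:8983_solr",
	}
	fmt.Println(countsMatch(managed, live))         // false
	fmt.Println(allManagedNodesLive(managed, live)) // true
}
```

The subset check only asks whether the nodes we care about are live, so it gives the same answer whether the cloud is managed by one SolrCloud resource or by several.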

Do you mind creating a GitHub issue? This should be an easy fix to make
this paradigm "supported" in v0.5.0.

Also it would be great to allow the SolrCloud to be split into multiple
StatefulSets in v0.6.0 (or sometime in the future), so that you don't have
to manage multiple SolrCloud resources independently.

- Houston

On Thu, Oct 14, 2021 at 2:05 PM Houston Putman <ho...@gmail.com>
wrote:

> So this is interesting.
>
> I'm assuming that you are running one SolrCloud resource per shard, so that
> you can set system properties separately for autoscaling purposes.
> The Solr Operator assumes that each cloud it manages is independent.
> However, the rolling restart process really just kills as many pods as
> possible until the cluster state would become too unhealthy to kill more
> (the threshold is configurable).
>
> In theory it should be fine to do a rolling restart at the same time on
> each SolrCloud resource.
> This is especially true because no two SolrCloud resources share a shard,
> so their restarts should not affect each other.
> (Actually, you have devised the only truly safe way of simultaneously
> upgrading multiple SolrCloud resources that together form one large cloud.)
>
> The only overlap in logic between the SolrCloud resources is the overseer.
> The logic in the Solr Operator is to restart the overseer last, and wait
> for all nodes to be live and the cluster state to be healthy before killing
> it.
>
> Are you seeing that all other node upgrades have succeeded, and the
> cluster is healthy, but the overseer is still not upgraded?

Re: Rolling restarts and the Solr Operator

Posted by Houston Putman <ho...@gmail.com>.

So this is interesting.

I'm assuming that you are running one SolrCloud resource per shard, so that
you can set system properties separately for autoscaling purposes.
The Solr Operator assumes that each cloud it manages is independent.
However, the rolling restart process really just kills as many pods as
possible until the cluster state would become too unhealthy to kill more
(the threshold is configurable).
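
A rough Go sketch of that "kill as many as is safe" idea follows. It is only an illustration under stated assumptions: the `candidate` type, the `pickPodsToKill` function, and the per-shard limit `maxDownPerShard` are hypothetical names invented here, and the operator's real selection logic and option names are not described in this thread.

```go
package main

import "fmt"

// candidate is a hypothetical view of one out-of-date pod and the shards for
// which it hosts replicas.
type candidate struct {
	Name   string
	Shards []string
}

// pickPodsToKill greedily selects pods to restart, skipping any pod whose
// restart would push a shard above maxDownPerShard unavailable replicas.
// alreadyDown counts replicas that are currently unavailable per shard.
func pickPodsToKill(cands []candidate, alreadyDown map[string]int, maxDownPerShard int) []string {
	// Copy the counts so the caller's map is not modified.
	down := make(map[string]int, len(alreadyDown))
	for shard, n := range alreadyDown {
		down[shard] = n
	}
	var picked []string
	for _, c := range cands {
		safe := true
		for _, shard := range c.Shards {
			if down[shard]+1 > maxDownPerShard {
				safe = false
				break
			}
		}
		if !safe {
			continue
		}
		for _, shard := range c.Shards {
			down[shard]++
		}
		picked = append(picked, c.Name)
	}
	return picked
}

func main() {
	cands := []candidate{
		{Name: "shard1-0", Shards: []string{"shard1"}},
		{Name: "shard1-1", Shards: []string{"shard1"}},
		{Name: "shard2-0", Shards: []string{"shard2"}},
	}
	// At most one replica per shard may be down at a time.
	fmt.Println(pickPodsToKill(cands, map[string]int{}, 1)) // [shard1-0 shard2-0]
}
```

Because the limit is tracked per shard, pods belonging to different shards never compete for the same budget, which is the property that makes parallel restarts across per-shard resources look safe in principle.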

In theory it should be fine to do a rolling restart at the same time on
each SolrCloud resource.
This is especially true because no two SolrCloud resources share a shard,
so their restarts should not affect each other.
(Actually, you have devised the only truly safe way of simultaneously
upgrading multiple SolrCloud resources that together form one large cloud.)

The only overlap in logic between the SolrCloud resources is the overseer.
The logic in the Solr Operator is to restart the overseer last, and wait
for all nodes to be live and the cluster state to be healthy before killing
it.
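
The overseer ordering can be sketched in Go as below. This is a simplified illustration, not the operator's implementation: the `pod` type, its fields, and `nextPodsToRestart` are hypothetical, and the sketch leaves out any limit on how many pods may be down at once.

```go
package main

import "fmt"

// pod is a hypothetical summary of one Solr pod in a SolrCloud resource.
type pod struct {
	Name       string
	OutOfDate  bool
	IsOverseer bool
}

// nextPodsToRestart returns the out-of-date pods that may be restarted now.
// The overseer is held back until every other pod is up to date, and is then
// restarted only if all nodes are live and the cluster state is healthy.
func nextPodsToRestart(pods []pod, allNodesLive, clusterHealthy bool) []string {
	var next []string
	var overseer *pod
	for i := range pods {
		p := &pods[i]
		if !p.OutOfDate {
			continue
		}
		if p.IsOverseer {
			overseer = p
			continue
		}
		next = append(next, p.Name)
	}
	if len(next) == 0 && overseer != nil && allNodesLive && clusterHealthy {
		next = append(next, overseer.Name)
	}
	return next
}

func main() {
	pods := []pod{
		{Name: "solr-0", OutOfDate: true},
		{Name: "solr-1", OutOfDate: true, IsOverseer: true},
	}
	fmt.Println(nextPodsToRestart(pods, true, true)) // [solr-0]

	pods[0].OutOfDate = false                         // solr-0 has been upgraded
	fmt.Println(nextPodsToRestart(pods, true, true))  // [solr-1]
	fmt.Println(nextPodsToRestart(pods, false, true)) // []
}
```

In this sketch, if the "all nodes live" condition never evaluates to true, the overseer is never selected, even after every other pod has been upgraded.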

Are you seeing that all other node upgrades have succeeded, and the cluster
is healthy, but the overseer is still not upgraded?

On Thu, Oct 14, 2021 at 1:50 PM Joel Bernstein <jo...@gmail.com> wrote:

> This is a follow-up to my last question with my findings thus far. In a
> scenario where there is one SolrCloud resource per shard, I'm seeing the
> overseer node get skipped entirely during rolling restarts. So, it appears
> the solr-operator can only manage rolling restarts when there is one
> SolrCloud object in the cluster.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>

Re: Rolling restarts and the Solr Operator

Posted by Joel Bernstein <jo...@gmail.com>.

This is a follow-up to my last question with my findings thus far. In a
scenario where there is one SolrCloud resource per shard, I'm seeing the
overseer node get skipped entirely during rolling restarts. So, it appears
the solr-operator can only manage rolling restarts when there is one
SolrCloud object in the cluster.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Oct 12, 2021 at 6:44 PM Joel Bernstein <jo...@gmail.com> wrote:

> Hi,
>
> I saw that the Solr Operator takes collection topology into account when
> performing rolling restarts. I'm wondering how this behaves when there is
> one SolrCloud object per shard. In that case the Solr Operator would receive
> a different custom resource (CR) for each shard, which would kick off the
> rolling restarts in parallel. Would the operator understand that it is
> operating on a single shard in each CR, and not get tangled up in the larger
> cluster state?
>
> Thanks,
> Joel