Posted to users@solr.apache.org by Radu Gheorghe <ra...@sematext.com> on 2022/08/01 07:36:43 UTC

Re: Trouble with REBALANCELEADERS api calls

Hi Stephen,

I would generally prefer a low value of "maxAtOnce" when calling
REBALANCELEADERS, so that I don't add too much pressure at once. Something
like 1 or 2 should be OK, unless there are other constraints that get in
the way.

I assume that if you move too many at once (and by default it tries to do
them all at once), something might time out - maybe in ZooKeeper or the
Overseer? That's where I would expect to see some logs.
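
For reference, a minimal sketch of what I mean (Python with the requests
library; the base URL and collection name are placeholders):

    import requests

    SOLR_BASE = "http://localhost:8983/solr"   # placeholder Solr URL
    COLLECTION = "my_collection"               # placeholder collection name

    # Move at most one leader at a time and wait up to 60s per election
    resp = requests.get(
        f"{SOLR_BASE}/admin/collections",
        params={
            "action": "REBALANCELEADERS",
            "collection": COLLECTION,
            "maxAtOnce": 1,
            "maxWaitSeconds": 60,
        },
        timeout=120,
    )
    print(resp.json())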

Best regards,
Radu
--
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
Sematext Cloud - Full Stack Observability
https://sematext.com/


On Thu, Jul 28, 2022 at 10:06 PM Stephen Lewis Bianamara <
stephen.bianamara@gmail.com> wrote:

> Hey Solr Folks!
>
> I'm managing a Solr 8.3.1 cluster and have had trouble with the
> REBALANCELEADERS API calls.
>
> These calls seem to always fail on one or two shards (this happens on
> clusters ranging from 24 to 60 shards). These failures range from "soft"
> failures (e.g. the API returns that it could not change the leader) to
> "hard" failures (every node in the shard goes down).
>
> *Details*
> The cluster runs a dedicated Overseer and an external 3-node ZooKeeper
> ensemble, each on a dedicated VM. None of these, nor the Solr instances
> themselves (ahead of the API call), seems to be particularly throttled.
> Nor are there any logs on the instance(s) which fail to give up or assume
> leadership.
>
> This is based on an automation script which does the following (a rough
> sketch follows the list) --
>
>    1. Generate the list of new preferred leaders
>    2. Iterate over the shards and add the preferredLeader property to each
>    replica we wish to be leader, or skip it if already present, waiting 3
>    seconds between each call
>    3. Wait 30 seconds; then call REBALANCELEADERS
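> 
> For concreteness, the steps are roughly equivalent to this sketch (Python
> with the requests library; the URL, collection, and shard/replica names
> below are placeholders rather than our real values, and error handling is
> omitted):
> 
>     import time
>     import requests
> 
>     SOLR_BASE = "http://localhost:8983/solr"   # placeholder Solr URL
>     COLLECTION = "my_collection"               # placeholder collection name
> 
>     # Step 1: (shard, replica) pairs we want as preferred leaders
>     desired_leaders = [("shard1", "core_node3"), ("shard2", "core_node7")]
> 
>     # Step 2: set preferredLeader on each target replica, 3 seconds apart
>     # (the real script skips replicas that already have the property)
>     for shard, replica in desired_leaders:
>         requests.get(f"{SOLR_BASE}/admin/collections", params={
>             "action": "ADDREPLICAPROP",
>             "collection": COLLECTION,
>             "shard": shard,
>             "replica": replica,
>             "property": "preferredLeader",
>             "property.value": "true",
>         }, timeout=60)
>         time.sleep(3)
> 
>     # Step 3: wait, then ask Solr to move leadership to preferred replicas
>     time.sleep(30)
>     requests.get(f"{SOLR_BASE}/admin/collections", params={
>         "action": "REBALANCELEADERS",
>         "collection": COLLECTION,
>     }, timeout=120)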
>
> *My questions*
>
>    1. Is there something wrong or missing with my strategy above?
>    2. Given that I can't find any logs and don't see any system limitations,
>    do you have any recommendations for what to look at to trace down the
>    source of the issue?
>    3. Are there any improvements to this API's stability in Solr 8.4-9.0,
>    or any planned for the future?
>
> Thanks in advance!
> Stephen
>

Re: Trouble with REBALANCELEADERS api calls

Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Hi Radu,

Thanks for the advice. I'll try out setting maxAtOnce to 1 going forward.

Best,
Stephen
