You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@solr.apache.org by David Smiley <ds...@apache.org> on 2022/10/11 22:34:38 UTC

JIT Shard leader design/proposal

At work, I’ve attempted to troubleshoot issues relating to shard
leadership.  It’s quite possible that the root-causes may be related to
customizations in my fork of Solr; who knows.  The leadership
code/algorithm is so hard to debug/troubleshoot that it’s hard to say.
It’s no secret that Solr’s code here is a complicated puzzle[1].  Out of
this frustration, I began to ponder a fantasy of how I want leader election
to work, informed by my desire to scale to massive numbers of collections &
shards on a cluster.  Using Curator for elections would perhaps address
stability but not this scale.  I’d like to get input from you all on this
fantasy.  Surely I have overlooked things; please offer your insights!

Thematic concept:  Don’t change/elect leaders until it’s actually
necessary.  In most cases where I work, the leader will return before we
truly need a leader.  Even when not true, I don’t think doing it lazily
should be a noticeable issue?  If so, it’s easy to imagine augmenting this
design with an optional eager leadership election.

A. Only code paths that truly need a leader will do “leadership checks”,
resulting in a potential leader election.  This is principally on indexing
in DistributedZkUpdateProcessor but there are likely more spots.

B. Leader check: Check if the shard’s leader is (a) known, and (b)
state=ACTIVE, and (c) on a “live” node, and (d) the preferredLeader
condition is satisfied.  Otherwise, try to elect a leader in a loop until
this set of conditions is achieved or a timeout is reached.
B.A: The preferredLeader condition means that either the leader is marked
as preferredLeader, or no replica with preferredLeader is eligible for
leadership.

C. “Try to elect a leader”:   (The word “election” might not be the best
word for this algorithm, but whatever).
C.1.: A replica must be eligible to be a leader.  It must be live (on a
live node) and have an ACTIVE state.  And, very important, eligibility
should be governed by ZkShardTerms which knows which replicas have the most
up-to-date state.
C.1.A: Strict use of ZkShardTerms is designed to ensure that there is no
data loss.  That said “forceLeader” remains in the toolbox of Solr admins
(which monkey’s with ZkShardTerms to cheat).  We may need a new optional
mechanisms to be closer to what we have today — to basically ignore
ZkShardTerms after a configured period of time?
C.1.B. I am assuming that replicas will become eligible on their own (e.g.
as nodes re-join) instead of this algorithm needing to initiate/tell any to
get into this state somehow.
C.2: If there are no leader-eligible replicas, complain with useful
information to diagnose why no leader was found.  Don’t log this if we
already logged this same message in our leadership check loop.  Sleep
perhaps 1000ms and try the loop again.  If we can wait/monitor on the state
of something convenient then do that to avoid sleeping for too long.
C.3: Of the leader-eligible replicas — pick whichever one as the leader
(e.g. random).  Prefer preferredLeader=true replicas, of course.  ZK will
solve races if this algorithm runs on more than one node.

D. Only track leadership in Slice (i.e. within the cluster state) which is
backed by one spot in ZK.  Don’t put it in places like CloudDescriptor or
other places in ZK.

Thoughts?


[1]
https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader
“ZkCmdExecutor” thread with Mark Miller, and referencing
https://www.solrdev.io/leader-election-adventure.html which no longer
resolves


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

Re: JIT Shard leader design/proposal

Posted by Jan Høydahl <ja...@cominvent.com>.
Agree too,

Are you trying to optimize for nodes in a large cluster going down and coming up again within a short time period withuot another replica being elected leader?

I'd prefer if Solr any rewrite would use proven recipies from e.g. Curator instead of rolling our own.
If we have complex election rules, then perhaps we could have several election groups, so first try to elect from nodes in preferred leaders group, then if that fails, elect between remaining eligible nodes.

Jan

> 14. okt. 2022 kl. 12:21 skrev Noble Paul <no...@gmail.com>:
> 
> "just in time is probably less than ideal for most of the more common uses
> cases."
> 
> Agree
> 
> On Fri, Oct 14, 2022, 9:11 PM Mark Miller <ma...@gmail.com> wrote:
> 
>> I don’t have much to say about the proposal, other than to say that if an
>> election ever ends up involving syncing up and exchanging data, doing that
>> just in time is probably less than ideal for most of the more common uses
>> cases.
>> 
>> That’s just an aside though. Id be more interested in seeing the proposal
>> connect problems with solutions. My quick read makes me think the goal is
>> some dimension of scale (I’m guessing a lazy dimension, usually no the most
>> common Solr architecture in my experience fwiw). But I don’t see what the
>> problems are for that dimension of scale or how to connect proposals to
>> solutions to the problems. Unless I’m just missing it.
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: JIT Shard leader design/proposal

Posted by Noble Paul <no...@gmail.com>.
"just in time is probably less than ideal for most of the more common uses
cases."

Agree

On Fri, Oct 14, 2022, 9:11 PM Mark Miller <ma...@gmail.com> wrote:

> I don’t have much to say about the proposal, other than to say that if an
> election ever ends up involving syncing up and exchanging data, doing that
> just in time is probably less than ideal for most of the more common uses
> cases.
>
> That’s just an aside though. Id be more interested in seeing the proposal
> connect problems with solutions. My quick read makes me think the goal is
> some dimension of scale (I’m guessing a lazy dimension, usually no the most
> common Solr architecture in my experience fwiw). But I don’t see what the
> problems are for that dimension of scale or how to connect proposals to
> solutions to the problems. Unless I’m just missing it.
>

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
I'm trying to understand the needs of a "typical case" better with regards
to this proposed design and how it would be negatively impacted.  Maybe not
at all for NRT as any up-to-date replica can be cheaply made the leader, so
it doesn't matter when.  A TLOG non-leader has to replay (uses a bunch of
threads on the node).  In the proposal, this is work that would be
completely avoided if the node is unavailable for a duration of time short
enough such that there is no indexing.  In the so-called "typical case", I
suppose this could be seen as doing work to prepare ourselves to be able to
index docs right away if one comes in during this period so that we can
optimize for indexing availability / performance instead?  I think this
could easily be a configurable option such that a TLOG replica would
observe the non-availability in its leader so that it might take charge and
be leader eagerly.

> Maybe basic improvements like that

There are already basic node limits for replaying the update log, from what
I see.  replayUpdateThreads mainly.  It defaults to the number of CPU
threads.  Perhaps in systems you see, it's configured to 500?  Based on my
recollection of some replay challenges with document versions & locks that
Dat & I worked on, I could see how increasing it would be helpful.  There
is no cap on the number of replays happening, which I could see us wanting
to do in order to speed up how soon a replica that is already replaying
could become ready.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Oct 17, 2022 at 3:46 AM Mark Miller <ma...@gmail.com> wrote:

> Determining the leader is extremely cheap in the general case. It’s when
> you have to exchange data (generally when that exchange involves
> replication) that’s expensive. Or when you spin up 500 threads for 500
> cheap operations. For the common use case, a very basic and long needed
> feature in that regard is simple management. Rather then flood the system
> at once with 500 replications, there needs to be a gate on how many
> expensive operations like that can occur at once. Same with spinning up 500
> threads. Maybe basic improvements like that won’t be the ideal end game for
> a system that wants 100,000 lazy cores where most of them are rarely
> active, but there is always going to be lots of tension trying to solve for
> the typical use and a system like that.
>

Re: JIT Shard leader design/proposal

Posted by Mark Miller <ma...@gmail.com>.
Determining the leader is extremely cheap in the general case. It’s when
you have to exchange data (generally when that exchange involves
replication) that’s expensive. Or when you spin up 500 threads for 500
cheap operations. For the common use case, a very basic and long needed
feature in that regard is simple management. Rather then flood the system
at once with 500 replications, there needs to be a gate on how many
expensive operations like that can occur at once. Same with spinning up 500
threads. Maybe basic improvements like that won’t be the ideal end game for
a system that wants 100,000 lazy cores where most of them are rarely
active, but there is always going to be lots of tension trying to solve for
the typical use and a system like that.

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
One issue I see is that a TLOG non-leader can't simply be declared a leader
at any time, just because it's up to date via ZkShardTerms.  It will need
to replay its updateLog first.  I don't think that can be delayed till when
it receives a doc to index because queries assume a leader is up to date,
certainly for RTG at least.  The algorithm I listed could be changed such
that the chosen replica is informed so that it can do the needful itself.
This also acts as an implicit extra liveness check that the chosen leader
is reachable.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Oct 14, 2022 at 8:46 AM David Smiley <ds...@apache.org> wrote:

> And to emphasize the brilliance of ZkShardTerms (in turn, on RAFT which is
> it's basis/genesis), we might not even need replica states.  ZkShardTerms +
> live_nodes is probably enough.  Credit to Dat on this; we were just in an
> email exchange about this stuff.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, Oct 14, 2022 at 8:41 AM David Smiley <ds...@apache.org> wrote:
>
>> On Fri, Oct 14, 2022 at 6:11 AM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>> I don’t have much to say about the proposal, other than to say that if an
>>> election ever ends up involving syncing up and exchanging data, doing
>>> that
>>> just in time is probably less than ideal for most of the more common uses
>>> cases.
>>>
>>
>> I should emphasize that replicas continue to want to sync-up with their
>> leaders on their own -- eagerly.  See RecoveringCoreTermWatcher.  I could
>> imagine an option to allow one to have this be lazy as well but I'm not
>> proposing that now.
>>
>>
>>> That’s just an aside though. Id be more interested in seeing the proposal
>>> connect problems with solutions.
>>
>>
>> The #1 point of the proposal is robustness/stability and not a specific
>> bug.  Instead of wondering why there isn't a leader, this proposal would
>> log information about the pertinent state and elect a leader if one is
>> eligible.  No wondering why one wasn't elected.  Still, of course we might
>> wonder other things based on the logged information (of course), but you're
>> then closer to debugging what's happening.  This is more user/operator
>> friendly.
>>
>>
>>> My quick read makes me think the goal is
>>> some dimension of scale (I’m guessing a lazy dimension, usually no the
>>> most
>>> common Solr architecture in my experience fwiw). But I don’t see what the
>>> problems are for that dimension of scale or how to connect proposals to
>>> solutions to the problems. Unless I’m just missing it.
>>>
>>
>> I think the proposal somewhat indirectly addresses a dimension of scale
>> involving tens of thousands of shards (2x more replicas) across a large
>> SolrCloud cluster, and using a simple node restart as an example.  Assuming
>> replicas continue to try to be in sync based on ZkShardTerms (sorry I
>> wasn't clear on that), the actual choice of who is leader need not happen
>> eagerly; it's premature I say.
>>
>> BTW, the proposal's strategy is complementary with additional leader
>> algorithms being in-place, though I don't think we need both.  The most
>> important mechanism is ZkShardTerms which is already in-place to govern
>> leader eligibility; it's brilliant. With that complicated problem already
>> solved, we're merely left with simply picking an eligible replica and doing
>> it (recording it in ZK) -- so just pick one already and be done with it :-)
>>
>

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
And to emphasize the brilliance of ZkShardTerms (in turn, on RAFT which is
it's basis/genesis), we might not even need replica states.  ZkShardTerms +
live_nodes is probably enough.  Credit to Dat on this; we were just in an
email exchange about this stuff.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Oct 14, 2022 at 8:41 AM David Smiley <ds...@apache.org> wrote:

> On Fri, Oct 14, 2022 at 6:11 AM Mark Miller <ma...@gmail.com> wrote:
>
>> I don’t have much to say about the proposal, other than to say that if an
>> election ever ends up involving syncing up and exchanging data, doing that
>> just in time is probably less than ideal for most of the more common uses
>> cases.
>>
>
> I should emphasize that replicas continue to want to sync-up with their
> leaders on their own -- eagerly.  See RecoveringCoreTermWatcher.  I could
> imagine an option to allow one to have this be lazy as well but I'm not
> proposing that now.
>
>
>> That’s just an aside though. Id be more interested in seeing the proposal
>> connect problems with solutions.
>
>
> The #1 point of the proposal is robustness/stability and not a specific
> bug.  Instead of wondering why there isn't a leader, this proposal would
> log information about the pertinent state and elect a leader if one is
> eligible.  No wondering why one wasn't elected.  Still, of course we might
> wonder other things based on the logged information (of course), but you're
> then closer to debugging what's happening.  This is more user/operator
> friendly.
>
>
>> My quick read makes me think the goal is
>> some dimension of scale (I’m guessing a lazy dimension, usually no the
>> most
>> common Solr architecture in my experience fwiw). But I don’t see what the
>> problems are for that dimension of scale or how to connect proposals to
>> solutions to the problems. Unless I’m just missing it.
>>
>
> I think the proposal somewhat indirectly addresses a dimension of scale
> involving tens of thousands of shards (2x more replicas) across a large
> SolrCloud cluster, and using a simple node restart as an example.  Assuming
> replicas continue to try to be in sync based on ZkShardTerms (sorry I
> wasn't clear on that), the actual choice of who is leader need not happen
> eagerly; it's premature I say.
>
> BTW, the proposal's strategy is complementary with additional leader
> algorithms being in-place, though I don't think we need both.  The most
> important mechanism is ZkShardTerms which is already in-place to govern
> leader eligibility; it's brilliant. With that complicated problem already
> solved, we're merely left with simply picking an eligible replica and doing
> it (recording it in ZK) -- so just pick one already and be done with it :-)
>

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
On Fri, Oct 14, 2022 at 6:11 AM Mark Miller <ma...@gmail.com> wrote:

> I don’t have much to say about the proposal, other than to say that if an
> election ever ends up involving syncing up and exchanging data, doing that
> just in time is probably less than ideal for most of the more common uses
> cases.
>

I should emphasize that replicas continue to want to sync-up with their
leaders on their own -- eagerly.  See RecoveringCoreTermWatcher.  I could
imagine an option to allow one to have this be lazy as well but I'm not
proposing that now.


> That’s just an aside though. Id be more interested in seeing the proposal
> connect problems with solutions.


The #1 point of the proposal is robustness/stability and not a specific
bug.  Instead of wondering why there isn't a leader, this proposal would
log information about the pertinent state and elect a leader if one is
eligible.  No wondering why one wasn't elected.  Still, of course we might
wonder other things based on the logged information (of course), but you're
then closer to debugging what's happening.  This is more user/operator
friendly.


> My quick read makes me think the goal is
> some dimension of scale (I’m guessing a lazy dimension, usually no the most
> common Solr architecture in my experience fwiw). But I don’t see what the
> problems are for that dimension of scale or how to connect proposals to
> solutions to the problems. Unless I’m just missing it.
>

I think the proposal somewhat indirectly addresses a dimension of scale
involving tens of thousands of shards (2x more replicas) across a large
SolrCloud cluster, and using a simple node restart as an example.  Assuming
replicas continue to try to be in sync based on ZkShardTerms (sorry I
wasn't clear on that), the actual choice of who is leader need not happen
eagerly; it's premature I say.

BTW, the proposal's strategy is complementary with additional leader
algorithms being in-place, though I don't think we need both.  The most
important mechanism is ZkShardTerms which is already in-place to govern
leader eligibility; it's brilliant. With that complicated problem already
solved, we're merely left with simply picking an eligible replica and doing
it (recording it in ZK) -- so just pick one already and be done with it :-)

Re: JIT Shard leader design/proposal

Posted by Mark Miller <ma...@gmail.com>.
I don’t have much to say about the proposal, other than to say that if an
election ever ends up involving syncing up and exchanging data, doing that
just in time is probably less than ideal for most of the more common uses
cases.

That’s just an aside though. Id be more interested in seeing the proposal
connect problems with solutions. My quick read makes me think the goal is
some dimension of scale (I’m guessing a lazy dimension, usually no the most
common Solr architecture in my experience fwiw). But I don’t see what the
problems are for that dimension of scale or how to connect proposals to
solutions to the problems. Unless I’m just missing it.

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
On Thu, Oct 13, 2022 at 10:28 AM Bruno Roustant <br...@gmail.com>
wrote:

> I don't know enough how the current leader election mechanism works, yet.
> I miss a comparison between this proposal and the current mechanism.
>
> B. With this proposal, each time we need the leader, we check it. What is
> the cost of this check? Do we need to read the cluster state each time?
>

DocCollection and friends are updated mainly via ZK watchers (i.e. async;
not in the critical path).  I proposed below that client/server could pass
ZK version hints to let either side know that it ought to explicitly go to
ZK when the states differ.  Regardless; "each time" (an update comes), we
need not get the state from ZK. The state could be stale (as it can be
today!).  It will interact with other replicas (either as a leader or
non-leader) and discover it's acting on stale information, *then* update
itself.


> C.1.A "Strict use of ZkShardTerms". Does that mean that we wait for one
> replica to become up to date?


Yes.


> Does that mechanism already exist?
>

Yes, certainly.  Replicas sync with the leader in order to become
state=ACTIVE.  Thanks to ZkShardTerms, they know if they are caught up, and
are thus already leader-eligible.  I see extra complexity around leadership
that perhaps pre-dated ZkShardTerms.


> In the meantime, if the previous leader comes back, we can choose it.
>

Yes; I think this is rather typical; maybe the leader's node simply
restarted.

~ David

Le jeu. 13 oct. 2022 à 16:24, David Smiley <ds...@apache.org> a écrit :
>
> > "JIT" is Just-In-Time; a way of looking at it.  Or call it on-demand
> leader
> > elections.
> >
> > A property of my proposal that may not be obvious is that
> REBALANCELEADERS
> > would be needless since preferredLeaders would become leaders
> automatically
> > on-demand (when a leader is next needed).
> >
> > It's not clear how amenable Curator's election recipe is to
> preferredLeader
> > and ZkShardTerms requirements; so using Curator for elections might
> simply
> > not be an option?
> >
> > Here's another thematic concept that is tangentially related; maybe it's
> > worth its own discussion:  Solr's view of state from ZK could become
> stale
> > at any time.  Embrace sharing of ZK node versions between client and
> server
> > to trigger the need for either side to update its state from ZK.
> Watchers
> > are nice but we shouldn't rely on them for correctness.  Put differently,
> > SolrCloud should function (albeit perhaps slowly) if its use of ZK
> Watchers
> > stopped working.  Today, CloudSolrClient passes _stateVer_ and it's
> parsed
> > on the server by
> > org.apache.solr.servlet.HttpSolrCall#checkStateVersionsAreValid.  This
> > should be used more pervasively in most interactions, not merely a subset
> > of uses of CloudSolrClient.  I think it might only be used to tell the
> > client to get more up to date but ideally it'd be bidirectional.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Tue, Oct 11, 2022 at 6:34 PM David Smiley <ds...@apache.org> wrote:
> >
> > > At work, I’ve attempted to troubleshoot issues relating to shard
> > > leadership.  It’s quite possible that the root-causes may be related to
> > > customizations in my fork of Solr; who knows.  The leadership
> > > code/algorithm is so hard to debug/troubleshoot that it’s hard to say.
> > > It’s no secret that Solr’s code here is a complicated puzzle[1].  Out
> of
> > > this frustration, I began to ponder a fantasy of how I want leader
> > election
> > > to work, informed by my desire to scale to massive numbers of
> > collections &
> > > shards on a cluster.  Using Curator for elections would perhaps address
> > > stability but not this scale.  I’d like to get input from you all on
> this
> > > fantasy.  Surely I have overlooked things; please offer your insights!
> > >
> > > Thematic concept:  Don’t change/elect leaders until it’s actually
> > > necessary.  In most cases where I work, the leader will return before
> we
> > > truly need a leader.  Even when not true, I don’t think doing it lazily
> > > should be a noticeable issue?  If so, it’s easy to imagine augmenting
> > this
> > > design with an optional eager leadership election.
> > >
> > > A. Only code paths that truly need a leader will do “leadership
> checks”,
> > > resulting in a potential leader election.  This is principally on
> > indexing
> > > in DistributedZkUpdateProcessor but there are likely more spots.
> > >
> > > B. Leader check: Check if the shard’s leader is (a) known, and (b)
> > > state=ACTIVE, and (c) on a “live” node, and (d) the preferredLeader
> > > condition is satisfied.  Otherwise, try to elect a leader in a loop
> until
> > > this set of conditions is achieved or a timeout is reached.
> > > B.A: The preferredLeader condition means that either the leader is
> marked
> > > as preferredLeader, or no replica with preferredLeader is eligible for
> > > leadership.
> > >
> > > C. “Try to elect a leader”:   (The word “election” might not be the
> best
> > > word for this algorithm, but whatever).
> > > C.1.: A replica must be eligible to be a leader.  It must be live (on a
> > > live node) and have an ACTIVE state.  And, very important, eligibility
> > > should be governed by ZkShardTerms which knows which replicas have the
> > most
> > > up-to-date state.
> > > C.1.A: Strict use of ZkShardTerms is designed to ensure that there is
> no
> > > data loss.  That said “forceLeader” remains in the toolbox of Solr
> admins
> > > (which monkey’s with ZkShardTerms to cheat).  We may need a new
> optional
> > > mechanisms to be closer to what we have today — to basically ignore
> > > ZkShardTerms after a configured period of time?
> > > C.1.B. I am assuming that replicas will become eligible on their own
> > (e.g.
> > > as nodes re-join) instead of this algorithm needing to initiate/tell
> any
> > to
> > > get into this state somehow.
> > > C.2: If there are no leader-eligible replicas, complain with useful
> > > information to diagnose why no leader was found.  Don’t log this if we
> > > already logged this same message in our leadership check loop.  Sleep
> > > perhaps 1000ms and try the loop again.  If we can wait/monitor on the
> > state
> > > of something convenient then do that to avoid sleeping for too long.
> > > C.3: Of the leader-eligible replicas — pick whichever one as the leader
> > > (e.g. random).  Prefer preferredLeader=true replicas, of course.  ZK
> will
> > > solve races if this algorithm runs on more than one node.
> > >
> > > D. Only track leadership in Slice (i.e. within the cluster state) which
> > is
> > > backed by one spot in ZK.  Don’t put it in places like CloudDescriptor
> or
> > > other places in ZK.
> > >
> > > Thoughts?
> > >
> > >
> > > [1]
> > >
> >
> https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader
> > > “ZkCmdExecutor” thread with Mark Miller, and referencing
> > > https://www.solrdev.io/leader-election-adventure.html which no longer
> > > resolves
> > >
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> >
>

Re: JIT Shard leader design/proposal

Posted by Bruno Roustant <br...@gmail.com>.
I don't know enough how the current leader election mechanism works, yet.
I miss a comparison between this proposal and the current mechanism.

B. With this proposal, each time we need the leader, we check it. What is
the cost of this check? Do we need to read the cluster state each time?

C.1.A "Strict use of ZkShardTerms". Does that mean that we wait for one
replica to become up to date? Does that mechanism already exist?
In the meantime, if the previous leader comes back, we can choose it.

Le jeu. 13 oct. 2022 à 16:24, David Smiley <ds...@apache.org> a écrit :

> "JIT" is Just-In-Time; a way of looking at it.  Or call it on-demand leader
> elections.
>
> A property of my proposal that may not be obvious is that REBALANCELEADERS
> would be needless since preferredLeaders would become leaders automatically
> on-demand (when a leader is next needed).
>
> It's not clear how amenable Curator's election recipe is to preferredLeader
> and ZkShardTerms requirements; so using Curator for elections might simply
> not be an option?
>
> Here's another thematic concept that is tangentially related; maybe it's
> worth its own discussion:  Solr's view of state from ZK could become stale
> at any time.  Embrace sharing of ZK node versions between client and server
> to trigger the need for either side to update its state from ZK.  Watchers
> are nice but we shouldn't rely on them for correctness.  Put differently,
> SolrCloud should function (albeit perhaps slowly) if its use of ZK Watchers
> stopped working.  Today, CloudSolrClient passes _stateVer_ and it's parsed
> on the server by
> org.apache.solr.servlet.HttpSolrCall#checkStateVersionsAreValid.  This
> should be used more pervasively in most interactions, not merely a subset
> of uses of CloudSolrClient.  I think it might only be used to tell the
> client to get more up to date but ideally it'd be bidirectional.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Oct 11, 2022 at 6:34 PM David Smiley <ds...@apache.org> wrote:
>
> > At work, I’ve attempted to troubleshoot issues relating to shard
> > leadership.  It’s quite possible that the root-causes may be related to
> > customizations in my fork of Solr; who knows.  The leadership
> > code/algorithm is so hard to debug/troubleshoot that it’s hard to say.
> > It’s no secret that Solr’s code here is a complicated puzzle[1].  Out of
> > this frustration, I began to ponder a fantasy of how I want leader
> election
> > to work, informed by my desire to scale to massive numbers of
> collections &
> > shards on a cluster.  Using Curator for elections would perhaps address
> > stability but not this scale.  I’d like to get input from you all on this
> > fantasy.  Surely I have overlooked things; please offer your insights!
> >
> > Thematic concept:  Don’t change/elect leaders until it’s actually
> > necessary.  In most cases where I work, the leader will return before we
> > truly need a leader.  Even when not true, I don’t think doing it lazily
> > should be a noticeable issue?  If so, it’s easy to imagine augmenting
> this
> > design with an optional eager leadership election.
> >
> > A. Only code paths that truly need a leader will do “leadership checks”,
> > resulting in a potential leader election.  This is principally on
> indexing
> > in DistributedZkUpdateProcessor but there are likely more spots.
> >
> > B. Leader check: Check if the shard’s leader is (a) known, and (b)
> > state=ACTIVE, and (c) on a “live” node, and (d) the preferredLeader
> > condition is satisfied.  Otherwise, try to elect a leader in a loop until
> > this set of conditions is achieved or a timeout is reached.
> > B.A: The preferredLeader condition means that either the leader is marked
> > as preferredLeader, or no replica with preferredLeader is eligible for
> > leadership.
> >
> > C. “Try to elect a leader”:   (The word “election” might not be the best
> > word for this algorithm, but whatever).
> > C.1.: A replica must be eligible to be a leader.  It must be live (on a
> > live node) and have an ACTIVE state.  And, very important, eligibility
> > should be governed by ZkShardTerms which knows which replicas have the
> most
> > up-to-date state.
> > C.1.A: Strict use of ZkShardTerms is designed to ensure that there is no
> > data loss.  That said “forceLeader” remains in the toolbox of Solr admins
> > (which monkey’s with ZkShardTerms to cheat).  We may need a new optional
> > mechanisms to be closer to what we have today — to basically ignore
> > ZkShardTerms after a configured period of time?
> > C.1.B. I am assuming that replicas will become eligible on their own
> (e.g.
> > as nodes re-join) instead of this algorithm needing to initiate/tell any
> to
> > get into this state somehow.
> > C.2: If there are no leader-eligible replicas, complain with useful
> > information to diagnose why no leader was found.  Don’t log this if we
> > already logged this same message in our leadership check loop.  Sleep
> > perhaps 1000ms and try the loop again.  If we can wait/monitor on the
> state
> > of something convenient then do that to avoid sleeping for too long.
> > C.3: Of the leader-eligible replicas — pick whichever one as the leader
> > (e.g. random).  Prefer preferredLeader=true replicas, of course.  ZK will
> > solve races if this algorithm runs on more than one node.
> >
> > D. Only track leadership in Slice (i.e. within the cluster state) which
> is
> > backed by one spot in ZK.  Don’t put it in places like CloudDescriptor or
> > other places in ZK.
> >
> > Thoughts?
> >
> >
> > [1]
> >
> https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader
> > “ZkCmdExecutor” thread with Mark Miller, and referencing
> > https://www.solrdev.io/leader-election-adventure.html which no longer
> > resolves
> >
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
>

Re: JIT Shard leader design/proposal

Posted by David Smiley <ds...@apache.org>.
"JIT" is Just-In-Time; a way of looking at it.  Or call it on-demand leader
elections.

A property of my proposal that may not be obvious is that REBALANCELEADERS
would be needless since preferredLeaders would become leaders automatically
on-demand (when a leader is next needed).

It's not clear how amenable Curator's election recipe is to preferredLeader
and ZkShardTerms requirements; so using Curator for elections might simply
not be an option?

Here's another thematic concept that is tangentially related; maybe it's
worth its own discussion:  Solr's view of state from ZK could become stale
at any time.  Embrace sharing of ZK node versions between client and server
to trigger the need for either side to update its state from ZK.  Watchers
are nice but we shouldn't rely on them for correctness.  Put differently,
SolrCloud should function (albeit perhaps slowly) if its use of ZK Watchers
stopped working.  Today, CloudSolrClient passes _stateVer_ and it's parsed
on the server by
org.apache.solr.servlet.HttpSolrCall#checkStateVersionsAreValid.  This
should be used more pervasively in most interactions, not merely a subset
of uses of CloudSolrClient.  I think it might only be used to tell the
client to get more up to date but ideally it'd be bidirectional.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Oct 11, 2022 at 6:34 PM David Smiley <ds...@apache.org> wrote:

> At work, I’ve attempted to troubleshoot issues relating to shard
> leadership.  It’s quite possible that the root-causes may be related to
> customizations in my fork of Solr; who knows.  The leadership
> code/algorithm is so hard to debug/troubleshoot that it’s hard to say.
> It’s no secret that Solr’s code here is a complicated puzzle[1].  Out of
> this frustration, I began to ponder a fantasy of how I want leader election
> to work, informed by my desire to scale to massive numbers of collections &
> shards on a cluster.  Using Curator for elections would perhaps address
> stability but not this scale.  I’d like to get input from you all on this
> fantasy.  Surely I have overlooked things; please offer your insights!
>
> Thematic concept:  Don’t change/elect leaders until it’s actually
> necessary.  In most cases where I work, the leader will return before we
> truly need a leader.  Even when not true, I don’t think doing it lazily
> should be a noticeable issue?  If so, it’s easy to imagine augmenting this
> design with an optional eager leadership election.
>
> A. Only code paths that truly need a leader will do “leadership checks”,
> resulting in a potential leader election.  This is principally on indexing
> in DistributedZkUpdateProcessor but there are likely more spots.
>
> B. Leader check: Check if the shard’s leader is (a) known, and (b)
> state=ACTIVE, and (c) on a “live” node, and (d) the preferredLeader
> condition is satisfied.  Otherwise, try to elect a leader in a loop until
> this set of conditions is achieved or a timeout is reached.
> B.A: The preferredLeader condition means that either the leader is marked
> as preferredLeader, or no replica with preferredLeader is eligible for
> leadership.
>
> C. “Try to elect a leader”:   (The word “election” might not be the best
> word for this algorithm, but whatever).
> C.1.: A replica must be eligible to be a leader.  It must be live (on a
> live node) and have an ACTIVE state.  And, very important, eligibility
> should be governed by ZkShardTerms which knows which replicas have the most
> up-to-date state.
> C.1.A: Strict use of ZkShardTerms is designed to ensure that there is no
> data loss.  That said “forceLeader” remains in the toolbox of Solr admins
> (which monkey’s with ZkShardTerms to cheat).  We may need a new optional
> mechanisms to be closer to what we have today — to basically ignore
> ZkShardTerms after a configured period of time?
> C.1.B. I am assuming that replicas will become eligible on their own (e.g.
> as nodes re-join) instead of this algorithm needing to initiate/tell any to
> get into this state somehow.
> C.2: If there are no leader-eligible replicas, complain with useful
> information to diagnose why no leader was found.  Don’t log this if we
> already logged this same message in our leadership check loop.  Sleep
> perhaps 1000ms and try the loop again.  If we can wait/monitor on the state
> of something convenient then do that to avoid sleeping for too long.
> C.3: Of the leader-eligible replicas — pick whichever one as the leader
> (e.g. random).  Prefer preferredLeader=true replicas, of course.  ZK will
> solve races if this algorithm runs on more than one node.
>
> D. Only track leadership in Slice (i.e. within the cluster state) which is
> backed by one spot in ZK.  Don’t put it in places like CloudDescriptor or
> other places in ZK.
>
> Thoughts?
>
>
> [1]
> https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader
> “ZkCmdExecutor” thread with Mark Miller, and referencing
> https://www.solrdev.io/leader-election-adventure.html which no longer
> resolves
>
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>