Posted to dev@solr.apache.org by Pierre Salagnac <pi...@gmail.com> on 2023/03/28 14:46:54 UTC

shard with no leader if Zookeeper session expires at an unlucky moment

Hello everyone,
I'm investigating issues where a replica ends up having no leader, and I
wonder whether the specific cases below have already been discussed somewhere.

More specifically, in the code, I (with the help of my colleagues)
identified two gaps where we exit the leader election process without ever
going back to it. Both happen when the ephemeral election node is dropped
because the ZooKeeper session expired.

First one, in class LeaderElector:
- we log *"Our node is no longer in line to be leader"*
- and immediately return

Second one, in class
- we log *"Will not register as leader because it seems the election is no
longer taking place."*
- and immediately return

In both cases, we explicitly check that our sequential node still exists in
the election. The first case calls zkClient.getChildren(...) and then
validates the result, while the second case catches a NoNodeException.
Unless I'm missing something, the node never gets back into this election.
Since we aborted, any other nodes in the election can become the leader for
this shard. But if there are none (and we are still there), we simply cannot
become the leader.
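
To make the gap concrete, here is a minimal in-memory sketch of the first
case. It is not Solr's LeaderElector and does not talk to ZooKeeper: the
class, its methods, and the "rejoin" variant are all hypothetical names
invented for illustration, and only the log message is taken from the
description above. It assumes the lowest sequence number wins the election.

```java
import java.util.Optional;
import java.util.TreeMap;

// Illustrative model only; this is NOT Solr's LeaderElector.
// Every class and method name here is hypothetical.
class LeaderElectionGapSketch {

    // Stand-in for the election znode children: sequence number -> replica.
    private final TreeMap<Integer, String> electionNodes = new TreeMap<>();
    private int nextSeq = 0;

    // Models creating an ephemeral sequential node; returns its sequence number.
    int join(String replica) {
        int seq = nextSeq++;
        electionNodes.put(seq, replica);
        return seq;
    }

    // Models a session expiry: the ephemeral node disappears.
    void expire(int seq) {
        electionNodes.remove(seq);
    }

    // Assumption: the participant with the lowest sequence number leads.
    Optional<String> leader() {
        if (electionNodes.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(electionNodes.firstEntry().getValue());
    }

    // Behaviour as described above: our node is gone, so log and return.
    // Returns the sequence number still held, or -1 if we left the election.
    int checkAndGiveUp(String replica, int seq) {
        if (!electionNodes.containsKey(seq)) {
            System.out.println(replica + ": Our node is no longer in line to be leader");
            return -1;
        }
        return seq;
    }

    // Hypothetical alternative: rejoin the election instead of returning.
    int checkAndRejoin(String replica, int seq) {
        if (!electionNodes.containsKey(seq)) {
            System.out.println(replica + ": node is gone, rejoining the election");
            return join(replica);
        }
        return seq;
    }

    public static void main(String[] args) {
        LeaderElectionGapSketch election = new LeaderElectionGapSketch();
        int seq = election.join("replica1");
        election.expire(seq); // session expires at the unlucky moment

        election.checkAndGiveUp("replica1", seq);
        System.out.println("leader after giving up: " + election.leader());

        election.checkAndRejoin("replica1", seq);
        System.out.println("leader after rejoining: " + election.leader());
    }
}
```

With a single replica, the first path leaves the election empty, so there is
no leader until something else makes the replica join again. The second path
is only one possible shape for a retry, not a proposal for the actual fix.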


Taking a step back, it seems to me error handling in the leader election
code is messy. There are a large number of catch blocks. Some of them
trigger a retry of the election while some of them don't.

Have these issues already been discussed?
Thanks

Re: shard with no leader if Zookeeper session expires at an unlucky moment

Posted by Houston Putman <ho...@apache.org>.

So I think everyone would agree that the leader election logic is messy and
there is lots of room for improvement.

The ultimate goal is to use Apache Curator to eventually replace most of
our complex ZooKeeper logic. However, for an annoying reason, that work has
stalled for the past year.

I think everyone would agree that the leader election logic, while usually
good, is often the source of pain for people running/managing Solr.
I think fixing these issues piecemeal is probably the way to go until we
can continue on our long-awaited Curator migration.
Would you mind opening an issue/PR to tackle what you found?
It'll probably be easier to discuss the specifics there.

- Houston

Re: shard with no leader if Zookeeper session expires at an unlucky moment

Posted by David Smiley <ds...@apache.org>.

We have a fix for this issue, if I recall.  The version is 8.6, but the
problem likely existed long before -- there have been few improvements that
I've seen to deep SolrCloud internals in recent years.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, May 18, 2023 at 11:49 PM Gus Heck <gu...@gmail.com> wrote:

> What version of solr is having this problem?

Re: shard with no leader if Zookeeper session expires at an unlucky moment

Posted by Gus Heck <gu...@gmail.com>.

What version of Solr is having this problem?

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)