Posted to solr-user@lucene.apache.org by Zisis Tachtsidis <zi...@runbox.com> on 2015/01/12 17:54:46 UTC

SolrCloud shard leader elections - Altering zookeeper sequence numbers

SolrCloud uses ZooKeeper sequential znodes to keep track of the order in which
nodes register themselves as leader candidates. The node with the lowest
sequence number becomes the leader of the shard.

What I'm trying to do is keep leader re-assignments to a minimum during a
rolling restart. To that end, I change the zk sequence numbers on the
SolrCloud nodes while all nodes of the cluster are up and active. I'm
using Solr 4.10.0 and I'm aware of SOLR-6491, which has a similar purpose, but
I'm trying to do it from the "outside", using the existing APIs without
editing Solr source code.

== TYPICAL SCENARIO ==
Suppose we have 3 Solr instances S1, S2, S3. They are started in that
order and the zk sequences are assigned as follows:
S1:-n_0000000000 (LEADER)
S2:-n_0000000001
S3:-n_0000000002

In a rolling restart we'll get S2 as leader (after S1 shuts down), then S3
(after S2 shuts down), and finally S1 (after S3 shuts down): 3 changes in total.

== MY ATTEMPT ==
By using SolrZkClient and the ZooKeeper multi API, I found a way to get rid
of the old znodes that participate in a shard's leader election and to write
new ones with sequence numbers of our liking.

S1:-n_0000000000 (no code running here)
S2:-n_0000000004 (code deleting zknode -n_0000000001 and creating
-n_0000000004)
S3:-n_0000000003 (code deleting zknode -n_0000000002 and creating
-n_0000000003)

In a rolling restart I'd expect to get S3 as leader (after S1 shuts down), no
change (after S2 shuts down), and finally S1 (after S3 shuts down), that is, 2
changes. This count stays constant no matter how many servers are added to
SolrCloud, while in the first scenario the number of re-assignments equals
the number of Solr servers.

The problem occurs when S1 (LEADER) is shut down. The election that takes
place still makes S2 the leader; it's as if the new sequence numbers are
ignored. When I go to /solr/#/~cloud?view=tree the new sequence numbers are
listed under "/collections", and based on them S3 should have become the leader.
Do you have any idea why the new state is not acknowledged during the
election? Is something cached? Or, to put it bluntly, do I have any chance
down this path? If not, what are my options? Is it possible to apply all
patches under SOLR-6491 in isolation and continue from there?

Thank you. 

Extra info which might help:
1. Some logging related to leader elections after S1 has been shut down
    S2 - org.apache.solr.cloud.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
    S2 - org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
    S3 - org.apache.solr.cloud.LeaderElector Our node is no longer in line to be leader

2. And some sample code on how I perform the ZK re-sequencing
   (solrServer here is a connected CloudSolrServer; path and createPath
   are the old and new election znode paths)

   SolrZooKeeper zk =
       solrServer.getZkStateReader().getZkClient().getSolrZooKeeper();

   // Read the current election znodes for a specific collection/shard
   List<String> nodes = zk.getChildren(
       "/collections/core/leader_elect/shard1/election", true);

   // Queue the deletion of the old znode and the creation of its
   // replacement as a single transaction
   List<Op> ops = new ArrayList<Op>();
   ops.add(Op.delete(path, -1));
   ops.add(Op.create(createPath, new byte[0],
       ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL));

   // Perform both operations atomically and refresh the cluster state
   zk.multi(ops);
   solrServer.getZkStateReader().updateClusterState(true);





Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

Posted by Erick Erickson <er...@gmail.com>.
SolrCloud is intended to work in the rolling restart case...

Index size, segment counts, and segment names can (and will)
be different on different replicas of the same shard without
anything being amiss. Hard commits happen at different
times across the replicas in a shard, and merging logic kicks in
and may (and in all probability eventually will) pick different
segments to merge, with varying numbers of deleted docs
getting purged, etc.

The numFound reported on a q=*:*&distrib=false query, or the numDocs
shown in the admin screen for the cores in question, should be
identical across replicas though, if
1> you've issued a hard commit with openSearcher=true _or_
     a soft commit, and
2> you haven't been indexing, or haven't issued a commit
     as in <1>, since you started looking.
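
For illustration, a minimal SolrJ sketch of that per-replica check (a
sketch only; the URL and core name are placeholders, and it assumes
the 4.x SolrJ API):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // Query one replica directly; distrib=false keeps the request from
    // fanning out, so numFound reflects only that core's index.
    HttpSolrServer replica =
        new HttpSolrServer("http://host1:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.set("distrib", "false");
    q.setRows(0);                 // only the count is needed
    QueryResponse rsp = replica.query(q);
    System.out.println("numFound: " + rsp.getResults().getNumFound());

Run the same query against each replica of the shard and compare the
counts.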

Best,
Erick

On Tue, Jan 13, 2015 at 4:20 AM, Zisis Tachtsidis <zi...@runbox.com> wrote:
> [full quote of Zisis's reply trimmed; it appears as its own message below]

Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

Posted by Zisis Tachtsidis <zi...@runbox.com>.
Daniel Collins wrote
> Is it important where your leader is?  If you just want to minimize
> leadership changes during a rolling restart, then you could restart in the
> opposite order (S3, S2, S1). That would give only 1 transition, but the
> end result would be a leader on S2 instead of S1 (not sure if that's
> important to you or not). I know it's not a "fix", but it might be a
> workaround until the whole leadership-moving work is done?

I think that rolling restarting the machines in the opposite order
(S3, S2, S1) will result in S3 being the leader. It's a valid approach, but
wouldn't I then have to revert to the original order (S1, S2, S3) to achieve
the same result in the following rolling restart? That adds operational
cost and complexity that I want to avoid.


Erick Erickson wrote
>> Just skimming, but the problem here that I ran into was with the
>> listeners. Each _Solr_ instance out there is listening to one of the
>> ephemeral nodes (the "one in front"). So deleting a node does _not_
>> change which ephemeral node the associated Solr instance is listening
>> to.
>>
>> So, for instance, when you delete S2..n-000001 and re-add it, S2 is
>> still looking at S1....n-000000 and will continue looking at
>> S1...n-000000 until S1....n-000000 is deleted.
>>
>> Deleting S2..n-000001 will wake up S3 though, which should now be
>> looking at S1....n-0000000. Now you have two Solr listeners looking at
>> the same ephemeral node. The key is that deleting S2...n-000001 does
>> _not_ wake up S2, just any Solr instance that has a watch on the
>> associated ephemeral node.

Thanks for the info, Erick. I wasn't aware of this "linked-list" structure
of listeners between the zk nodes. Based on what you've said, though, I've
changed my implementation a bit and it seems to be working at first glance.
It hasn't been proven reliable yet, but it looks promising.

My original attempt
> S1:-n_0000000000 (no code running here)
> S2:-n_0000000004 (code deleting zknode -n_0000000001 and creating
> -n_0000000004)
> S3:-n_0000000003 (code deleting zknode -n_0000000002 and creating
> -n_0000000003) 

has been changed to 
S1:-n_0000000000 (no code running here)
S2:-n_0000000003 (code deleting zknode -n_0000000001 and creating
-n_0000000003 using EPHEMERAL_SEQUENTIAL)
S3:-n_0000000002 (no code running here) 

Once S1 is shut down, S3 becomes the leader, since S3 is now watching S1's
node, per your explanation.
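
For reference, the re-sequencing now boils down to a single atomic
transaction against S2's election node, a sketch along these lines
(zk, oldPath, electionPath and nodePrefix are placeholders for the
values read from the election directory):

    // Delete S2's old znode and recreate it with the next sequence
    // number, so S2 ends up queued behind S3 instead of behind S1.
    List<Op> ops = new ArrayList<Op>();
    ops.add(Op.delete(oldPath, -1));
    ops.add(Op.create(electionPath + "/" + nodePrefix + "-n_",
        new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL_SEQUENTIAL));
    zk.multi(ops);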

The original reason I pursued this "minimize leadership changes" quest is
that frequent leader changes _could_ lead to "data loss" in some scenarios.
I'm not entirely sure, though, and you could correct me on this, but let me
explain myself.

If you have incoming indexing requests during a rolling restart, could there
be a case, during the current leader's shutdown, where the leader-to-be node
has not had time to sync with the leader that is shutting down? Every replica
would then sync to the new leader, missing some updates. I've seen an
installation where replicas of the same shard had different index sizes, and
the discrepancy deteriorated over time.





Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

Posted by Daniel Collins <da...@gmail.com>.
Is it important where your leader is?  If you just want to minimize
leadership changes during a rolling restart, then you could restart in the
opposite order (S3, S2, S1). That would give only 1 transition, but the
end result would be a leader on S2 instead of S1 (not sure if that's
important to you or not). I know it's not a "fix", but it might be a
workaround until the whole leadership-moving work is done?

On 12 January 2015 at 18:17, Erick Erickson <er...@gmail.com> wrote:

> [full quote of Erick's reply trimmed; it appears as its own message below]

Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

Posted by Erick Erickson <er...@gmail.com>.
Just skimming, but the problem here that I ran into was with the
listeners. Each _Solr_ instance out there is listening to one of the
ephemeral nodes (the "one in front"). So deleting a node does _not_
change which ephemeral node the associated Solr instance is listening
to.

So, for instance, when you delete S2..n-000001 and re-add it, S2 is
still looking at S1....n-000000 and will continue looking at
S1...n-000000 until S1....n-000000 is deleted.

Deleting S2..n-000001 will wake up S3 though, which should now be
looking at S1....n-0000000. Now you have two Solr listeners looking at
the same ephemeral node. The key is that deleting S2...n-000001 does
_not_ wake up S2, just any Solr instance that has a watch on the
associated ephemeral node.
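
To make the "one in front" structure concrete, here is a minimal
sketch of the standard ZooKeeper election recipe (not Solr's actual
code; zk, electionPath, myNodeName, watcher and becomeLeader are
placeholders):

    // Each participant watches only the node immediately ahead of it.
    List<String> children = zk.getChildren(electionPath, false);
    Collections.sort(children);   // assumes names sort in sequence order
    int me = children.indexOf(myNodeName);
    if (me == 0) {
        becomeLeader();           // lowest sequence wins
    } else {
        // Watch ONLY the predecessor; deleting any other node never
        // wakes this participant, which is why re-sequencing znodes
        // behind a live watcher has no effect until the watched node
        // actually goes away.
        zk.exists(electionPath + "/" + children.get(me - 1), watcher);
    }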

The code you want, to understand how it all works, is in
LeaderElector.checkIfIamLeader. Be aware that the sortSeqs call sorts the
nodes by
1> sequence number
2> string comparison.

Which has the unfortunate characteristic of a secondary sort by
session ID. So two nodes with the same sequence number can sort before
or after each other depending on which one gets a session ID higher or
lower than the other's.
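
A rough sketch of what such a comparison amounts to (a paraphrase, not
the actual sortSeqs source; it assumes the election node names end in
"-n_<sequence>" and start with the session id, and that nodes is the
list of election znode names):

    // Primary sort: numeric sequence suffix. Tie-break: plain string
    // comparison, which effectively compares the leading session ids.
    Collections.sort(nodes, new Comparator<String>() {
        public int compare(String a, String b) {
            int cmp = Integer.compare(seq(a), seq(b));
            return cmp != 0 ? cmp : a.compareTo(b);
        }
        private int seq(String n) {
            return Integer.parseInt(n.substring(n.lastIndexOf("-n_") + 3));
        }
    });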

This is quite tricky to get right. I once created a patch for 4.10.3
by applying things in this order (some minor tweaks required), all of
them SOLR issues:
SOLR-6115
SOLR-6512
SOLR-6577
SOLR-6513
SOLR-6517
SOLR-6670
SOLR-6691

Good luck!
Erick




On Mon, Jan 12, 2015 at 8:54 AM, Zisis Tachtsidis <zi...@runbox.com> wrote:
> [full quote of the original post trimmed; it appears at the top of this thread]