Posted to solr-user@lucene.apache.org by Michael Roberts <mr...@tableau.com> on 2015/06/05 01:31:03 UTC

Peer Sync fails when newly added node is elected leader.

Hi,

I am seeing some unexpected behavior when adding a new machine to my cluster. I am running 4.10.3.

My setup has multiple collections, each collection has a single shard. I am using core auto discovery on the hosts (my deployment mechanism ensures that the directory structure is created and the core.properties file is in the right place).
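For concreteness, a minimal core.properties for auto discovery might look like the following (the path and values are illustrative, loosely mirroring the collection name in this thread, not taken from the actual deployment):

```properties
# <solr_home>/domain/core.properties  -- path illustrative
name=domain
collection=domain
shard=shard1
```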

To add a new machine I have to stop the cluster.

If I add a new machine and then start the cluster, and the new machine is elected leader for the shard, peer recovery fails. So now I have a leader with no content and replicas with content, and depending on where a read request is sent, I may or may not get the response I am expecting.

2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.update.PeerSync - PeerSync: core=domain url=http://10.36.9.70:11000/solr START replicas=[http://mlim:11000/solr/domain/] nUpdates=100
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.update.PeerSync - PeerSync: core=domain url=http://10.36.9.70:11000/solr DONE.  We have no versions.  sync failed.
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.36.9.70:11000/solr/domain/ shard1
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain baseURL=http://10.36.9.70:11000/solr
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary

This seems like a fairly common scenario, so I suspect either I am doing something incorrectly or I have an incorrect assumption about how this is supposed to work.

Does anyone have any suggestions?

Thanks

Mike.

Re: Peer Sync fails when newly added node is elected leader.

Posted by Erick Erickson <er...@gmail.com>.
And to pile on Shalin's comments: there is absolutely no reason
to pre-configure the replica on the new node, and quite a bit of
downside, as you are finding. Just add the new node without any
cores and use the ADDREPLICA command to create the replicas.
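As a sketch, ADDREPLICA is just an HTTP call to the Collections API (available since Solr 4.8, so it applies to 4.10.3). The helper below only builds the request URL; the host, port, and node names are made up for illustration, not taken from this thread's cluster:

```python
# Sketch: build an ADDREPLICA request for the Solr Collections API.
# Hostnames/ports below are illustrative placeholders.
from urllib.parse import urlencode

def addreplica_url(solr_base, collection, shard, node=None):
    """Build the ADDREPLICA URL; 'node' optionally pins the replica to a node."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node:
        params["node"] = node
    return f"{solr_base}/admin/collections?{urlencode(params)}"

url = addreplica_url("http://existinghost:11000/solr", "domain", "shard1",
                     node="newhost:11000_solr")
print(url)
# Issue it with e.g. urllib.request.urlopen(url) against a live cluster.
```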

Best,
Erick


Re: Peer Sync fails when newly added node is elected leader.

Posted by Michael Roberts <mr...@tableau.com>.
Thanks; that was the response I was expecting, unfortunately.

We have to stop the cluster to add a node, because Solr is part of a larger system and we don’t support partial shutdown or dynamic addition within that larger system.

“it waits for some time to see other nodes but if it finds none then it goes ahead and becomes the leader.”

That is not what I am seeing happen, though. In my example I had two machines: A (which had been running previously) and B (which was newly added). Both A and B participated in the election, and B was elected; it was not a case of only B being available. It seems that B should not be elected when there is a better candidate (A), or that, if elected, B should ensure it is caught up to its peers before marking itself as active.
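For what it's worth, you can see which replica is currently considered leader via the Collections API CLUSTERSTATUS action (also available in 4.10.x). Below is a small sketch that picks the leader out of a CLUSTERSTATUS-style response; the sample data is invented to mirror this thread's two-node setup and the exact JSON shape should be verified against your own cluster:

```python
def find_shard_leader(cluster_status, collection, shard):
    """Return the base_url of the replica marked leader, or None if there is none."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for replica in shards[shard]["replicas"].values():
        if replica.get("leader") == "true":  # leader flag as reported by CLUSTERSTATUS
            return replica["base_url"]
    return None

# Invented two-replica state resembling this thread's setup:
sample = {"cluster": {"collections": {"domain": {"shards": {"shard1": {"replicas": {
    "core_node1": {"state": "active", "base_url": "http://mlim:11000/solr",
                   "leader": "true"},
    "core_node2": {"state": "active", "base_url": "http://10.36.9.70:11000/solr"},
}}}}}}}

print(find_shard_leader(sample, "domain", "shard1"))  # -> http://mlim:11000/solr
```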


Re: Peer Sync fails when newly added node is elected leader.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Why do you stop the cluster while adding a node? That is why this is
happening. When the first node of a Solr cluster starts up, it waits for
some time to see other nodes, but if it finds none it goes ahead and
becomes the leader. If the other nodes had been up and running, peer sync
and replication recovery would have made sure that the node with data
became the leader. So just keep the cluster running while adding a new
node.

Also, stop relying on core discovery for setting up a node. At some point
we will stop supporting this feature. Use the collection API to add new
replicas.

On Fri, Jun 5, 2015 at 5:01 AM, Michael Roberts <mr...@tableau.com>
wrote:

> Hi,
>
> I am seeing some unexpected behavior when adding a new machine to my
> cluster. I am running 4.10.3.
>
> My setup has multiple collections, each collection has a single shard. I
> am using core auto discovery on the hosts (my deployment mechanism ensures
> that the directory structure is created and the core.properties file is in
> the right place).
>
> To add a new machine I have to stop the cluster.
>
> If I add a new machine, and start the cluster, if this new machine is
> elected leader for the shard, peer recovery fails. So, now I have a leader
> with no content, and replicas with content. Depending on where the read
> request is sent, I may or may not get the response I am expecting.
>
> 2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader
> process for shard shard1
> 2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see
> more replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to
> continue.
> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader
> - try and sync
> 2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
> http://10.36.9.70:11000/solr START replicas=[
> http://mlim:11000/solr/domain/] nUpdates=100
> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
> http://10.36.9.70:11000/solr DONE.  We have no versions.  sync failed.
> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we
> have no versions - we can't sync in that case - we were active before, so
> become leader anyway
> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader:
> http://10.36.9.70:11000/solr/domain/ shard1
> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain
> baseURL=http://10.36.9.70:11000/solr
> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
> org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary
>
> This seems like a fairly common scenario. So I suspect, either I am doing
> something incorrectly, or I have an incorrect assumption about how this is
> supposed to work.
>
> Does anyone have any suggestions?
>
> Thanks
>
> Mike.
>



-- 
Regards,
Shalin Shekhar Mangar.