Posted to solr-user@lucene.apache.org by Michael Roberts <mr...@tableau.com> on 2015/05/26 23:46:20 UTC
Sync failure after shard leader election when adding new replica.
Hi,
I have a SolrCloud setup, running 4.10.3. The setup consists of several cores, each with a single shard and initially each shard has a single replica (so, basically, one machine). I am using core discovery, and my deployment tools create an empty core on newly provisioned machines.
The scenario I am testing is: Machine 1 is running, and writes are flowing from my application to Solr. At some point I stop Machine 1 and reconfigure my application to add Machine 2. Both machines are then started.
What I would expect at this point is that Machine 2 cannot become leader, because it is behind Machine 1, and would instead restore from Machine 1.
However, looking at the logs, I see Machine 2 being elected leader after a failed PeerSync:
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr START replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr DONE. We have no versions. sync failed.
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.32.132.64:11000/solr/project/ shard1
What is the expected behavior here? What’s the best practice for adding a new replica? Should I have the SolrCloud running and do it via the Collections API, or can I continue to use core discovery?
Thanks.
Re: Sync failure after shard leader election when adding new replica.
Posted by Erick Erickson <er...@gmail.com>.
Please, please, please do _not_ try to use core discovery to add new
replicas by manually editing stuff.
bq: and my deployment tools create an empty core on newly provisioned machines.
This is a really bad idea (as you have discovered). Basically, your
deployment tools have to do everything right to get this to "play
nice" with SolrCloud. Your core names can't conflict. You have to
spell all the parameters in core.properties right. Etc. There are
endless places to go wrong. And this is all done for you (and tested
with unit tests) via the Collections API.
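To make the "spell all the parameters right" point concrete, here is a sketch of the kind of core.properties the Collections API writes for you (the names below are illustrative, not taken from this thread):

```properties
# core.properties as the Collections API would generate it
# (hand-writing these is where manual deployments go wrong)
name=project_shard1_replica2
collection=project
shard=shard1
coreNodeName=core_node2
```

Get any one of these wrong by hand and the core will register incorrectly in ZooKeeper.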
Assuming that in your scenario you started machine2 before machine1,
how would Solr have any clue that machine1 would _ever_ come back
up? It'll do the best it can and try to elect a leader, but there's
only one machine to choose from... and it's sorely out of date....
Absolutely use the Collections API to add replicas to running
SolrCloud clusters. And adding a replica via the Collections API
_will_ use core discovery, in that it'll cause a core.properties file
to be written on the node in question, populate it with all the
necessary parameters, initiate a sync from the (running) leader, put
itself into the query rotation automatically when the sync is done,
etc. All without you
1> having to figure all this out yourself
2> having to take the collection offline
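For example, adding a replica to a running cluster is a single Collections API call. This sketch just builds and prints the request URL; the host, port, collection, and node names are hypothetical, so substitute your own before running the curl line:

```shell
#!/bin/sh
# Hypothetical values -- replace with your cluster's actual names.
SOLR_HOST="machine2:8983"
COLLECTION="project"
NODE="machine2:8983_solr"

# ADDREPLICA (available since Solr 4.8) places a new replica of
# shard1 on the named node and syncs it from the current leader.
URL="http://${SOLR_HOST}/solr/admin/collections?action=ADDREPLICA&collection=${COLLECTION}&shard=shard1&node=${NODE}"
echo "$URL"
# curl "$URL"   # uncomment to issue the request against a live cluster
```

The new replica shows up in state.json, recovers from the leader, and joins the query rotation on its own.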
Best,
Erick
On Tue, May 26, 2015 at 2:46 PM, Michael Roberts <mr...@tableau.com> wrote: