Posted to solr-user@lucene.apache.org by Michael Roberts <mr...@tableau.com> on 2015/05/26 23:46:20 UTC
Sync failure after shard leader election when adding new replica.
Hi,
I have a SolrCloud setup, running 4.10.3. The setup consists of several cores, each with a single shard and initially each shard has a single replica (so, basically, one machine). I am using core discovery, and my deployment tools create an empty core on newly provisioned machines.
The scenario I am testing is: Machine 1 is running, and writes are flowing from my application to Solr. At some point I stop Machine 1 and reconfigure my application to add Machine 2. Both machines are then started.
What I would expect at this point is that Machine 2 cannot become leader, because it is behind Machine 1, and would instead restore from Machine 1.
However, looking at the logs, I see Machine 2 being elected leader after a failed PeerSync:
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr START replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr DONE. We have no versions. sync failed.
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.32.132.64:11000/solr/project/ shard1
What is the expected behavior here? What’s the best practice for adding a new replica? Should I have the SolrCloud running and do it via the Collections API, or can I continue to use core discovery?
Thanks.
Re: Sync failure after shard leader election when adding new replica.
Posted by Erick Erickson <er...@gmail.com>.
Please, please, please do _not_ try to use core discovery to add new
replicas by manually editing stuff.
bq: and my deployment tools create an empty core on newly provisioned machines.
This is a really bad idea (as you have discovered). Basically, your
deployment tools have to do everything right to get this to "play
nice" with SolrCloud. Your core names can't conflict. You have to
spell all the parameters in core.properties right. Etc. There are
endless places to go wrong. And this is all done for you (and tested
with unit tests) via the Collections API.
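To make the "spell all the parameters right" point concrete, here is a sketch of the kind of core.properties the Collections API writes for you (the names below are illustrative, not taken from this thread):

```properties
# core.properties as the Collections API would generate it
# (hand-writing these is where manual deployments go wrong)
name=project_shard1_replica2
collection=project
shard=shard1
coreNodeName=core_node2
```

Get any one of these wrong by hand and the core will register incorrectly in ZooKeeper.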
Assuming that in your scenario you started machine2 before machine1,
how would Solr have any clue that machine1 would _ever_ come back
up? It'll do the best it can and try to elect a leader, but there's
only one machine to choose from... and it's sorely out of date....
Absolutely use the Collections API to add replicas to running
SolrCloud clusters. And adding a replica via the Collections API
_will_ use core discovery, in that it'll cause a core.properties file
to be written on the node in question, populate it with all the
necessary parameters, initiate a sync from the (running) leader, put
itself into the query rotation automatically when the sync is done,
etc. All without you
1> having to figure all this out yourself
2> having to take the collection offline
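For example, adding a replica to a running cluster is a single Collections API call. This sketch just builds and prints the request URL; the host, port, collection, and node names are hypothetical, so substitute your own before running the curl line:

```shell
#!/bin/sh
# Hypothetical values -- replace with your cluster's actual names.
SOLR_HOST="machine2:8983"
COLLECTION="project"
NODE="machine2:8983_solr"

# ADDREPLICA (available since Solr 4.8) places a new replica of
# shard1 on the named node and syncs it from the current leader.
URL="http://${SOLR_HOST}/solr/admin/collections?action=ADDREPLICA&collection=${COLLECTION}&shard=shard1&node=${NODE}"
echo "$URL"
# curl "$URL"   # uncomment to issue the request against a live cluster
```

The new replica shows up in state.json, recovers from the leader, and joins the query rotation on its own.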
Best,
Erick
On Tue, May 26, 2015 at 2:46 PM, Michael Roberts <mr...@tableau.com> wrote: