Posted to solr-user@lucene.apache.org by jimtronic <ji...@gmail.com> on 2016/10/24 17:23:13 UTC

Solr Cloud A/B Deployment Issue

We are running into a timing issue when trying to do a scripted deployment of
our Solr Cloud cluster.

Scenario to reproduce (sometimes):

1. launch 3 clean solr nodes connected to zookeeper.
2. create a 1 shard collection with replicas on each node.
3. load data (more will make the problem worse)
4. launch 3 more nodes
5. add replicas to each new node
6. once entire cluster is healthy, start killing first three nodes.
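The steps above can be sketched against the Collections API roughly as follows. This is a minimal sketch: the collection name (mycoll), host names (oldnode1, newnode1, ...), and port are placeholders, not details from this thread.

```shell
#!/bin/sh
# Sketch of the repro steps via the Solr Collections API.
# Collection, host, and port names are assumptions.
COLL=mycoll
BASE="http://oldnode1:8983/solr/admin/collections"

# Step 2: a 1-shard collection with a replica on each of the 3 old nodes.
echo "${BASE}?action=CREATE&name=${COLL}&numShards=1&replicationFactor=3"

# Step 5: once the 3 new nodes are up, add a replica on each of them.
for node in newnode1 newnode2 newnode3; do
  echo "${BASE}?action=ADDREPLICA&collection=${COLL}&shard=shard1&node=${node}:8983_solr"
done
# Against a live cluster, each echoed URL would be fetched with curl.
```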

Depending on the timing, the second three nodes all end up in the RECOVERING
state without a leader.

This appears to be happening because when the first leader dies, all the new
nodes go into full replication recovery, and if all the old boxes happen to
die while that is in progress, the new boxes are stuck. They cannot serve
requests and eventually (1-8 hours) go into the RECOVERY_FAILED state.

This state is easy to fix with a FORCELEADER call to the collections API,
but that's only remediation, not prevention.
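The FORCELEADER remediation might look like the following. Collection, shard, and host names are placeholders.

```shell
#!/bin/sh
# FORCELEADER asks Solr to force a leader election for a shard that is
# stuck without one. Names here are assumptions, not from the thread.
COLL=mycoll
SHARD=shard1
URL="http://newnode1:8983/solr/admin/collections?action=FORCELEADER&collection=${COLL}&shard=${SHARD}"
echo "$URL"
# curl "$URL"   # issue against any live node in the cluster
```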

My question is this: Why do the new nodes have to go into full replication
recovery when they are already up to date? I just added the replica, so it
shouldn't have to do a full replication again.

Jim




--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-A-B-Deployment-Issue-tp4302810.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Cloud A/B Deployment Issue

Posted by jimtronic <ji...@gmail.com>.
Also, if we issue a delete-by-query where the query is "_version_:0", that
also creates a transaction log, and the cluster then has no trouble
transferring leadership between old and new nodes.

Still, it seems like when we ADDREPLICA, some sort of transaction log should
be started. 
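The delete-by-query workaround above might be issued like this; the collection and host names are assumptions.

```shell
#!/bin/sh
# A delete-by-query on _version_:0 matches no real document, but the
# update still writes a transaction-log entry on each replica.
# Collection and host names are placeholders.
COLL=mycoll
URL="http://newnode1:8983/solr/${COLL}/update?commit=true"
BODY='{"delete":{"query":"_version_:0"}}'
echo "$URL"
echo "$BODY"
# curl -H 'Content-Type: application/json' -d "$BODY" "$URL"
```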

Jim




Re: Solr Cloud A/B Deployment Issue

Posted by jimtronic <ji...@gmail.com>.
It appears this has all been resolved by the following ticket:

https://issues.apache.org/jira/browse/SOLR-9446

My scenario fails in 6.2.1, but works in 6.3 and Master where this bug has
been fixed.

In the meantime, we can use our workaround to issue a simple delete command
that deletes a non-existent document.
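That delete-of-a-non-existent-document workaround could look like the following; the document id here is a placeholder, not one from the thread.

```shell
#!/bin/sh
# Deleting an id that does not exist is a no-op for the index, but it
# still creates a transaction-log entry. The id is a placeholder.
COLL=mycoll
BODY='{"delete":{"id":"does-not-exist"}}'
echo "$BODY"
# curl -H 'Content-Type: application/json' \
#      -d "$BODY" "http://newnode1:8983/solr/${COLL}/update?commit=true"
```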

Jim




Re: Solr Cloud A/B Deployment Issue

Posted by jimtronic <ji...@gmail.com>.
Interestingly, if I simply add one document to the full cluster after all 6
nodes are active, this entire problem goes away. This appears to be because
a transaction log entry is created, which in turn prevents the new nodes from
going into full replication recovery upon leader change.

Adding a document is a hacky solution, however. It seems like new nodes that
were added via ADDREPLICA should know more about versions than they
currently do.
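The "add one document" workaround might look like this; the document id, field, and host names are assumptions.

```shell
#!/bin/sh
# Index a single document after all 6 nodes are active so each new
# replica gets a transaction-log entry. The doc id is a placeholder.
COLL=mycoll
BODY='[{"id":"warmup-doc-1"}]'
echo "$BODY"
# curl -H 'Content-Type: application/json' \
#      -d "$BODY" "http://newnode1:8983/solr/${COLL}/update?commit=true"
```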






Re: Solr Cloud A/B Deployment Issue

Posted by Pushkar Raste <pu...@gmail.com>.
This is due to leader-initiated recovery. Take a look at

https://issues.apache.org/jira/browse/SOLR-9446

On Oct 24, 2016 1:23 PM, "jimtronic" <ji...@gmail.com> wrote:

> We are running into a timing issue when trying to do a scripted deployment
> of
> our Solr Cloud cluster.
>
> Scenario to reproduce (sometimes):
>
> 1. launch 3 clean solr nodes connected to zookeeper.
> 2. create a 1 shard collection with replicas on each node.
> 3. load data (more will make the problem worse)
> 4. launch 3 more nodes
> 5. add replicas to each new node
> 6. once entire cluster is healthy, start killing first three nodes.
>
> Depending on the timing, the second three nodes end up all in RECOVERING
> state without a leader.
>
> This appears to be happening because when the first leader dies, all the
> new
> nodes go into full replication recovery and if all the old boxes happen to
> die during that state, the boxes are stuck. The boxes cannot serve requests
> and they eventually (1-8 hours) go into RECOVERY_FAILED state.
>
> This state is easy to fix with a FORCELEADER call to the collections API,
> but that's only remediation, not prevention.
>
> My question is this: Why do the new nodes have to go into full replication
> recovery when they are already up to date? I just added the replica, so it
> shouldn't have to a new full replication again.
>
> Jim
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Solr-Cloud-A-B-Deployment-Issue-tp4302810.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Solr Cloud A/B Deployment Issue

Posted by jimtronic <ji...@gmail.com>.
Great. Thanks for the work on this patch!

Jim


