Posted to solr-user@lucene.apache.org by Jon Hawkesworth <jo...@MEDQUIST.onmicrosoft.com> on 2016/08/25 21:00:53 UTC

solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

Has anyone got any suggestions for fixing a SolrCloud 6.0.1 replica that remains down?

Today we stopped all loading and querying, brought down all 4 Solr nodes, went into ZooKeeper, deleted everything under /collections/transcribedReports/leader_initiated_recovery/shard1/ and brought the cluster back up. This seemed reasonably similar to the situation in https://issues.apache.org/jira/browse/SOLR-7021, where this workaround is described, albeit for an older version of Solr.

After a while things looked OK, but when we attempted to move the second replica back to the original node (by creating a third replica and then deleting the temporary one, which wasn't on the node we wanted), we immediately got a 'down' status on that node (and it has stayed that way ever since), with 'Could not publish as ACTIVE after succesful recovery' messages appearing in the logs.
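
For readers following along, the replica move described above corresponds to two Solr Collections API calls: ADDREPLICA, then DELETEREPLICA. The sketch below only builds the request URLs and sends nothing. The host, node name and replica name are placeholders chosen for illustration, not values from this cluster.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL. Nothing is sent over the network."""
    query = urlencode({"action": action, **params, "wt": "json"})
    return f"{base_url}/admin/collections?{query}"


# Hypothetical values: substitute your own host, node name and replica name.
BASE = "http://solr-node1:8983/solr"

# Step 1: create the new replica of shard1 on the node where it should live.
add_url = collections_api_url(
    BASE, "ADDREPLICA",
    collection="transcribedReports", shard="shard1",
    node="solr-node1:8983_solr",
)

# Step 2: once the new replica is active, drop the temporary one.
delete_url = collections_api_url(
    BASE, "DELETEREPLICA",
    collection="transcribedReports", shard="shard1",
    replica="core_node9",
)

print(add_url)
print(delete_url)
```

Issuing the DELETEREPLICA only after the new replica reports "active" avoids leaving the shard with fewer healthy copies than it started with.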

It's as if there is something specifically wrong with that node that stops us from ever having a functioning replica of shard1 on it.

The weird thing is that shard2 on the same (problematic) node seems fine.

Other things we have tried include:

issuing a REQUESTRECOVERY
moving from 2 to 4 nodes
adding more replicas on other nodes (new replicas immediately go into down state and stay that way).
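
As a point of reference, REQUESTRECOVERY is a CoreAdmin API action and is sent to the node that hosts the affected core. A minimal sketch follows; the node URL and core name in the usage comment are assumptions for illustration, and the real core name can be read from the Solr admin UI or from CLUSTERSTATUS.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def request_recovery_url(node_base_url, core_name):
    """Build the CoreAdmin REQUESTRECOVERY URL for one core on one node."""
    query = urlencode(
        {"action": "REQUESTRECOVERY", "core": core_name, "wt": "json"}
    )
    return f"{node_base_url}/admin/cores?{query}"


def request_recovery(node_base_url, core_name, timeout=30):
    """Send the request and return the raw response body as text."""
    url = request_recovery_url(node_base_url, core_name)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical usage (not executed here):
# request_recovery("http://solr-node1:8983/solr",
#                  "transcribedReports_shard1_replica3")
```

The call only asks the core to start recovery; whether recovery succeeds still has to be checked in the cluster state and the logs.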

The system is SolrCloud 6.0.1 running on 4 nodes. There is 1 collection with 4 shards, and I'm trying to have 2 replicas on each of the 4 nodes.
Currently each shard holds approximately 1.2 million docs (mostly plain text, usually 10-20k in size each).

Any suggestions would be gratefully received.

Many thanks,

Jon


Jon Hawkesworth
Software Developer

Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkesworth@mmodal.com
www.mmodal.com <http://www.medquist.com/>

This electronic mail transmission contains confidential information intended only for the person(s) named. Any use, distribution, copying or disclosure by another person is strictly prohibited. If you are not the intended recipient of this e-mail, promptly delete it and all attachments.


Re: solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

Posted by Erick Erickson <er...@gmail.com>.
This is odd. A replica created with ADDREPLICA _should_ be listed as "down"
immediately, but should shortly go to "recovering" and then "active". The
transition to "active" may take a while as the index has to be copied from
the leader, but you shouldn't be stuck at "down" for very long.
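
One way to watch for that transition is to fetch the Collections API CLUSTERSTATUS response as JSON and list the replicas whose state is not "active". The sketch below works on an already-parsed response. The sample dictionary is hand-written placeholder data that follows the CLUSTERSTATUS layout; it is not output from the cluster discussed in this thread.

```python
def non_active_replicas(cluster_status, collection):
    """Return (shard, replica, node_name, state) for replicas not 'active'."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    stuck = []
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard["replicas"].items()):
            if replica.get("state") != "active":
                stuck.append(
                    (shard_name, replica_name,
                     replica.get("node_name"), replica.get("state"))
                )
    return stuck


# Placeholder data for illustration only.
sample = {
    "cluster": {"collections": {"transcribedReports": {"shards": {
        "shard1": {"replicas": {
            "core_node1": {"state": "active",
                           "node_name": "solr-node2:8983_solr"},
            "core_node5": {"state": "down",
                           "node_name": "solr-node1:8983_solr"},
        }},
        "shard2": {"replicas": {
            "core_node2": {"state": "active",
                           "node_name": "solr-node1:8983_solr"},
        }},
    }}}}
}

for row in non_active_replicas(sample, "transcribedReports"):
    print(row)
```

Running this check every few seconds shows whether a new replica ever leaves "down" for "recovering".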

Take a look at the Solr logs for both the leader of the shard and the
replica you're trying to add. They often have more complete and helpful
error messages.

Also note that you occasionally have to be patient. For instance, there's a
3 minute wait period for leader election at times. It sounds, though, like
things aren't getting better for far longer than 3 minutes.

Best,
Erick
