Posted to solr-user@lucene.apache.org by "anand.mahajan" <an...@zerebral.co.in> on 2015/01/20 16:05:35 UTC

Leaders in Recovery Failed state

Hi all,

I have a cluster with 36 shards and 3 replicas per shard. I had to recently
restart the entire cluster - most of the shards & replicas are back up, but
a few shards have not had any leaders for a long time (close to 18 hours
now). I tried reloading these cores and even the servlet containers hosting
these cores. It's only now that all the shards have leaders allocated - but
a few of these leaders are still shown in Recovery Failed status on the
Solr Cloud tree view.

I see the following in the logs for these shards -

INFO  - 2015-01-20 14:38:19.797;
org.apache.solr.handler.admin.CoreAdminHandler; In WaitForState(recovering):
collection=collection1, shard=shard1, thisCore=collection1_shard1_replica3,
leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true,
currentState=recovering, localState=recovery_failed,
nodeName=10.68.77.9:8983_solr, coreNodeName=core_node2,
onlyIfActiveCheckResult=true, nodeProps:
core_node2:{"state":"recovering","core":"collection1_shard1_replica1","node_name":"10.68.77.9:8983_solr","base_url":"http://10.68.77.9:8983/solr"}

And on the other server hosting the replica for this shard -

ERROR - 2015-01-20 14:38:20.768; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for shard3 in collection1 on 10.68.77.9:8983_solr but I still do
not see the requested state. I see state: recovering live:true leader from
ZK: http://10.68.77.3:8983/solr/collection1_shard3_replica3/
        at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:999)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:245)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:368)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)

I see that there is no replica catch-up going on between any of these
servers now.

A couple of questions -
1. What is it that Solr Cloud is waiting on to allocate the leaders for
such shards?
2. Why do a few of these shards show leaders in Recovery Failed state? And
how do I recover such shards?

Thanks,
Anand
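
PS: For completeness - I am reading the leader state off the Solr Cloud tree
view; as far as I can tell, the same per-shard leader/replica state can also
be pulled from the Collections API CLUSTERSTATUS action (available since
Solr 4.8), e.g. against one of the nodes above:

http://10.68.77.9:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&shard=shard1,shard3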




Re: Leaders in Recovery Failed state

Posted by "anand.mahajan" <an...@zerebral.co.in>.
Erick Erickson <erickerickson <at> gmail.com> writes:

> 
> What version of Solr?
> 
> On Tue, Jan 20, 2015 at 7:07 AM, anand.mahajan <anand <at> zerebral.co.in>
> wrote:
> > [original post quoted in full - snipped; see the full text above]


Hi Erick, sorry I did not reply earlier. I see this page cached on
gmane.org, but the original post I made on the Solr Users list does not show
your comment -
http://lucene.472066.n3.nabble.com/Leaders-in-Recovery-Failed-state-td4180610.html

I'm on Solr 4.10.1. The last time this happened, I removed the replicas for
the affected shards (the shards where the leaders were shown as Down),
deleted the replica data directories, and then added the replicas back using
the Collections API - that did the trick then (but I'm not sure if it was
the right way to do it). Also, the problem seemed to stem from the fact that
the Zookeeper instances were on the same machines as the Solr servlet
containers, and perhaps the Zookeeper instances were starved of resources
(CPU & disk). I have since moved the Zookeeper instances out to separate
servers, and that makes the boot time fast - but not all shards come online
when all the Solr Cloud instances are rebooted. A few servers from the Solr
cluster went down again, and I have the same issue where for 3 shards the
leaders are shown as Down; the log files for these instances show the
following -

INFO  - 2015-02-09 05:18:13.696;
org.apache.solr.handler.admin.CoreAdminHandler; In WaitForState(recovering):
collection=collection1, shard=shard10,
thisCore=collection1_shard10_replica2, leaderDoesNotNeedRecovery=false,
isLeader? true, live=true, checkLive=true, currentState=recovering,
localState=down, nodeName=10.68.77.8:8983_solr, coreNodeName=core_node28,
onlyIfActiveCheckResult=true, nodeProps:
core_node28:{"state":"recovering","core":"collection1_shard10_replica1","node_name":"10.68.77.8:8983_solr","base_url":"http://10.68.77.8:8983/solr"}
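
For reference, the delete-and-re-add sequence I used last time was roughly
the following Collections API calls (using the shard10/core_node28 names
from the log above - in practice the exact replica name has to be read from
clusterstate.json):

http://10.68.77.8:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard10&replica=core_node28
http://10.68.77.8:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard10&node=10.68.77.8:8983_solr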

I have tried deleting the replicas for these shards - but this time the
Delete Replica async requests have been stuck in the "submitted" state for a
very long time now (over 3 hours) - last time I did this, these requests
finished fairly quickly.
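
For what it's worth, I am checking these async requests with the
REQUESTSTATUS action, passing the same id that was supplied as the async=
parameter on the DELETEREPLICA call (the request id below is just a
placeholder):

http://10.68.77.8:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1000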

Any pointers are greatly appreciated.

Thanks,
Anand



