Posted to solr-user@lucene.apache.org by cwhi <ch...@gmail.com> on 2013/12/18 19:01:30 UTC

Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

I called SPLITSHARD on a shard in an existing SolrCloud instance, where the
shard had ~1 million documents in it. It has been about 3 hours since the
split completed, and the sub-shards are still stuck in a "Down" state. They
are reported as down in localhost/solr/#/~cloud, and I'm unable to query my
index.
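
For reference, a SPLITSHARD request of the kind described above might be built roughly as in the sketch below. The host, port, and the helper name splitshard_url are assumptions made for illustration; only the action, collection, and shard parameters come from the Solr Collections API.

```python
from urllib.parse import urlencode


def splitshard_url(base_url, collection, shard):
    """Build a Collections API URL that asks Solr to split one shard."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


# Hypothetical host and port; substitute a live node from your own cluster.
print(splitshard_url("http://localhost:8983/solr", "collection1", "shard1"))
```

The sketch only builds the URL; sending it (for example with a browser or curl) is what actually starts the split.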

How can we recover from a failed SPLITSHARD operation?



--
View this message in context: http://lucene.472066.n3.nabble.com/Shards-stuck-in-down-state-after-splitting-shard-How-can-we-recover-from-a-failed-SPLITSHARD-tp4107297.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

Posted by cwhi <ch...@gmail.com>.

Thanks for your reply, Anshum. I took a look at clusterstate.json, and it
seems the sub-shards are stuck in "construction" while the original shards
are still active. I'm able to query my index again (that seems to have been
an unrelated issue), but I'd still like to remove these stuck shards and
recreate them (or fix the existing ones).




Re: Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

Posted by cwhi <ch...@gmail.com>.

Thanks again for your replies. I'm using Solr 4.6. I just tried splitting
another shard so I could grab the exceptions from the logs, and here is the
log output: <http://pastebin.com/7uC5PQsa>

I noticed a few obvious exceptions that might have caused this to fail,
such as this:

ERROR - 2013-12-20 20:18:24.231; org.apache.solr.core.CoreContainer; Unable to create core: collection1_shard3_1_replica1
java.lang.RuntimeException: java.io.IOException: Error opening /configs/config1/stopwords.txt
	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:169)
	at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
	at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
	at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:254)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:590)
	at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:498)
	at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:152)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:662)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Error opening /configs/config1/stopwords.txt
	at org.apache.solr.cloud.ZkSolrResourceLoader.openResource(ZkSolrResourceLoader.java:83)
	at org.apache.lucene.analysis.util.AbstractAnalysisFactory.getLines(AbstractAnalysisFactory.java:255)
	at org.apache.lucene.analysis.util.AbstractAnalysisFactory.getWordSet(AbstractAnalysisFactory.java:243)
	at org.apache.lucene.analysis.core.StopFilterFactory.inform(StopFilterFactory.java:99)
	at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:655)
	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:167)
	... 35 more


That exception claims that it can't read stopwords.txt, but the file is
definitely present locally at solr/conf/stopwords.txt, and it's present in
ZooKeeper at /configs/config1/stopwords.txt (I just checked with zkCli.cmd).
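
When scanning a long log for failures like this one, it can help to pull out only the innermost "Caused by:" message of each trace. The following is a minimal sketch of that idea, not a Solr tool; the function name root_cause is hypothetical, and the sample text is a shortened copy of the trace above.

```python
def root_cause(trace_text):
    """Return the message of the last 'Caused by:' line, or the first line if none."""
    lines = [ln.strip() for ln in trace_text.splitlines() if ln.strip()]
    causes = [ln for ln in lines if ln.startswith("Caused by:")]
    if causes:
        # The last 'Caused by:' entry is the innermost exception.
        return causes[-1][len("Caused by:"):].strip()
    return lines[0] if lines else ""


# Shortened copy of the trace from this message.
sample = """java.lang.RuntimeException: java.io.IOException: Error opening /configs/config1/stopwords.txt
\tat org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:169)
Caused by: java.io.IOException: Error opening /configs/config1/stopwords.txt
\tat org.apache.solr.cloud.ZkSolrResourceLoader.openResource(ZkSolrResourceLoader.java:83)
"""

print(root_cause(sample))
```

The sketch assumes each "Caused by:" entry sits on a single line, so a trace whose lines were wrapped by a mail client would need to be rejoined first.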





Re: Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

Posted by Anshum Gupta <an...@anshumgupta.net>.

Looking at this, it doesn't look like the operation completed. Also, the
parent shard seems to be intact and ideally should have served the results.
Until splitting and replication complete, the sub-shards don't go active
(and the parent shard doesn't go inactive).
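
The state rule described above can be sketched in Python as follows. This is only an illustration: it assumes the "shards" section of clusterstate.json has already been loaded into a dict, and that a finished split leaves the parent in state "inactive" with every sub-shard "active". The function names are made up for this example, and the sample data is a trimmed copy of the cluster state posted in this thread.

```python
def split_status(shards):
    """Map each parent shard name to (parent_state, {sub_shard: state})."""
    status = {}
    for name, info in shards.items():
        parent = info.get("parent")
        if parent is None:
            continue  # not a sub-shard
        parent_state = shards.get(parent, {}).get("state")
        entry = status.setdefault(parent, (parent_state, {}))
        entry[1][name] = info.get("state")
    return status


def split_finished(parent_state, sub_states):
    """True only when the parent is inactive and every sub-shard is active."""
    return parent_state == "inactive" and all(
        state == "active" for state in sub_states.values()
    )


# Trimmed copy of the cluster state from this thread.
shards = {
    "shard1": {"state": "active"},
    "shard1_0": {"state": "construction", "parent": "shard1"},
    "shard1_1": {"state": "construction", "parent": "shard1"},
}

for parent, (state, subs) in split_status(shards).items():
    verdict = "finished" if split_finished(state, subs) else "not finished"
    print(parent, state, subs, verdict)
```

With this data the split of shard1 is reported as not finished, which matches the point that the parent is still the shard serving queries.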

Can you give me more information on this? What version of Solr are you
using? Also, exceptions or messages from the logs would help provide more
context.


On Fri, Dec 20, 2013 at 7:58 AM, cwhi <ch...@gmail.com> wrote:

> My apologies, I forgot to paste the output of clusterstate.json to my last
> post.  Here it is:
>
> [zk: localhost:2181(CONNECTED) 1] get /clusterstate.json
> {"collection1":{
>     "shards":{
>       "shard1":{
>         "range":"80000000-d554ffff",
>         "state":"active",
>         "replicas":{"10.0.0.229:8443_solr_collection1":{
>             "state":"active",
>             "base_url":"http://10.0.0.229:8443/solr",
>             "core":"collection1",
>             "node_name":"10.0.0.229:8443_solr",
>             "leader":"true"}}},
>       "shard2":{
>         "range":"d5550000-2aa9ffff",
>         "state":"active",
>         "replicas":{"10.0.0.5:8443_solr_collection1":{
>             "state":"active",
>             "base_url":"http://10.0.0.5:8443/solr",
>             "core":"collection1",
>             "node_name":"10.0.0.5:8443_solr",
>             "leader":"true"}}},
>       "shard3":{
>         "range":"2aaa0000-7fffffff",
>         "state":"active",
>         "replicas":{"10.0.0.246:8443_solr_collection1":{
>             "state":"active",
>             "base_url":"http://10.0.0.246:8443/solr",
>             "core":"collection1",
>             "node_name":"10.0.0.246:8443_solr",
>             "leader":"true"}}},
>       "shard1_0":{
>         "range":"80000000-aaa9ffff",
>         "state":"construction",
>         "parent":"shard1",
>         "replicas":{"10.0.0.229:8443_solr_collection1_shard1_0_replica1":{
>             "state":"down",
>             "base_url":"http://10.0.0.229:8443/solr",
>             "core":"collection1_shard1_0_replica1",
>             "node_name":"10.0.0.229:8443_solr"}}},
>       "shard1_1":{
>         "range":"aaaa0000-d554ffff",
>         "state":"construction",
>         "parent":"shard1",
>         "replicas":{"10.0.0.229:8443_solr_collection1_shard1_1_replica1":{
>             "state":"down",
>             "base_url":"http://10.0.0.229:8443/solr",
>             "core":"collection1_shard1_1_replica1",
>             "node_name":"10.0.0.229:8443_solr"}}},
>       "shard2_0":{
>         "range":"d5550000-fffeffff",
>         "state":"construction",
>         "parent":"shard2",
>         "replicas":{"10.0.0.5:8443_solr_collection1_shard2_0_replica1":{
>             "state":"down",
>             "base_url":"http://10.0.0.5:8443/solr",
>             "core":"collection1_shard2_0_replica1",
>             "node_name":"10.0.0.5:8443_solr",
>             "leader":"true"}}},
>       "shard2_1":{
>         "range":"ffff0000-2aa9ffff",
>         "state":"construction",
>         "parent":"shard2",
>         "replicas":{"10.0.0.5:8443_solr_collection1_shard2_1_replica1":{
>             "state":"down",
>             "base_url":"http://10.0.0.5:8443/solr",
>             "core":"collection1_shard2_1_replica1",
>             "node_name":"10.0.0.5:8443_solr",
>             "leader":"true"}}}},
>     "maxShardsPerNode":"1",
>     "router":{"name":"compositeId"},
>     "replicationFactor":"1",
>     "autoCreated":"true"}}



-- 

Anshum Gupta
http://www.anshumgupta.net

Re: Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

Posted by cwhi <ch...@gmail.com>.

My apologies, I forgot to paste the output of clusterstate.json to my last
post.  Here it is:

[zk: localhost:2181(CONNECTED) 1] get /clusterstate.json
{"collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-d554ffff",
        "state":"active",
        "replicas":{"10.0.0.229:8443_solr_collection1":{
            "state":"active",
            "base_url":"http://10.0.0.229:8443/solr",
            "core":"collection1",
            "node_name":"10.0.0.229:8443_solr",
            "leader":"true"}}},
      "shard2":{
        "range":"d5550000-2aa9ffff",
        "state":"active",
        "replicas":{"10.0.0.5:8443_solr_collection1":{
            "state":"active",
            "base_url":"http://10.0.0.5:8443/solr",
            "core":"collection1",
            "node_name":"10.0.0.5:8443_solr",
            "leader":"true"}}},
      "shard3":{
        "range":"2aaa0000-7fffffff",
        "state":"active",
        "replicas":{"10.0.0.246:8443_solr_collection1":{
            "state":"active",
            "base_url":"http://10.0.0.246:8443/solr",
            "core":"collection1",
            "node_name":"10.0.0.246:8443_solr",
            "leader":"true"}}},
      "shard1_0":{
        "range":"80000000-aaa9ffff",
        "state":"construction",
        "parent":"shard1",
        "replicas":{"10.0.0.229:8443_solr_collection1_shard1_0_replica1":{
            "state":"down",
            "base_url":"http://10.0.0.229:8443/solr",
            "core":"collection1_shard1_0_replica1",
            "node_name":"10.0.0.229:8443_solr"}}},
      "shard1_1":{
        "range":"aaaa0000-d554ffff",
        "state":"construction",
        "parent":"shard1",
        "replicas":{"10.0.0.229:8443_solr_collection1_shard1_1_replica1":{
            "state":"down",
            "base_url":"http://10.0.0.229:8443/solr",
            "core":"collection1_shard1_1_replica1",
            "node_name":"10.0.0.229:8443_solr"}}},
      "shard2_0":{
        "range":"d5550000-fffeffff",
        "state":"construction",
        "parent":"shard2",
        "replicas":{"10.0.0.5:8443_solr_collection1_shard2_0_replica1":{
            "state":"down",
            "base_url":"http://10.0.0.5:8443/solr",
            "core":"collection1_shard2_0_replica1",
            "node_name":"10.0.0.5:8443_solr",
            "leader":"true"}}},
      "shard2_1":{
        "range":"ffff0000-2aa9ffff",
        "state":"construction",
        "parent":"shard2",
        "replicas":{"10.0.0.5:8443_solr_collection1_shard2_1_replica1":{
            "state":"down",
            "base_url":"http://10.0.0.5:8443/solr",
            "core":"collection1_shard2_1_replica1",
            "node_name":"10.0.0.5:8443_solr",
            "leader":"true"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"1",
    "autoCreated":"true"}}
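
One sanity check that can be run on output like this is whether the sub-shard hash ranges exactly cover their parent's range. The sketch below does that under one assumption that is not stated in the output itself: each range is a pair of signed 32-bit hash values written in hex, which is the reading under which shard2's range d5550000-2aa9ffff (a negative start and a positive end) makes sense. The helper names are hypothetical.

```python
def to_signed32(hex_str):
    """Interpret an 8-digit hex string as a signed 32-bit integer."""
    value = int(hex_str, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


def parse_range(range_str):
    """Split 'start-end' into a pair of signed 32-bit integers."""
    start, end = range_str.split("-")
    return to_signed32(start), to_signed32(end)


def covers_parent(parent_range, sub_ranges):
    """Check that the sub-ranges tile the parent range with no gap or overlap."""
    p_start, p_end = parse_range(parent_range)
    subs = sorted(parse_range(r) for r in sub_ranges)
    if not subs or subs[0][0] != p_start or subs[-1][1] != p_end:
        return False
    # Each sub-range must begin exactly one past the end of the previous one.
    return all(prev[1] + 1 == nxt[0] for prev, nxt in zip(subs, subs[1:]))


# Ranges copied from the clusterstate.json output above.
print(covers_parent("80000000-d554ffff", ["80000000-aaa9ffff", "aaaa0000-d554ffff"]))
print(covers_parent("d5550000-2aa9ffff", ["d5550000-fffeffff", "ffff0000-2aa9ffff"]))
```

Both checks print True for the ranges above, so the sub-shard ranges themselves look consistent; only the states are stuck.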




Re: Shards stuck in "down" state after splitting shard - How can we recover from a failed SPLITSHARD?

Posted by Anshum Gupta <an...@anshumgupta.net>.

Hi,

Is the parent shard currently active? What does the clusterstate.json say?
The sub-shard could be stuck in the down state while it's trying to recover,
but as far as I remember, the sub-shards only get marked active (and the
parent goes inactive) once recovery and replication (for as many replicas as
the parent shard has) are complete.

On Wed, Dec 18, 2013 at 10:01 AM, cwhi <ch...@gmail.com> wrote:

> I called SPLITSHARD on a shard in an existing SolrCloud instance, where the
> shard had ~1 million documents in it.  It's been about 3 hours since that
> splitting has completed, and the subshards are still stuck in a "Down"
> state.  They are reported as down in localhost/solr/#/~cloud, and I'm unable
> to query my index.
>
> How can we recover from a failed SPLITSHARD operation?

-- 

Anshum Gupta
http://www.anshumgupta.net