Posted to solr-user@lucene.apache.org by Greg Preston <gp...@marinsoftware.com> on 2014/01/28 18:31:56 UTC

Dead node, but clusterstate.json says active, won't sync on restart

** Using solrcloud 4.4.0 **

I had to kill a running solrcloud node.  There is still a replica for that
shard, so everything is functional.  We've done some indexing while the
node was killed.

I'd like to bring back up the downed node and have it resync from the other
replica.  But when I restart the downed node, it joins back up as active
immediately, and doesn't resync.  I even wiped the data directory on the
downed node, hoping that would force it to sync on restart, but it doesn't.

I'm assuming this is related to the state still being listed as active in
clusterstate.json for the downed node?  Since it comes back as active, it's
serving queries and giving old results.

How can I force this node to do a recovery on restart?
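
Would the Core Admin API's REQUESTRECOVERY action be the right tool for
that?  A rough sketch of calling it from plain Java follows; the host
and core name are placeholders for our real setup, and I haven't tried
it, so I don't know whether it behaves any better than a restart:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ForceRecovery {
    public static void main(String[] args) throws Exception {
        // Placeholder host and core name -- substitute real values.
        // REQUESTRECOVERY asks a core to go back into recovery.
        URL url = new URL("http://localhost:8983/solr/admin/cores"
                + "?action=REQUESTRECOVERY&core=mycore&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);  // echo the admin response
        }
        in.close();
    }
}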

Thanks.


-Greg

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Greg Preston <gp...@marinsoftware.com>.
I've attached the log of the downed node (truffle-solr-4).
This is the relevant log entry from the node it should replicate from
(truffle-solr-5):

[29 Jan 2014 19:31:29] [qtp1614415528-74] ERROR (org.apache.solr.common.SolrException) - org.apache.solr.common.SolrException: I was asked to wait on state recovering for truffle-solr-4:8983_solr but I still do not see the requested state. I see state: active live:true
        at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:368)

You can see that 4 is serving queries.  It appears that 4 tries to recover
from 5, but 5 rejects the prep-recovery request because ZooKeeper still
shows 4 as active rather than recovering, so 5 gives up waiting for the
state change.  4 had an empty index and tlog when it was started.

We will eventually upgrade to 4.6.x or 4.7.x, but we've got a pretty
extensive regression testing cycle, so there is some delay in upgrading
versions.



-Greg


On Wed, Jan 29, 2014 at 9:08 AM, Mark Miller <ma...@gmail.com> wrote:

> What's in the logs of the node that won't recover on restart after
> clearing the index and tlog?
>
> - Mark
>
> On Jan 29, 2014, at 11:41 AM, Greg Preston <gp...@marinsoftware.com> wrote:
>
> >> If you removed the tlog and index and restart it should resync, or
> >> something is really crazy.
> >
> > It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
> > the sync failing for some reason if the node wasn't also serving queries.
> >
> >
> > -Greg
> >
> >
> >> On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller <ma...@gmail.com> wrote:
> >>
> >> Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
> >> a replication bug that may be involved.
> >>
> >> If you removed the tlog and index and restart it should resync, or
> >> something is really crazy.
> >>
> >> The clusterstate.json is a red herring. You have to merge the live nodes
> >> info with the state to know the real state.
> >>
> >> - Mark
> >>
> >> http://www.about.me/markrmiller
> >>
> >>> On Jan 28, 2014, at 12:31 PM, Greg Preston <gpreston@marinsoftware.com> wrote:
> >>>
> >>> ** Using solrcloud 4.4.0 **
> >>>
> >>> I had to kill a running solrcloud node.  There is still a replica for that
> >>> shard, so everything is functional.  We've done some indexing while the
> >>> node was killed.
> >>>
> >>> I'd like to bring back up the downed node and have it resync from the other
> >>> replica.  But when I restart the downed node, it joins back up as active
> >>> immediately, and doesn't resync.  I even wiped the data directory on the
> >>> downed node, hoping that would force it to sync on restart, but it doesn't.
> >>>
> >>> I'm assuming this is related to the state still being listed as active in
> >>> clusterstate.json for the downed node?  Since it comes back as active, it's
> >>> serving queries and giving old results.
> >>>
> >>> How can I force this node to do a recovery on restart?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> -Greg
> >>
>

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Mark Miller <ma...@gmail.com>.
What's in the logs of the node that won't recover on restart after clearing the index and tlog?

- Mark

On Jan 29, 2014, at 11:41 AM, Greg Preston <gp...@marinsoftware.com> wrote:

>> If you removed the tlog and index and restart it should resync, or
>> something is really crazy.
> 
> It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
> the sync failing for some reason if the node wasn't also serving queries.
> 
> 
> -Greg
> 
> 
>> On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller <ma...@gmail.com> wrote:
>> 
>> Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
>> a replication bug that may be involved.
>> 
>> If you removed the tlog and index and restart it should resync, or
>> something is really crazy.
>> 
>> The clusterstate.json is a red herring. You have to merge the live nodes
>> info with the state to know the real state.
>> 
>> - Mark
>> 
>> http://www.about.me/markrmiller
>> 
>>> On Jan 28, 2014, at 12:31 PM, Greg Preston <gp...@marinsoftware.com> wrote:
>>> 
>>> ** Using solrcloud 4.4.0 **
>>> 
>>> I had to kill a running solrcloud node.  There is still a replica for that
>>> shard, so everything is functional.  We've done some indexing while the
>>> node was killed.
>>> 
>>> I'd like to bring back up the downed node and have it resync from the other
>>> replica.  But when I restart the downed node, it joins back up as active
>>> immediately, and doesn't resync.  I even wiped the data directory on the
>>> downed node, hoping that would force it to sync on restart, but it doesn't.
>>> 
>>> I'm assuming this is related to the state still being listed as active in
>>> clusterstate.json for the downed node?  Since it comes back as active, it's
>>> serving queries and giving old results.
>>> 
>>> How can I force this node to do a recovery on restart?
>>> 
>>> Thanks.
>>> 
>>> 
>>> -Greg
>> 

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Greg Preston <gp...@marinsoftware.com>.
> If you removed the tlog and index and restart it should resync, or
> something is really crazy.

It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
the sync failing for some reason if the node wasn't also serving queries.


-Greg


On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller <ma...@gmail.com> wrote:

> Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
> a replication bug that may be involved.
>
> If you removed the tlog and index and restart it should resync, or
> something is really crazy.
>
> The clusterstate.json is a red herring. You have to merge the live nodes
> info with the state to know the real state.
>
> - Mark
>
> http://www.about.me/markrmiller
>
> > On Jan 28, 2014, at 12:31 PM, Greg Preston <gp...@marinsoftware.com> wrote:
> >
> > ** Using solrcloud 4.4.0 **
> >
> > I had to kill a running solrcloud node.  There is still a replica for that
> > shard, so everything is functional.  We've done some indexing while the
> > node was killed.
> >
> > I'd like to bring back up the downed node and have it resync from the other
> > replica.  But when I restart the downed node, it joins back up as active
> > immediately, and doesn't resync.  I even wiped the data directory on the
> > downed node, hoping that would force it to sync on restart, but it doesn't.
> >
> > I'm assuming this is related to the state still being listed as active in
> > clusterstate.json for the downed node?  Since it comes back as active, it's
> > serving queries and giving old results.
> >
> > How can I force this node to do a recovery on restart?
> >
> > Thanks.
> >
> >
> > -Greg
>

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Mark Miller <ma...@gmail.com>.
Sounds like a bug. 4.6.1 is out any minute - you might try that. There was a replication bug that may be involved. 

If you removed the tlog and index and restart it should resync, or something is really crazy. 

The clusterstate.json is a red herring. You have to merge the live nodes info with the state to know the real state. 
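
Roughly, the merge looks like the SolrJ sketch below. The ZooKeeper
address and collection name are placeholders, so treat it as an
illustration of the idea rather than drop-in code:

import java.util.Set;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class EffectiveState {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble -- substitute your own.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.connect();
        ClusterState cs = server.getZkStateReader().getClusterState();
        Set<String> liveNodes = cs.getLiveNodes();

        for (Slice slice : cs.getSlices("collection1")) {  // placeholder name
            for (Replica replica : slice.getReplicas()) {
                String state = replica.getStr(ZkStateReader.STATE_PROP);
                boolean live = liveNodes.contains(
                        replica.getStr(ZkStateReader.NODE_NAME_PROP));
                // "active" in clusterstate.json only counts if the node
                // also has a live_nodes entry in ZooKeeper.
                System.out.println(slice.getName() + "/" + replica.getName()
                        + " state=" + state + " live=" + live
                        + " => " + (live ? state : "down"));
            }
        }
        server.shutdown();
    }
}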

- Mark

http://www.about.me/markrmiller

> On Jan 28, 2014, at 12:31 PM, Greg Preston <gp...@marinsoftware.com> wrote:
> 
> ** Using solrcloud 4.4.0 **
> 
> I had to kill a running solrcloud node.  There is still a replica for that
> shard, so everything is functional.  We've done some indexing while the
> node was killed.
> 
> I'd like to bring back up the downed node and have it resync from the other
> replica.  But when I restart the downed node, it joins back up as active
> immediately, and doesn't resync.  I even wiped the data directory on the
> downed node, hoping that would force it to sync on restart, but it doesn't.
> 
> I'm assuming this is related to the state still being listed as active in
> clusterstate.json for the downed node?  Since it comes back as active, it's
> serving queries and giving old results.
> 
> How can I force this node to do a recovery on restart?
> 
> Thanks.
> 
> 
> -Greg

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Joel Bernstein <jo...@gmail.com>.
Hi Greg,

Try unloading the core using the core admin screen, then re-attach the
core to the correct collection and shard in that same screen.  If it's
the only core in the server, the admin screen may not function properly,
so you'll have to re-attach using the Core Admin HTTP API instead.
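
If you do end up on the HTTP API, a rough SolrJ equivalent of the
unload/re-attach steps looks something like this (the host, core,
collection, and shard names below are placeholders for your setup):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReattachCore {
    public static void main(String[] args) throws Exception {
        // Placeholder host and names -- substitute your own.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // 1) Unload the wedged core (leaves the index files on disk).
        CoreAdminRequest.unloadCore("mycore", server);

        // 2) Re-create the core, pointing it at the right collection and
        //    shard so it re-registers with ZooKeeper and recovers.
        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("mycore");
        create.setInstanceDir("mycore");  // relative to solr home
        create.setCollection("mycollection");
        create.setShardId("shard1");
        create.process(server);

        server.shutdown();
    }
}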

Joel Bernstein
Search Engineer at Heliosearch


On Tue, Jan 28, 2014 at 1:33 PM, Greg Preston <gp...@marinsoftware.com> wrote:

> Thanks for the idea.  I tried it, and the state for the bad node, even
> after an orderly shutdown, is still "active" in clusterstate.json.  I see
> this in the logs on restart:
>
> [28 Jan 2014 18:25:29] [RecoveryThread] ERROR (org.apache.solr.common.SolrException) - Error while trying to recover. core=marin:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for truffle-solr-4:8983_solr but I still do not see the requested state. I see state: active live:true
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>         at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
>
>
>
>
>
> -Greg
>
>
> On Tue, Jan 28, 2014 at 9:53 AM, Shawn Heisey <so...@elyograg.org> wrote:
>
> > On 1/28/2014 10:31 AM, Greg Preston wrote:
> >
> >> ** Using solrcloud 4.4.0 **
> >>
> >> I had to kill a running solrcloud node.  There is still a replica for that
> >> shard, so everything is functional.  We've done some indexing while the
> >> node was killed.
> >>
> >> I'd like to bring back up the downed node and have it resync from the
> >> other
> >> replica.  But when I restart the downed node, it joins back up as active
> >> immediately, and doesn't resync.  I even wiped the data directory on the
> >> downed node, hoping that would force it to sync on restart, but it
> >> doesn't.
> >>
> >> I'm assuming this is related to the state still being listed as active in
> >> clusterstate.json for the downed node?  Since it comes back as active, it's
> >> serving queries and giving old results.
> >>
> >> How can I force this node to do a recovery on restart?
> >>
> >
> > This might be completely wrong, but hopefully it will help you: Perhaps a
> > graceful stop of that node will result in the proper clusterstate so it
> > will work the next time it's started? That may already be what you've done,
> > so this may not help at all ... but you did say "kill" which might mean
> > that it wasn't a clean shutdown of Solr.
> >
> > Thanks,
> > Shawn
> >
> >
>

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Greg Preston <gp...@marinsoftware.com>.
Thanks for the idea.  I tried it, and the state for the bad node, even
after an orderly shutdown, is still "active" in clusterstate.json.  I see
this in the logs on restart:

[28 Jan 2014 18:25:29] [RecoveryThread] ERROR (org.apache.solr.common.SolrException) - Error while trying to recover. core=marin:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for truffle-solr-4:8983_solr but I still do not see the requested state. I see state: active live:true
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)





-Greg


On Tue, Jan 28, 2014 at 9:53 AM, Shawn Heisey <so...@elyograg.org> wrote:

> On 1/28/2014 10:31 AM, Greg Preston wrote:
>
>> ** Using solrcloud 4.4.0 **
>>
>> I had to kill a running solrcloud node.  There is still a replica for that
>> shard, so everything is functional.  We've done some indexing while the
>> node was killed.
>>
>> I'd like to bring back up the downed node and have it resync from the
>> other
>> replica.  But when I restart the downed node, it joins back up as active
>> immediately, and doesn't resync.  I even wiped the data directory on the
>> downed node, hoping that would force it to sync on restart, but it
>> doesn't.
>>
>> I'm assuming this is related to the state still being listed as active in
>> clusterstate.json for the downed node?  Since it comes back as active,
>> it's
>> serving queries and giving old results.
>>
>> How can I force this node to do a recovery on restart?
>>
>
> This might be completely wrong, but hopefully it will help you: Perhaps a
> graceful stop of that node will result in the proper clusterstate so it
> will work the next time it's started? That may already be what you've done,
> so this may not help at all ... but you did say "kill" which might mean
> that it wasn't a clean shutdown of Solr.
>
> Thanks,
> Shawn
>
>

Re: Dead node, but clusterstate.json says active, won't sync on restart

Posted by Shawn Heisey <so...@elyograg.org>.
On 1/28/2014 10:31 AM, Greg Preston wrote:
> ** Using solrcloud 4.4.0 **
>
> I had to kill a running solrcloud node.  There is still a replica for that
> shard, so everything is functional.  We've done some indexing while the
> node was killed.
>
> I'd like to bring back up the downed node and have it resync from the other
> replica.  But when I restart the downed node, it joins back up as active
> immediately, and doesn't resync.  I even wiped the data directory on the
> downed node, hoping that would force it to sync on restart, but it doesn't.
>
> I'm assuming this is related to the state still being listed as active in
> clusterstate.json for the downed node?  Since it comes back as active, it's
> serving queries and giving old results.
>
> How can I force this node to do a recovery on restart?

This might be completely wrong, but hopefully it will help you: Perhaps 
a graceful stop of that node will result in the proper clusterstate so 
it will work the next time it's started? That may already be what you've 
done, so this may not help at all ... but you did say "kill" which might 
mean that it wasn't a clean shutdown of Solr.

Thanks,
Shawn